Reductions in Learning, Decision Lists, and Hardness for 3‑Term DNF

1. Occam’s razor view: sample complexity from description length

Let \(\mathcal C_n\) be a hypothesis class over \(n\) Boolean variables.

  • Let \(B(n)\) be an upper bound (in bits) on the representation size of hypotheses in \(\mathcal C_n\).
  • Then (via an Occam-style argument), \(m\) i.i.d. labeled examples suffice to learn with error \(\epsilon\) and failure probability \(\delta\) provided
\[ m = O\Big(\frac{1}{\epsilon}\big(B(n) + \log \tfrac{1}{\delta}\big)\Big). \]

Intuition: smaller description length \(\Rightarrow\) fewer hypotheses to union-bound over \(\Rightarrow\) fewer samples needed.
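
For concreteness, the Occam bound comes from a union bound over at most \(2^{B(n)}\) hypotheses; in its standard explicit (realizable-case) form it suffices to take \(m \ge \frac{1}{\epsilon}\big(B(n)\ln 2 + \ln\tfrac{1}{\delta}\big)\). Below is a minimal Python sketch of this calculation; the exact constant depends on which version of the bound one uses, and the conjunction example in the comment (with \(B(n)=2n\) presence bits) is only illustrative.

```python
import math

def occam_sample_size(bits: int, eps: float, delta: float) -> int:
    """Sample size sufficient under the finite-class / Occam bound.

    Uses m >= (1/eps) * (B(n) * ln 2 + ln(1/delta)), i.e. a union bound
    over at most 2^B(n) hypotheses in the realizable setting.
    """
    return math.ceil((bits * math.log(2) + math.log(1 / delta)) / eps)

# Example: conjunctions over n = 50 variables encoded with B(n) = 2n = 100
# presence bits, eps = 0.1, delta = 0.05.
print(occam_sample_size(bits=100, eps=0.1, delta=0.05))  # -> 724
```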

Examples

  • Conjunctions have representation length \(O(n)\).
  • The Elimination algorithm solves \(\mathrm{Consis}_{\mathcal C_n}\) for conjunctions in polynomial time (so consistency for conjunctions is “easy” computationally).
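
As a quick illustration of why conjunction consistency is computationally easy, here is a minimal sketch of the Elimination algorithm; the encoding of a literal as a pair (index, sign) is an illustrative choice.

```python
def eliminate_conjunction(sample):
    """Return a conjunction of literals consistent with `sample`, or None.

    `sample` is a list of (x, b) pairs with x a 0/1 tuple and b a 0/1 label.
    A literal (i, s) is satisfied by x exactly when x[i] == s.
    """
    if not sample:
        return []
    n = len(sample[0][0])
    # Start from the conjunction of all 2n literals; each positive example
    # eliminates the literals it falsifies.
    literals = {(i, s) for i in range(n) for s in (0, 1)}
    for x, b in sample:
        if b == 1:
            literals -= {(i, 1 - x[i]) for i in range(n)}
    # The survivor is the most specific conjunction consistent with the
    # positives, so a consistent conjunction exists iff it rejects every negative.
    for x, b in sample:
        if b == 0 and all(x[i] == s for i, s in literals):
            return None
    return sorted(literals)
```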

2. Reductions between consistency problems

Goal: show how an algorithm for consistency in one class gives an algorithm for consistency in another class.

2.1 Reducing disjunction-consistency to conjunction-consistency

We want a reduction

\[ \mathrm{Consis}_{\mathrm{Disj}_n} \le_p \mathrm{Consis}_{\mathrm{Conj}_n}. \]

Setup. We are given labeled examples

\[ \big(\mathbf x^{(1)}, b^{(1)}\big), \dots, \big(\mathbf x^{(m)}, b^{(m)}\big) \]

and we want a disjunction of literals consistent with them.

Reduction idea: negate the labels. Define a new sample

  • \(\tilde{\mathbf x}^{(i)} := \mathbf x^{(i)}\)
  • \(\tilde b^{(i)} := \neg b^{(i)}\) (i.e., flip 0/1)

Suppose the original labels are explained by some disjunction

\[ h(\mathbf x) = \ell_1(\mathbf x) \vee \cdots \vee \ell_k(\mathbf x). \]

Then, for every example \(i\):

\[ \tilde b^{(i)} = \neg b^{(i)} = \neg h(\mathbf x^{(i)}) = \neg\ell_1(\mathbf x^{(i)}) \wedge \cdots \wedge \neg\ell_k(\mathbf x^{(i)}). \]

So the negated-label dataset is consistent with the conjunction

\[ g(\mathbf x) = \neg\ell_1(\mathbf x) \wedge \cdots \wedge \neg\ell_k(\mathbf x). \]

Conversely, any conjunction \(g\) consistent with the negated-label sample yields a disjunction \(h = \neg g\) consistent with the original sample.

Conclusion: If we can solve \(\mathrm{Consis}_{\mathrm{Conj}_n}\) in polytime, we can also solve \(\mathrm{Consis}_{\mathrm{Disj}_n}\) in polytime.
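
In code, the reduction is just a label flip followed by a De Morgan step. A minimal sketch, where `conj_solver` stands for any conjunction‑consistency routine (for instance the Elimination sketch above) and literals are (index, sign) pairs:

```python
def disjunction_consistency(sample, conj_solver):
    """Reduce Consis_Disj to Consis_Conj: flip labels, solve, apply De Morgan.

    `conj_solver` maps a labeled sample to a consistent conjunction, given as
    a list of (index, sign) literals, or None if none exists.
    """
    flipped = [(x, 1 - b) for x, b in sample]   # negate every label
    g = conj_solver(flipped)                    # conjunction for the flipped sample
    if g is None:
        return None                             # then no disjunction fits the original sample
    # h = NOT g: negating each literal of the conjunction gives a disjunction.
    return [(i, 1 - s) for i, s in g]
```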


3. Decision lists

3.1 Definition

A decision list is like a decision tree where, at each internal node, at least one outgoing branch goes immediately to a leaf.

Equivalent (and most convenient) view: a linear sequence of tests (“if–else if–else if–...”) on literals.

A typical decision list:

  • if \(\ell_1(\mathbf x)=1\) output \(b_1\)
  • else if \(\ell_2(\mathbf x)=1\) output \(b_2\)
  • ...
  • else output \(b_0\) (default)

Convention noted: the branch where the literal is satisfied is the one that goes “to the leaf” (i.e., causes an immediate decision).

3.2 Example

A chain-like picture was drawn (roughly)

\[ X_1 \to \neg X_2 \to X_3 \to X_4 \to 1 \]

with example labels such as

  • \((1,0,0,0) \mapsto 0\)
  • \((1,0,1,0) \mapsto 1\)

(The point was to illustrate the sequential “first satisfied literal wins” behavior.)
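
To make the “first satisfied literal wins” semantics concrete, here is a minimal evaluator; the (index, sign) encoding of literals and the tiny example list are illustrative.

```python
def eval_decision_list(nodes, default, x):
    """Evaluate a decision list on the 0/1 vector x.

    `nodes` is a list of ((index, sign), output) pairs; a literal (i, s) is
    satisfied when x[i] == s.  The first satisfied literal decides the output;
    if none is satisfied, `default` is returned.
    """
    for (i, s), out in nodes:
        if x[i] == s:
            return out
    return default

# "if x1 output 1, else if NOT x2 output 0, else output 1"
nodes = [((0, 1), 1), ((1, 0), 0)]
print(eval_decision_list(nodes, default=1, x=(0, 0, 1)))  # -> 0 (second node fires)
```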


4. Consistency for decision lists: the FindList algorithm

We want to solve

\[ \mathrm{Consis}_{\mathrm{DL}(n)} \]

given a labeled sample \(S\).

4.1 Useful literals

For a set of labeled examples \(S\) and a literal \(\ell\), define

\[ S_\ell := \lbrace (\mathbf x,b)\in S : \ell(\mathbf x)=1\rbrace . \]

We say \(\ell\) is useful for \(S\) if

  1. \(S_\ell \neq \varnothing\), and
  2. every \((\mathbf x,b)\in S_\ell\) has the same label \(b\).

So a useful literal “catches” at least one remaining example, and all caught examples share the same label.

4.2 FindList(S)

Input: labeled sample \(S\).

  1. Initialize the decision list \(h \leftarrow\) empty.
  2. While not all examples remaining in \(S\) have the same label:
      1. Iterate over the \(2n\) literals until a useful literal \(\ell\) is found.
      2. Let \(b\) be the common label of the examples in \(S_\ell\).
      3. Append the pair \((\ell, b)\) as the next node of the decision list.
      4. Update \(S \leftarrow S \setminus S_\ell\).
  3. Add a final default leaf labeled with the common label of the remaining set \(S\).
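
A minimal Python sketch of FindList, using the same (index, sign) encoding of literals as in the earlier sketches:

```python
def find_list(sample):
    """Greedy FindList: return (nodes, default) consistent with `sample`,
    or None if at some step no useful literal exists.

    `sample` is a list of (x, b) pairs with x a 0/1 tuple and b a 0/1 label;
    a literal (i, s) is satisfied by x when x[i] == s.
    """
    if not sample:
        return [], 0
    S = list(sample)
    n = len(S[0][0])
    nodes = []
    while len({b for _, b in S}) > 1:                # labels are still mixed
        for i, s in ((i, s) for i in range(n) for s in (0, 1)):
            caught = [b for x, b in S if x[i] == s]
            if caught and len(set(caught)) == 1:     # (i, s) is a useful literal
                nodes.append(((i, s), caught[0]))
                S = [(x, b) for x, b in S if x[i] != s]
                break
        else:
            return None                              # no useful literal: no consistent list
    default = S[0][1] if S else 0                    # common label of what remains
    return nodes, default
```

Combined with the evaluator sketched in Section 3, the output can be sanity‑checked: `eval_decision_list(nodes, default, x) == b` should hold for every `(x, b)` in the sample.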

4.3 Running time and size bounds

  • Theorem: FindList solves \(\mathrm{Consis}_{\mathrm{DL}(n)}\) in time
\[ O(m n^2) \]

(where \(m=|S|\), and \(n\) is the number of attributes).

  • The produced decision list has at most \(n\) nodes (plus the default leaf).
  • There are only \(2n\) possible literals (\(X_i\) or \(\neg X_i\)), and the algorithm never reuses a literal: once \((\ell,b)\) is appended, no remaining example satisfies \(\ell\). In fact, at most one of the pair \(X_i,\neg X_i\) is ever appended: after \(X_i\) is used, every remaining example has \(x_i=0\), so \(\neg X_i\) catches all remaining examples and can be useful only when they already share a label, i.e., when the loop is about to terminate.

  • Representation/description length comment:
      • A literal can be encoded using roughly \(\log_2 n\) bits for the index plus 1 bit for negation.
      • Each node also stores a 1‑bit output label.
      • So a decision list with \(O(n)\) nodes has description length \(B(n) = O(n\log n)\) bits.
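
Plugging \(B(n)=O(n\log n)\) into the Occam bound from Section 1 gives

\[ m = O\Big(\frac{1}{\epsilon}\big(n\log n + \log \tfrac{1}{\delta}\big)\Big), \]

so FindList together with this sample bound makes decision lists efficiently PAC‑learnable.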

5. Why FindList works (existence of a useful literal)

Key question: why is there a useful literal at each iteration?

Assume there exists a decision list \(c\) that correctly labels all examples in the original set \(S\).

Let \(\ell^*\) be the first literal in \(c\) that is not yet in the partially‑built list \(h\).

  • Every earlier literal of \(c\) already appears in \(h\), and since \(S\) only shrinks, no remaining example satisfies any of those earlier literals: such an example would have been removed at the step where that literal was appended.
  • Therefore any remaining example with \(\ell^*(\mathbf x)=1\) reaches the node of \(\ell^*\) when classified by \(c\), so (because \(c\) is correct on the whole original sample) all such examples carry the single output bit that \(c\) associates with \(\ell^*\).
  • If no remaining example satisfies \(\ell^*\), move on to the next literal of \(c\) not in \(h\) and repeat the argument; and if no literal of \(c\) is satisfied by any remaining example, then all remaining examples fall through to \(c\)'s default leaf, so they share a label and the loop has already terminated.

So some such \(\ell^*\) is useful for the current residual set \(S\), and FindList can continue.

This is the core correctness argument.


6. Hardness of learning: 3‑term DNF

6.1 Definition and representation size

A term is an AND of literals. A 3‑term DNF is an OR of 3 terms:

\[ T_1 \vee T_2 \vee T_3. \]

Since each term contains at most \(2n\) literals, a 3‑term DNF has representation size \(O(n\log n)\) bits.

6.2 Computational hardness connection

Unless NP has randomized polynomial‑time algorithms (i.e., unless \(\mathrm{NP}\subseteq\mathrm{RP}\)), there is no polynomial‑time PAC‑learning algorithm for 3‑term DNF that outputs its hypotheses as 3‑term DNF formulas (i.e., no efficient proper learner).

The structure is:

  1. Recall: an efficient proper PAC‑learning algorithm (run with suitable parameters) yields a randomized polynomial‑time algorithm for the corresponding consistency problem \(\mathrm{Consis}_{\mathcal C_n}\).
  2. Show \(\mathrm{Consis}_{\text{3‑term DNF}}\) is NP‑complete.

6.3 NP‑hardness via reduction from 3‑COLOR (full construction)

We reduce 3‑COLOR to consistency of 3‑term DNF.

  • Input: a graph \(G=(V,E)\) with $V=\lbrace 1,\dots,n\rbrace $.
  • Output: a labeled sample \(S_G\) over Boolean attributes \(x_1,\dots,x_n\) such that:
  • if \(G\) is 3‑colorable then \(S_G\) is consistent with some 3‑term DNF, and
  • from any 3‑term DNF consistent with \(S_G\) we can recover a proper 3‑coloring of \(G\).

High‑level correspondence: vertices \(\leftrightarrow\) attributes; colors \(\leftrightarrow\) DNF terms; “\(v\) has color \(c\)” \(\leftrightarrow\) the term for \(c\) does not mention \(x_v\).

6.3.1 The sample \(S_G\)

For each vertex \(i\in V\), add a positive example \(\mathbf v^{(i)}\):

  • \(\mathbf v^{(i)} \in \lbrace 0,1\rbrace ^n\) has \(x_i=0\) and \(x_k=1\) for all \(k\neq i\),
  • label \(b=1\).

For each edge \(\lbrace i,j\rbrace \in E\), add a negative example \(\mathbf e^{(i,j)}\):

  • \(\mathbf e^{(i,j)} \in \lbrace 0,1\rbrace ^n\) has \(x_i=x_j=0\) and \(x_k=1\) for all $k\notin\lbrace i,j\rbrace $,
  • label \(b=0\).

So the sample contains \(|V|+|E|\) examples and is computable in polynomial time.
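
A minimal sketch of the construction (vertices are assumed to be numbered \(0,\dots,n-1\) for indexing convenience):

```python
def reduction_sample(n, edges):
    """Build the labeled sample S_G from a graph on vertices 0..n-1.

    Returns (x, b) pairs: one positive example per vertex (all ones except a
    0 at that vertex) and one negative example per edge (all ones except 0s
    at both endpoints).
    """
    sample = []
    for i in range(n):
        sample.append((tuple(0 if k == i else 1 for k in range(n)), 1))
    for i, j in edges:
        sample.append((tuple(0 if k in (i, j) else 1 for k in range(n)), 0))
    return sample
```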

6.3.2 “If” direction: a 3‑coloring \(\Rightarrow\) a consistent 3‑term DNF

Assume \(G\) has a proper 3‑coloring $\chi:V\to\lbrace 1,2,3\rbrace $.

For each color $c\in\lbrace 1,2,3\rbrace $ define a term (a conjunction of positive literals)

\[ T_c(\mathbf x) := \bigwedge_{i\in V:\,\chi(i)\neq c} x_i. \]

Define the hypothesis

\[ h(\mathbf x) := T_1(\mathbf x)\ \vee\ T_2(\mathbf x)\ \vee\ T_3(\mathbf x). \]

Now check the labels:

  • Vertex examples. Fix vertex \(i\) with \(\chi(i)=c\). In \(T_c\) the variable \(x_i\) is absent (because \(\chi(i)=c\)), while every variable that does appear in \(T_c\) is 1 on \(\mathbf v^{(i)}\). Hence \(T_c(\mathbf v^{(i)})=1\), so \(h(\mathbf v^{(i)})=1\) as required.

  • Edge examples. Fix edge \(\lbrace i,j\rbrace \in E\) with (say) \(\chi(i)=c\) and \(\chi(j)=c'\), where \(c\neq c'\).

  • In term \(T_c\), vertex \(j\) is not color \(c\), so \(x_j\) appears in \(T_c\); but on \(\mathbf e^{(i,j)}\) we have \(x_j=0\), so \(T_c(\mathbf e^{(i,j)})=0\).
  • Symmetrically, \(T_{c'}\) contains \(x_i\), so \(T_{c'}(\mathbf e^{(i,j)})=0\).
  • For the third color $c''\notin\lbrace c,c'\rbrace $, both endpoints are “not \(c''\)”, so both \(x_i\) and \(x_j\) appear in \(T_{c''}\), and since both are 0 on \(\mathbf e^{(i,j)}\), we also get \(T_{c''}(\mathbf e^{(i,j)})=0\).

Thus all three terms are 0 on every edge example, so \(h(\mathbf e^{(i,j)})=0\), matching the label.

Therefore if \(G\) is 3‑colorable, the constructed sample is consistent with a 3‑term DNF.
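
A minimal sketch of this direction, reusing `reduction_sample` from above; the triangle graph and its coloring serve only as an illustrative input.

```python
def terms_from_coloring(n, coloring):
    """Term T_c = AND of x_i over the vertices i NOT colored c (as index sets)."""
    return {c: {i for i in range(n) if coloring[i] != c} for c in (1, 2, 3)}

def dnf_value(terms, x):
    """Evaluate the 3-term monotone DNF: OR over c of AND_{i in terms[c]} x_i."""
    return int(any(all(x[i] == 1 for i in term) for term in terms.values()))

# Illustrative check on the triangle graph, which is properly 3-colorable.
n, edges = 3, [(0, 1), (1, 2), (0, 2)]
coloring = {0: 1, 1: 2, 2: 3}
terms = terms_from_coloring(n, coloring)
assert all(dnf_value(terms, x) == b for x, b in reduction_sample(n, edges))
```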

6.3.3 “Only if” direction: a consistent 3‑term DNF \(\Rightarrow\) a 3‑coloring

Assume there exists a 3‑term DNF

\[ h(\mathbf x)=T_1(\mathbf x)\vee T_2(\mathbf x)\vee T_3(\mathbf x) \]

that is consistent with \(S_G\).

Consider any vertex \(i\). The vertex example \(\mathbf v^{(i)}\) has label 1, so at least one term, say \(T_c\), must satisfy it: \(T_c(\mathbf v^{(i)})=1\).

On \(\mathbf v^{(i)}\) we have \(x_i=0\) and \(x_k=1\) for all \(k\neq i\). Therefore:

  • \(T_c\) cannot contain the positive literal \(x_i\) (it would evaluate to 0),
  • \(T_c\) cannot contain any negated literal \(\neg x_k\) for \(k\neq i\) (since \(x_k=1\) makes \(\neg x_k=0\)).

So any term that satisfies \(\mathbf v^{(i)}\) consists of positive literals on variables in $V\setminus\lbrace i\rbrace $, and it may (or may not) also include the literal \(\neg x_i\).

Define a coloring by letting \(\chi(i)\) be the first index \(c\in\lbrace 1,2,3\rbrace \) such that \(T_c(\mathbf v^{(i)})=1\).

We claim this is a proper coloring. Suppose for contradiction there is an edge \(\lbrace i,j\rbrace \in E\) with \(\chi(i)=\chi(j)=c\).

By definition of \(\chi\), the same term \(T_c\) satisfies both vertex examples: \(T_c(\mathbf v^{(i)})=T_c(\mathbf v^{(j)})=1\). Using the constraints above:

  • From \(T_c(\mathbf v^{(i)})=1\), \(T_c\) cannot contain \(x_i\) and cannot contain any \(\neg x_k\) for \(k\neq i\).
  • From \(T_c(\mathbf v^{(j)})=1\), \(T_c\) cannot contain \(x_j\) and cannot contain any \(\neg x_k\) for \(k\neq j\).

Combining these, \(T_c\) contains no negated literals at all (since any \(\neg x_k\) would violate at least one of the two examples), and it contains neither \(x_i\) nor \(x_j\). Therefore every literal in \(T_c\) is a positive variable \(x_k\) with $k\notin\lbrace i,j\rbrace $.

But the edge example \(\mathbf e^{(i,j)}\) sets every such \(x_k\) to 1, so \(T_c(\mathbf e^{(i,j)})=1\), hence \(h(\mathbf e^{(i,j)})=1\)—contradicting that \(\mathbf e^{(i,j)}\) is labeled 0 in \(S_G\).

Therefore adjacent vertices cannot share the same color, and \(\chi\) is a proper 3‑coloring.
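
The decoding used in this direction is equally mechanical. A minimal sketch, assuming each term is given as a set of (index, sign) literals and that the DNF really is consistent with \(S_G\):

```python
def coloring_from_dnf(n, terms):
    """Recover chi(i) as the index of the first term satisfied by v^(i).

    `terms` is a list of three terms, each a set of (index, sign) literals;
    a literal (k, s) is satisfied by x when x[k] == s.  Consistency with S_G
    guarantees that some term satisfies every vertex example.
    """
    coloring = {}
    for i in range(n):
        v = tuple(0 if k == i else 1 for k in range(n))          # vertex example v^(i)
        coloring[i] = next(c for c, term in enumerate(terms, start=1)
                           if all(v[k] == s for k, s in term))
    return coloring
```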

6.3.4 NP‑completeness conclusion

  • Membership in NP: guess the three terms and verify they match all labels in polynomial time.
  • NP‑hardness: shown by the reduction above.

Hence \(\mathrm{Consis}_{\text{3‑term DNF}}\) is NP‑complete. Consequently, an efficient proper PAC‑learning algorithm for 3‑term DNF would (via step 1 above) yield a randomized polynomial‑time algorithm for an NP‑complete problem, i.e., \(\mathrm{NP}\subseteq\mathrm{RP}\), which is considered unlikely.