Reductions in Learning, Decision Lists, and Hardness for 3‑Term DNF
1. Occam’s razor view: sample complexity from description length
Let \(\mathcal C_n\) be a hypothesis class over \(n\) Boolean variables.
- Let \(B(n)\) be an upper bound (in bits) on the representation size of hypotheses in \(\mathcal C_n\).
- Then (via an Occam-style argument), \(m\) i.i.d. labeled examples suffice to learn with error \(\epsilon\) and failure probability \(\delta\) provided
  \[
  m \;\ge\; \frac{1}{\epsilon}\left(B(n)\ln 2 + \ln\frac{1}{\delta}\right).
  \]
Intuition: smaller description length \(\Rightarrow\) fewer hypotheses to union-bound over \(\Rightarrow\) fewer samples needed.
Examples
- Conjunctions have representation length \(O(n)\).
- The Elimination algorithm solves \(\mathrm{Consis}_{\mathrm{Conj}_n}\) in polynomial time (so consistency for conjunctions is “easy” computationally).
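As a quick numerical illustration (not from the lecture), the following Python snippet evaluates the Occam bound above for conjunctions; the encoding size \(B(n)\approx n\log_2 3\) bits (each variable appears positively, negatively, or not at all) is one reasonable convention chosen here for concreteness.

```python
import math

def occam_sample_bound(bits, epsilon, delta):
    """Occam bound: m >= (1/epsilon) * (B(n) * ln 2 + ln(1/delta))."""
    return math.ceil((bits * math.log(2) + math.log(1 / delta)) / epsilon)

# Conjunctions over n variables: each variable is positive, negated, or absent,
# so there are 3^n conjunctions and B(n) ~ n * log2(3) bits (illustrative encoding).
n = 20
bits = math.ceil(n * math.log2(3))
print(occam_sample_bound(bits, epsilon=0.1, delta=0.05))  # -> 252
```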
2. Reductions between consistency problems
Goal: show how an algorithm for consistency in one class gives an algorithm for consistency in another class.
2.1 Reducing disjunction-consistency to conjunction-consistency
We want a reduction from \(\mathrm{Consis}_{\mathrm{Disj}_n}\) to \(\mathrm{Consis}_{\mathrm{Conj}_n}\): an efficient algorithm for conjunction-consistency should give an efficient algorithm for disjunction-consistency.
Setup. We are given labeled examples
\[
S = \bigl(\mathbf x^{(1)}, b^{(1)}\bigr), \dots, \bigl(\mathbf x^{(m)}, b^{(m)}\bigr), \qquad \mathbf x^{(i)} \in \lbrace 0,1\rbrace^n,\; b^{(i)} \in \lbrace 0,1\rbrace,
\]
and we want a disjunction of literals consistent with them.
Reduction idea: negate the labels. Define a new sample
- \(\tilde{\mathbf x}^{(i)} := \mathbf x^{(i)}\)
- \(\tilde b^{(i)} := \neg b^{(i)}\) (i.e., flip 0/1)
Suppose the original labels are explained by some disjunction
\[
h(\mathbf x) \;=\; \ell_1 \vee \ell_2 \vee \cdots \vee \ell_k .
\]
Then, for every example \(i\):
\[
\tilde b^{(i)} \;=\; \neg b^{(i)} \;=\; \neg h(\mathbf x^{(i)}) \;=\; \neg\ell_1(\mathbf x^{(i)}) \wedge \cdots \wedge \neg\ell_k(\mathbf x^{(i)}) \qquad \text{(De Morgan)}.
\]
So the negated-label dataset is consistent with the conjunction
\[
g(\mathbf x) \;=\; \neg\ell_1(\mathbf x) \wedge \neg\ell_2(\mathbf x) \wedge \cdots \wedge \neg\ell_k(\mathbf x).
\]
Conversely, any conjunction \(g\) consistent with the negated-label sample yields a disjunction \(h = \neg g\) consistent with the original sample.
Conclusion: If we can solve \(\mathrm{Consis}_{\mathrm{Conj}_n}\) in polytime, we can also solve \(\mathrm{Consis}_{\mathrm{Disj}_n}\) in polytime.
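As a concrete sketch of this reduction (illustrative, not the lecture's code): literals are encoded as pairs `(index, sign)`, and `conj_consis` stands for any assumed solver for conjunction-consistency that returns a consistent conjunction as a set of literals, or `None` if none exists.

```python
def disj_consis(sample, conj_consis):
    """Reduce disjunction-consistency to conjunction-consistency.

    `sample` is a list of (x, b) with x a 0/1 tuple and b in {0, 1};
    `conj_consis` is an assumed conjunction-consistency solver."""
    flipped = [(x, 1 - b) for (x, b) in sample]   # negate the labels
    g = conj_consis(flipped)                      # conjunction as a set of (index, sign) literals
    if g is None:
        return None                               # no consistent conjunction => no consistent disjunction
    # De Morgan: the negation of a conjunction of literals is the disjunction of their negations.
    return {(i, not sign) for (i, sign) in g}
```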
3. Decision lists
3.1 Definition
A decision list is like a decision tree where, at each internal node, at least one outgoing branch goes immediately to a leaf.
Equivalent (and most convenient) view: a linear sequence of tests (“if–else if–else if–...”) on literals.
A typical decision list:
- if \(\ell_1(\mathbf x)=1\) output \(b_1\)
- else if \(\ell_2(\mathbf x)=1\) output \(b_2\)
- ...
- else output \(b_0\) (default)
Convention noted: the branch where the literal is satisfied is the one that goes “to the leaf” (i.e., causes an immediate decision).
3.2 Example
A chain-like decision list over four variables was sketched (figure omitted); it assigns labels such as
- \((1,0,0,0) \mapsto 0\)
- \((1,0,1,0) \mapsto 1\)
(The point was to illustrate the sequential “first satisfied literal wins” behavior.)
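To make the “first satisfied literal wins” semantics concrete, here is a small evaluator (an illustrative sketch; the particular list below is chosen to match the two labels above, not necessarily the one drawn in lecture). Literals are pairs `(index, sign)`, with `sign=True` meaning \(x_i\) and `sign=False` meaning \(\neg x_i\).

```python
def eval_decision_list(nodes, default, x):
    """Scan the (literal, output) nodes in order; the first satisfied literal decides."""
    for (i, sign), output in nodes:
        if (x[i] == 1) == sign:   # literal satisfied -> immediate decision
            return output
    return default                # no literal satisfied -> default leaf

# "if x_3 output 1; else if not-x_1 output 0; else output 0"
nodes = [((2, True), 1), ((0, False), 0)]
print(eval_decision_list(nodes, default=0, x=(1, 0, 0, 0)))  # -> 0
print(eval_decision_list(nodes, default=0, x=(1, 0, 1, 0)))  # -> 1
```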
4. Consistency for decision lists: the FindList algorithm
We want to solve \(\mathrm{Consis}_{\mathrm{DL}(n)}\): given a labeled sample \(S\), find a decision list consistent with \(S\) (or report that none exists).
4.1 Useful literals
For a set of labeled examples \(S\) and a literal \(\ell\), define
\[
S_\ell \;:=\; \lbrace (\mathbf x, b) \in S \;:\; \ell(\mathbf x) = 1 \rbrace .
\]
We say \(\ell\) is useful for \(S\) if
- \(S_\ell \neq \varnothing\), and
- every \((\mathbf x,b)\in S_\ell\) has the same label \(b\).
So a useful literal “catches” at least one remaining example, and all caught examples share the same label.
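A small Python rendering of these definitions (an illustrative sketch, using the same `(index, sign)` encoding of literals as in the earlier snippets):

```python
def S_ell(S, lit):
    """Subset of S caught by the literal: S_ell = {(x, b) in S : lit(x) = 1}."""
    i, sign = lit
    return [(x, b) for (x, b) in S if (x[i] == 1) == sign]

def is_useful(S, lit):
    """Useful: catches at least one example, and all caught examples share one label."""
    caught = S_ell(S, lit)
    return len(caught) > 0 and len({b for (_, b) in caught}) == 1
```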
4.2 FindList(S)
Input: labeled sample \(S\).
- Initialize the decision list \(h \leftarrow\) empty.
- While not all examples remaining in \(S\) have the same label:
- Iterate over literals until you find a useful literal \(\ell\).
- Let \(b\) be the common label of examples in \(S_\ell\).
- Append the pair \((\ell, b)\) as the next node of the decision list.
- Update \(S \leftarrow S \setminus S_\ell\).
- Add a final default leaf labeled with the common label of the remaining set \(S\).
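A minimal Python sketch of FindList, under the same example/literal conventions as the helpers above (an illustration of the pseudocode, not the lecture's exact code); it returns the node list and default label, or `None` if at some point no useful literal exists.

```python
def find_list(S):
    """Greedy FindList: S is a nonempty list of (x, b) with x a 0/1 tuple."""
    n = len(S[0][0])
    literals = [(i, s) for i in range(n) for s in (True, False)]
    nodes, remaining = [], list(S)
    while len({b for (_, b) in remaining}) > 1:          # labels still disagree
        for (i, sign) in literals:
            caught = [(x, b) for (x, b) in remaining if (x[i] == 1) == sign]
            labels = {b for (_, b) in caught}
            if caught and len(labels) == 1:              # found a useful literal
                nodes.append(((i, sign), labels.pop()))
                remaining = [(x, b) for (x, b) in remaining if (x[i] == 1) != sign]
                break
        else:
            return None    # no useful literal: no decision list is consistent with S
    default = remaining[0][1] if remaining else 0
    return nodes, default
```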
4.3 Running time and size bounds
- Theorem: FindList solves \(\mathrm{Consis}_{\mathrm{DL}(n)}\) in time
  \[
  O(n^2 m)
  \]
  (where \(m=|S|\) and \(n\) is the number of attributes): there are at most \(n\) iterations (one per node, see the size bound below), each iteration scans at most \(2n\) literals, and testing one literal against the remaining examples costs \(O(m)\).
- The produced decision list has at most \(n\) nodes (plus the default leaf).
  - There are only \(2n\) possible literals (\(x_i\) or \(\neg x_i\)), and the algorithm never needs to reuse a literal: once \((\ell, b)\) is appended, every example satisfying \(\ell\) has been removed. Moreover, once a node on variable \(x_i\) is added, the opposite literal is satisfied by every remaining example, so it could only serve as the final default; hence each variable contributes at most one node.
- Representation/description length comment:
- A literal can be encoded using roughly \(\log_2 n\) bits for the index plus 1 bit for negation.
- Each node also stores a 1‑bit output label.
- So a decision list of length \(O(n)\) has description length \(O(n\log n)\) bits.
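Combining this description length with the Occam bound of Section 1 (taking \(B(n)=O(n\log n)\)) gives, roughly, a polynomial sample complexity for decision lists:
\[
m \;=\; O\!\left(\frac{1}{\epsilon}\Bigl(n\log n + \log\frac{1}{\delta}\Bigr)\right).
\]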
5. Why FindList works (existence of a useful literal)
Key question: why is there a useful literal at each iteration?
Assume there exists a decision list \(c\) that correctly labels all examples in the original set \(S\).
Let \(\ell^*\) be the first literal of \(c\) that is not yet in the partially-built list \(h\) and is satisfied by at least one remaining example.
- Every literal of \(c\) appearing before \(\ell^*\) is either already in \(h\) (so the examples it captures were removed in an earlier iteration, and no remaining example satisfies it) or is satisfied by no remaining example.
- Therefore \(c\) classifies every remaining example with \(\ell^*(\mathbf x)=1\) according to the output bit associated with \(\ell^*\) in \(c\); in particular, all such examples share that label.
- By the choice of \(\ell^*\), at least one remaining example satisfies it.
- (If no such \(\ell^*\) existed, no remaining example would satisfy any literal of \(c\), so \(c\) would assign every remaining example its default label; then all remaining labels would agree and the while-loop would already have terminated.)
So \(\ell^*\) is useful for the current residual set \(S\), and FindList can continue.
This is the core correctness argument.
6. Hardness of learning: 3‑term DNF
6.1 Definition and representation size
A term is an AND of literals. A 3‑term DNF is an OR of 3 terms:
\[
h \;=\; T_1 \vee T_2 \vee T_3 ,
\]
where each \(T_c\) is a conjunction of literals over \(x_1,\dots,x_n\). Since each term contains at most \(2n\) literals, the representation length is \(O(n)\) literals, i.e. \(O(n\log n)\) bits.
6.2 Computational hardness connection
Unless \(\mathrm{NP} \subseteq \mathrm{RP}\) (i.e., unless every problem in NP has a randomized polynomial-time algorithm), there is no efficient PAC-learning algorithm for 3-term DNF that outputs 3-term DNF hypotheses.
The structure is:
- Recall: an efficient PAC-learning algorithm that outputs hypotheses in \(\mathcal C_n\) (for suitable parameters) yields a randomized polynomial-time algorithm for the corresponding consistency problem \(\mathrm{Consis}_{\mathcal C_n}\).
- Show \(\mathrm{Consis}_{\text{3‑term DNF}}\) is NP‑complete.
6.3 NP‑hardness via reduction from 3‑COLOR (full construction)
We reduce 3‑COLOR to consistency of 3‑term DNF.
- Input: a graph \(G=(V,E)\) with $V=\lbrace 1,\dots,n\rbrace $.
- Output: a labeled sample \(S_G\) over Boolean attributes \(x_1,\dots,x_n\) such that:
- if \(G\) is 3‑colorable then \(S_G\) is consistent with some 3‑term DNF, and
- from any 3‑term DNF consistent with \(S_G\) we can recover a proper 3‑coloring of \(G\).
High‑level correspondence: vertices \(\leftrightarrow\) attributes; colors \(\leftrightarrow\) DNF terms; “\(v\) has color \(c\)” \(\leftrightarrow\) the term for \(c\) does not mention \(x_v\).
6.3.1 The sample \(S_G\)
For each vertex \(i\in V\), add a positive example \(\mathbf v^{(i)}\):
- \(\mathbf v^{(i)} \in \lbrace 0,1\rbrace ^n\) has \(x_i=0\) and \(x_k=1\) for all \(k\neq i\),
- label \(b=1\).
For each edge \(\lbrace i,j\rbrace \in E\), add a negative example \(\mathbf e^{(i,j)}\):
- \(\mathbf e^{(i,j)} \in \lbrace 0,1\rbrace ^n\) has \(x_i=x_j=0\) and \(x_k=1\) for all $k\notin\lbrace i,j\rbrace $,
- label \(b=0\).
So the sample contains \(|V|+|E|\) examples and is computable in polynomial time.
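The construction is easy to carry out mechanically; here is an illustrative Python sketch (function name is my own) that builds \(S_G\) for a graph on vertices \(0,\dots,n-1\).

```python
def build_sample(n, edges):
    """Build S_G: one positive example per vertex, one negative example per edge."""
    S = []
    for i in range(n):                                   # vertex examples v^(i), label 1
        S.append((tuple(0 if k == i else 1 for k in range(n)), 1))
    for (i, j) in edges:                                 # edge examples e^(i,j), label 0
        S.append((tuple(0 if k in (i, j) else 1 for k in range(n)), 0))
    return S

# A triangle: 3 vertex examples + 3 edge examples.
print(build_sample(3, [(0, 1), (1, 2), (0, 2)]))
```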
6.3.2 “If” direction: a 3‑coloring \(\Rightarrow\) a consistent 3‑term DNF
Assume \(G\) has a proper 3‑coloring $\chi:V\to\lbrace 1,2,3\rbrace $.
For each color $c\in\lbrace 1,2,3\rbrace $ define a term (a conjunction of positive literals)
\[
T_c \;:=\; \bigwedge_{k \in V \,:\, \chi(k) \neq c} x_k ,
\]
i.e., \(T_c\) contains \(x_k\) exactly when vertex \(k\) is not colored \(c\). Define the hypothesis
\[
h \;:=\; T_1 \vee T_2 \vee T_3 .
\]
Now check the labels:
- Vertex examples. Fix vertex \(i\) with \(\chi(i)=c\). In \(T_c\) the variable \(x_i\) is absent (because \(\chi(i)=c\)), while every variable that does appear in \(T_c\) is 1 on \(\mathbf v^{(i)}\). Hence \(T_c(\mathbf v^{(i)})=1\), so \(h(\mathbf v^{(i)})=1\) as required.
- Edge examples. Fix edge \(\lbrace i,j\rbrace \in E\) with (say) \(\chi(i)=c\) and \(\chi(j)=c'\), where \(c\neq c'\).
- In term \(T_c\), vertex \(j\) is not color \(c\), so \(x_j\) appears in \(T_c\); but on \(\mathbf e^{(i,j)}\) we have \(x_j=0\), so \(T_c(\mathbf e^{(i,j)})=0\).
- Symmetrically, \(T_{c'}\) contains \(x_i\), so \(T_{c'}(\mathbf e^{(i,j)})=0\).
- For the third color $c''\notin\lbrace c,c'\rbrace $, both endpoints are “not \(c''\)”, so both \(x_i\) and \(x_j\) appear in \(T_{c''}\), and since both are 0 on \(\mathbf e^{(i,j)}\), we also get \(T_{c''}(\mathbf e^{(i,j)})=0\).
Thus all three terms are 0 on every edge example, so \(h(\mathbf e^{(i,j)})=0\), matching the label.
Therefore if \(G\) is 3‑colorable, the constructed sample is consistent with a 3‑term DNF.
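For intuition, the following self-contained sketch builds the three terms from a proper coloring of a triangle and checks them against \(S_G\) (all names are illustrative; terms are represented as sets of positively occurring variable indices).

```python
def terms_from_coloring(n, coloring):
    """Term c contains x_k exactly when vertex k is NOT colored c (colors are 0, 1, 2)."""
    return [{k for k in range(n) if coloring[k] != c} for c in range(3)]

def eval_3term_dnf(terms, x):
    return 1 if any(all(x[k] == 1 for k in t) for t in terms) else 0

n, edges = 3, [(0, 1), (1, 2), (0, 2)]
S = ([(tuple(0 if k == i else 1 for k in range(n)), 1) for i in range(n)]      # vertex examples
     + [(tuple(0 if k in e else 1 for k in range(n)), 0) for e in edges])      # edge examples
terms = terms_from_coloring(n, coloring=[0, 1, 2])      # a proper coloring of the triangle
print(all(eval_3term_dnf(terms, x) == b for (x, b) in S))  # -> True
```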
6.3.3 “Only if” direction: a consistent 3‑term DNF \(\Rightarrow\) a 3‑coloring
Assume there exists a 3‑term DNF
\[
h \;=\; T_1 \vee T_2 \vee T_3
\]
that is consistent with \(S_G\).
Consider any vertex \(i\). The vertex example \(\mathbf v^{(i)}\) has label 1, so at least one term, say \(T_c\), must satisfy it: \(T_c(\mathbf v^{(i)})=1\).
On \(\mathbf v^{(i)}\) we have \(x_i=0\) and \(x_k=1\) for all \(k\neq i\). Therefore:
- \(T_c\) cannot contain the positive literal \(x_i\) (it would evaluate to 0),
- \(T_c\) cannot contain any negated literal \(\neg x_k\) for \(k\neq i\) (since \(x_k=1\) makes \(\neg x_k=0\)).
So any term that satisfies \(\mathbf v^{(i)}\) consists of positive literals on variables in $V\setminus\lbrace i\rbrace $, and it may (or may not) also include the literal \(\neg x_i\).
Define a coloring by letting
- \(\chi(i)\) be the first index $c\in\lbrace 1,2,3\rbrace $ such that \(T_c(\mathbf v^{(i)})=1\).

We claim this is a proper coloring. Suppose for contradiction there is an edge \(\lbrace i,j\rbrace \in E\) with \(\chi(i)=\chi(j)=c\).
By definition of \(\chi\), the same term \(T_c\) satisfies both vertex examples: \(T_c(\mathbf v^{(i)})=T_c(\mathbf v^{(j)})=1\). Using the constraints above:
- From \(T_c(\mathbf v^{(i)})=1\), \(T_c\) cannot contain \(x_i\) and cannot contain any \(\neg x_k\) for \(k\neq i\).
- From \(T_c(\mathbf v^{(j)})=1\), \(T_c\) cannot contain \(x_j\) and cannot contain any \(\neg x_k\) for \(k\neq j\).
Combining these, \(T_c\) contains no negated literals at all (since any \(\neg x_k\) would violate at least one of the two examples), and it contains neither \(x_i\) nor \(x_j\). Therefore every literal in \(T_c\) is a positive variable \(x_k\) with $k\notin\lbrace i,j\rbrace $.
But the edge example \(\mathbf e^{(i,j)}\) sets every such \(x_k\) to 1, so \(T_c(\mathbf e^{(i,j)})=1\), hence \(h(\mathbf e^{(i,j)})=1\)—contradicting that \(\mathbf e^{(i,j)}\) is labeled 0 in \(S_G\).
Therefore adjacent vertices cannot share the same color, and \(\chi\) is a proper 3‑coloring.
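A sketch of this decoding step (illustrative names; here a term is a set of `(index, sign)` literals, since a consistent term may in principle contain \(\neg x_i\), as noted above):

```python
def term_satisfied(term, x):
    """Evaluate a conjunction of (index, sign) literals on a 0/1 tuple x."""
    return all((x[i] == 1) == sign for (i, sign) in term)

def coloring_from_terms(n, terms):
    """Color vertex i by the first term satisfied by its vertex example v^(i)."""
    coloring = []
    for i in range(n):
        v = tuple(0 if k == i else 1 for k in range(n))
        coloring.append(next(c for c, t in enumerate(terms) if term_satisfied(t, v)))
    return coloring

# Triangle example, using the terms from the "if" direction written as literal sets.
terms = [{(1, True), (2, True)}, {(0, True), (2, True)}, {(0, True), (1, True)}]
print(coloring_from_terms(3, terms))   # -> [0, 1, 2], a proper 3-coloring
```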
6.3.4 NP‑completeness conclusion
- Membership in NP: guess the three terms and verify they match all labels in polynomial time.
- NP‑hardness: shown by the reduction above.
Hence \(\mathrm{Consis}_{\text{3‑term DNF}}\) is NP‑complete. Consequently, an efficient PAC‑learning algorithm for 3‑term DNF that outputs 3‑term DNF hypotheses would yield a randomized polynomial‑time algorithm for an NP‑complete problem, i.e. \(\mathrm{NP} \subseteq \mathrm{RP}\), which is considered unlikely.