Reductions in Learning, Decision Lists, and Hardness for 3‑Term DNF
1. Occam’s razor view: sample complexity from description length
Let \(\mathcal C_n\) be a hypothesis class over \(n\) Boolean variables.
- Let \(B(n)\) be an upper bound (in bits) on the representation size of hypotheses in \(\mathcal C_n\).
- Then (via an Occam-style argument), \(m\) i.i.d. labeled examples suffice to learn with error \(\epsilon\) and failure probability \(\delta\) provided
  \[
  m \;\ge\; \frac{1}{\epsilon}\left(B(n)\ln 2 + \ln\frac{1}{\delta}\right).
  \]
Intuition: smaller description length \(\Rightarrow\) fewer hypotheses to union-bound over \(\Rightarrow\) fewer samples needed.
Examples
- Conjunctions have representation length \(O(n)\).
- The Elimination algorithm solves \(\mathrm{Consis}_{\mathrm{Conj}_n}\) in polynomial time (so consistency for conjunctions is “easy” computationally).
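As a quick numerical illustration (not from the lecture), the following Python snippet evaluates the Occam bound above for conjunctions; the encoding size \(B(n)\approx n\log_2 3\) bits (each variable appears positively, negatively, or not at all) is one reasonable convention chosen here for concreteness.

```python
import math

def occam_sample_bound(bits, epsilon, delta):
    """Occam bound: m >= (1/epsilon) * (B(n) * ln 2 + ln(1/delta))."""
    return math.ceil((bits * math.log(2) + math.log(1 / delta)) / epsilon)

# Conjunctions over n variables: each variable is positive, negated, or absent,
# so there are 3^n conjunctions and B(n) ~ n * log2(3) bits (illustrative encoding).
n = 20
bits = math.ceil(n * math.log2(3))
print(occam_sample_bound(bits, epsilon=0.1, delta=0.05))  # -> 252
```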
2. Reductions between consistency problems
Goal: show how an algorithm for consistency in one class gives an algorithm for consistency in another class.
2.1 Reducing disjunction-consistency to conjunction-consistency
We want a reduction from \(\mathrm{Consis}_{\mathrm{Disj}_n}\) to \(\mathrm{Consis}_{\mathrm{Conj}_n}\): an efficient algorithm for conjunction-consistency should give an efficient algorithm for disjunction-consistency.
Setup. We are given labeled examples
\[
S = \bigl(\mathbf x^{(1)}, b^{(1)}\bigr), \dots, \bigl(\mathbf x^{(m)}, b^{(m)}\bigr), \qquad \mathbf x^{(i)} \in \lbrace 0,1\rbrace^n,\; b^{(i)} \in \lbrace 0,1\rbrace,
\]
and we want a disjunction of literals consistent with them.
Reduction idea: negate the labels. Define a new sample
- \(\tilde{\mathbf x}^{(i)} := \mathbf x^{(i)}\)
- \(\tilde b^{(i)} := \neg b^{(i)}\) (i.e., flip 0/1)
Suppose the original labels are explained by some disjunction
\[
h(\mathbf x) \;=\; \ell_1 \vee \ell_2 \vee \cdots \vee \ell_k .
\]
Then, for every example \(i\):
\[
\tilde b^{(i)} \;=\; \neg b^{(i)} \;=\; \neg h(\mathbf x^{(i)}) \;=\; \neg\ell_1(\mathbf x^{(i)}) \wedge \cdots \wedge \neg\ell_k(\mathbf x^{(i)}) \qquad \text{(De Morgan)}.
\]
So the negated-label dataset is consistent with the conjunction
\[
g(\mathbf x) \;=\; \neg\ell_1(\mathbf x) \wedge \neg\ell_2(\mathbf x) \wedge \cdots \wedge \neg\ell_k(\mathbf x).
\]
Conversely, any conjunction \(g\) consistent with the negated-label sample yields a disjunction \(h = \neg g\) consistent with the original sample.
Conclusion: If we can solve \(\mathrm{Consis}_{\mathrm{Conj}_n}\) in polytime, we can also solve \(\mathrm{Consis}_{\mathrm{Disj}_n}\) in polytime.
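As a concrete sketch of this reduction (illustrative, not the lecture's code): literals are encoded as pairs `(index, sign)`, and `conj_consis` stands for any assumed solver for conjunction-consistency that returns a consistent conjunction as a set of literals, or `None` if none exists.

```python
def disj_consis(sample, conj_consis):
    """Reduce disjunction-consistency to conjunction-consistency.

    `sample` is a list of (x, b) with x a 0/1 tuple and b in {0, 1};
    `conj_consis` is an assumed conjunction-consistency solver."""
    flipped = [(x, 1 - b) for (x, b) in sample]   # negate the labels
    g = conj_consis(flipped)                      # conjunction as a set of (index, sign) literals
    if g is None:
        return None                               # no consistent conjunction => no consistent disjunction
    # De Morgan: the negation of a conjunction of literals is the disjunction of their negations.
    return {(i, not sign) for (i, sign) in g}
```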
3. Decision lists
3.1 Definition
A decision list is like a decision tree where, at each internal node, at least one outgoing branch goes immediately to a leaf.
Equivalent (and most convenient) view: a linear sequence of tests (“if–else if–else if–...”) on literals.
A typical decision list:
- if \(\ell_1(\mathbf x)=1\) output \(b_1\)
- else if \(\ell_2(\mathbf x)=1\) output \(b_2\)
- ...
- else output \(b_0\) (default)
Convention noted: the branch where the literal is satisfied is the one that goes “to the leaf” (i.e., causes an immediate decision).
3.2 Example
A chain-like decision list over four variables was sketched (figure omitted); it assigns labels such as
- \((1,0,0,0) \mapsto 0\)
- \((1,0,1,0) \mapsto 1\)
(The point was to illustrate the sequential “first satisfied literal wins” behavior.)
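To make the “first satisfied literal wins” semantics concrete, here is a small evaluator (an illustrative sketch; the particular list below is chosen to match the two labels above, not necessarily the one drawn in lecture). Literals are pairs `(index, sign)`, with `sign=True` meaning \(x_i\) and `sign=False` meaning \(\neg x_i\).

```python
def eval_decision_list(nodes, default, x):
    """Scan the (literal, output) nodes in order; the first satisfied literal decides."""
    for (i, sign), output in nodes:
        if (x[i] == 1) == sign:   # literal satisfied -> immediate decision
            return output
    return default                # no literal satisfied -> default leaf

# "if x_3 output 1; else if not-x_1 output 0; else output 0"
nodes = [((2, True), 1), ((0, False), 0)]
print(eval_decision_list(nodes, default=0, x=(1, 0, 0, 0)))  # -> 0
print(eval_decision_list(nodes, default=0, x=(1, 0, 1, 0)))  # -> 1
```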
4. Consistency for decision lists: the FindList algorithm
We want to solve \(\mathrm{Consis}_{\mathrm{DL}(n)}\): given a labeled sample \(S\), find a decision list consistent with \(S\) (or report that none exists).
4.1 Useful literals
For a set of labeled examples \(S\) and a literal \(\ell\), define
\[
S_\ell \;:=\; \lbrace (\mathbf x, b) \in S \;:\; \ell(\mathbf x) = 1 \rbrace .
\]
We say \(\ell\) is useful for \(S\) if
- \(S_\ell \neq \varnothing\), and
- every \((\mathbf x,b)\in S_\ell\) has the same label \(b\).
So a useful literal “catches” at least one remaining example, and all caught examples share the same label.
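A small Python rendering of these definitions (an illustrative sketch, using the same `(index, sign)` encoding of literals as in the earlier snippets):

```python
def S_ell(S, lit):
    """Subset of S caught by the literal: S_ell = {(x, b) in S : lit(x) = 1}."""
    i, sign = lit
    return [(x, b) for (x, b) in S if (x[i] == 1) == sign]

def is_useful(S, lit):
    """Useful: catches at least one example, and all caught examples share one label."""
    caught = S_ell(S, lit)
    return len(caught) > 0 and len({b for (_, b) in caught}) == 1
```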
4.2 FindList(S)
Input: labeled sample \(S\).
- Initialize the decision list \(h \leftarrow\) empty.
- While not all examples remaining in \(S\) have the same label:
- Iterate over literals until you find a useful literal \(\ell\).
- Let \(b\) be the common label of examples in \(S_\ell\).
- Append the pair \((\ell, b)\) as the next node of the decision list.
- Update \(S \leftarrow S \setminus S_\ell\).
- Add a final default leaf labeled with the common label of the remaining set \(S\).
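A minimal Python sketch of FindList, under the same example/literal conventions as the helpers above (an illustration of the pseudocode, not the lecture's exact code); it returns the node list and default label, or `None` if at some point no useful literal exists.

```python
def find_list(S):
    """Greedy FindList: S is a nonempty list of (x, b) with x a 0/1 tuple."""
    n = len(S[0][0])
    literals = [(i, s) for i in range(n) for s in (True, False)]
    nodes, remaining = [], list(S)
    while len({b for (_, b) in remaining}) > 1:          # labels still disagree
        for (i, sign) in literals:
            caught = [(x, b) for (x, b) in remaining if (x[i] == 1) == sign]
            labels = {b for (_, b) in caught}
            if caught and len(labels) == 1:              # found a useful literal
                nodes.append(((i, sign), labels.pop()))
                remaining = [(x, b) for (x, b) in remaining if (x[i] == 1) != sign]
                break
        else:
            return None    # no useful literal: no decision list is consistent with S
    default = remaining[0][1] if remaining else 0
    return nodes, default
```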
4.3 Running time and size bounds
- Theorem: FindList solves \(\mathrm{Consis}_{\mathrm{DL}(n)}\) in time
  \[
  O(n^2 m)
  \]
  (where \(m=|S|\) and \(n\) is the number of attributes): there are at most \(n\) iterations (one per node, see the size bound below), each iteration scans at most \(2n\) literals, and testing one literal against the remaining examples costs \(O(m)\).
- The produced decision list has at most \(n\) nodes (plus the default leaf).
  - There are only \(2n\) possible literals (\(x_i\) or \(\neg x_i\)), and the algorithm never needs to reuse a literal: once \((\ell, b)\) is appended, every example satisfying \(\ell\) has been removed. Moreover, once a node on variable \(x_i\) is added, the opposite literal is satisfied by every remaining example, so it could only serve as the final default; hence each variable contributes at most one node.
- Representation/description length comment:
- A literal can be encoded using roughly \(\log_2 n\) bits for the index plus 1 bit for negation.
- Each node also stores a 1‑bit output label.
- So a decision list of length \(O(n)\) has description length \(O(n\log n)\) bits.
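Combining this description length with the Occam bound of Section 1 (taking \(B(n)=O(n\log n)\)) gives, roughly, a polynomial sample complexity for decision lists:
\[
m \;=\; O\!\left(\frac{1}{\epsilon}\Bigl(n\log n + \log\frac{1}{\delta}\Bigr)\right).
\]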
5. Why FindList works (existence of a useful literal)
Key question: why is there a useful literal at each iteration?
Assume there exists a decision list \(c\) that correctly labels all examples in the original set \(S\).
Let \(\ell^*\) be the first literal of \(c\) that is not yet in the partially-built list \(h\) and is satisfied by at least one remaining example.
- Every literal of \(c\) appearing before \(\ell^*\) is either already in \(h\) (so the examples it captures were removed in an earlier iteration, and no remaining example satisfies it) or is satisfied by no remaining example.
- Therefore \(c\) classifies every remaining example with \(\ell^*(\mathbf x)=1\) according to the output bit associated with \(\ell^*\) in \(c\); in particular, all such examples share that label.
- By the choice of \(\ell^*\), at least one remaining example satisfies it.
- (If no such \(\ell^*\) existed, no remaining example would satisfy any literal of \(c\), so \(c\) would assign every remaining example its default label; then all remaining labels would agree and the while-loop would already have terminated.)
So \(\ell^*\) is useful for the current residual set \(S\), and FindList can continue.
This is the core correctness argument.
6. Hardness of learning: 3‑term DNF
6.1 Definition and representation size
A term is an AND of literals. A 3‑term DNF is an OR of 3 terms:
\[
h \;=\; T_1 \vee T_2 \vee T_3 ,
\]
where each \(T_c\) is a conjunction of literals over \(x_1,\dots,x_n\). Since each term contains at most \(2n\) literals, the representation length is \(O(n)\) literals, i.e. \(O(n\log n)\) bits.
6.2 Computational hardness connection
Unless \(\mathrm{NP} \subseteq \mathrm{RP}\) (i.e., unless every problem in NP has a randomized polynomial-time algorithm), there is no efficient PAC-learning algorithm for 3-term DNF that outputs 3-term DNF hypotheses.
The structure is:
- Recall: an efficient PAC-learning algorithm that outputs hypotheses in \(\mathcal C_n\) (for suitable parameters) yields a randomized polynomial-time algorithm for the corresponding consistency problem \(\mathrm{Consis}_{\mathcal C_n}\).
- Show \(\mathrm{Consis}_{\text{3‑term DNF}}\) is NP‑complete.
6.3 NP‑hardness via reduction from 3‑COLOR (full construction)
We reduce 3‑COLOR to consistency of 3‑term DNF.
- Input: a graph \(G=(V,E)\) with $V=\lbrace 1,\dots,n\rbrace $.
- Output: a labeled sample \(S_G\) over Boolean attributes \(x_1,\dots,x_n\) such that:
- if \(G\) is 3‑colorable then \(S_G\) is consistent with some 3‑term DNF, and
- from any 3‑term DNF consistent with \(S_G\) we can recover a proper 3‑coloring of \(G\).
High‑level correspondence: vertices \(\leftrightarrow\) attributes; colors \(\leftrightarrow\) DNF terms; “\(v\) has color \(c\)” \(\leftrightarrow\) the term for \(c\) does not mention \(x_v\).
6.3.1 The sample \(S_G\)
For each vertex \(i\in V\), add a positive example \(\mathbf v^{(i)}\):
- \(\mathbf v^{(i)} \in \lbrace 0,1\rbrace ^n\) has \(x_i=0\) and \(x_k=1\) for all \(k\neq i\),
- label \(b=1\).
For each edge \(\lbrace i,j\rbrace \in E\), add a negative example \(\mathbf e^{(i,j)}\):
- \(\mathbf e^{(i,j)} \in \lbrace 0,1\rbrace ^n\) has \(x_i=x_j=0\) and \(x_k=1\) for all $k\notin\lbrace i,j\rbrace $,
- label \(b=0\).
So the sample contains \(|V|+|E|\) examples and is computable in polynomial time.
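The construction is easy to carry out mechanically; here is an illustrative Python sketch (function name is my own) that builds \(S_G\) for a graph on vertices \(0,\dots,n-1\).

```python
def build_sample(n, edges):
    """Build S_G: one positive example per vertex, one negative example per edge."""
    S = []
    for i in range(n):                                   # vertex examples v^(i), label 1
        S.append((tuple(0 if k == i else 1 for k in range(n)), 1))
    for (i, j) in edges:                                 # edge examples e^(i,j), label 0
        S.append((tuple(0 if k in (i, j) else 1 for k in range(n)), 0))
    return S

# A triangle: 3 vertex examples + 3 edge examples.
print(build_sample(3, [(0, 1), (1, 2), (0, 2)]))
```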
6.3.2 “If” direction: a 3‑coloring \(\Rightarrow\) a consistent 3‑term DNF
Assume \(G\) has a proper 3‑coloring $\chi:V\to\lbrace 1,2,3\rbrace $.
For each color $c\in\lbrace 1,2,3\rbrace $ define a term (a conjunction of positive literals)
\[
T_c \;:=\; \bigwedge_{k \in V \,:\, \chi(k) \neq c} x_k ,
\]
i.e., \(T_c\) contains \(x_k\) exactly when vertex \(k\) is not colored \(c\). Define the hypothesis
\[
h \;:=\; T_1 \vee T_2 \vee T_3 .
\]
Now check the labels:
- Vertex examples. Fix vertex \(i\) with \(\chi(i)=c\). In \(T_c\) the variable \(x_i\) is absent (because \(\chi(i)=c\)), while every variable that does appear in \(T_c\) is 1 on \(\mathbf v^{(i)}\). Hence \(T_c(\mathbf v^{(i)})=1\), so \(h(\mathbf v^{(i)})=1\) as required.
- Edge examples. Fix edge \(\lbrace i,j\rbrace \in E\) with (say) \(\chi(i)=c\) and \(\chi(j)=c'\), where \(c\neq c'\).
- In term \(T_c\), vertex \(j\) is not color \(c\), so \(x_j\) appears in \(T_c\); but on \(\mathbf e^{(i,j)}\) we have \(x_j=0\), so \(T_c(\mathbf e^{(i,j)})=0\).
- Symmetrically, \(T_{c'}\) contains \(x_i\), so \(T_{c'}(\mathbf e^{(i,j)})=0\).
- For the third color $c''\notin\lbrace c,c'\rbrace $, both endpoints are “not \(c''\)”, so both \(x_i\) and \(x_j\) appear in \(T_{c''}\), and since both are 0 on \(\mathbf e^{(i,j)}\), we also get \(T_{c''}(\mathbf e^{(i,j)})=0\).
Thus all three terms are 0 on every edge example, so \(h(\mathbf e^{(i,j)})=0\), matching the label.
Therefore if \(G\) is 3‑colorable, the constructed sample is consistent with a 3‑term DNF.
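For intuition, the following self-contained sketch builds the three terms from a proper coloring of a triangle and checks them against \(S_G\) (all names are illustrative; terms are represented as sets of positively occurring variable indices).

```python
def terms_from_coloring(n, coloring):
    """Term c contains x_k exactly when vertex k is NOT colored c (colors are 0, 1, 2)."""
    return [{k for k in range(n) if coloring[k] != c} for c in range(3)]

def eval_3term_dnf(terms, x):
    return 1 if any(all(x[k] == 1 for k in t) for t in terms) else 0

n, edges = 3, [(0, 1), (1, 2), (0, 2)]
S = ([(tuple(0 if k == i else 1 for k in range(n)), 1) for i in range(n)]      # vertex examples
     + [(tuple(0 if k in e else 1 for k in range(n)), 0) for e in edges])      # edge examples
terms = terms_from_coloring(n, coloring=[0, 1, 2])      # a proper coloring of the triangle
print(all(eval_3term_dnf(terms, x) == b for (x, b) in S))  # -> True
```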
6.3.3 “Only if” direction: a consistent 3‑term DNF \(\Rightarrow\) a 3‑coloring
Assume there exists a 3‑term DNF
\[
h \;=\; T_1 \vee T_2 \vee T_3
\]
that is consistent with \(S_G\).
Consider any vertex \(i\). The vertex example \(\mathbf v^{(i)}\) has label 1, so at least one term, say \(T_c\), must satisfy it: \(T_c(\mathbf v^{(i)})=1\).
On \(\mathbf v^{(i)}\) we have \(x_i=0\) and \(x_k=1\) for all \(k\neq i\). Therefore:
- \(T_c\) cannot contain the positive literal \(x_i\) (it would evaluate to 0),
- \(T_c\) cannot contain any negated literal \(\neg x_k\) for \(k\neq i\) (since \(x_k=1\) makes \(\neg x_k=0\)).
So any term that satisfies \(\mathbf v^{(i)}\) consists of positive literals on variables in $V\setminus\lbrace i\rbrace $, and it may (or may not) also include the literal \(\neg x_i\).
Define a coloring by letting
- \(\chi(i)\) be the first index $c\in\lbrace 1,2,3\rbrace $ such that \(T_c(\mathbf v^{(i)})=1\).

We claim this is a proper coloring. Suppose for contradiction there is an edge \(\lbrace i,j\rbrace \in E\) with \(\chi(i)=\chi(j)=c\).
By definition of \(\chi\), the same term \(T_c\) satisfies both vertex examples: \(T_c(\mathbf v^{(i)})=T_c(\mathbf v^{(j)})=1\). Using the constraints above:
- From \(T_c(\mathbf v^{(i)})=1\), \(T_c\) cannot contain \(x_i\) and cannot contain any \(\neg x_k\) for \(k\neq i\).
- From \(T_c(\mathbf v^{(j)})=1\), \(T_c\) cannot contain \(x_j\) and cannot contain any \(\neg x_k\) for \(k\neq j\).
Combining these, \(T_c\) contains no negated literals at all (since any \(\neg x_k\) would violate at least one of the two examples), and it contains neither \(x_i\) nor \(x_j\). Therefore every literal in \(T_c\) is a positive variable \(x_k\) with $k\notin\lbrace i,j\rbrace $.
But the edge example \(\mathbf e^{(i,j)}\) sets every such \(x_k\) to 1, so \(T_c(\mathbf e^{(i,j)})=1\), hence \(h(\mathbf e^{(i,j)})=1\)—contradicting that \(\mathbf e^{(i,j)}\) is labeled 0 in \(S_G\).
Therefore adjacent vertices cannot share the same color, and \(\chi\) is a proper 3‑coloring.
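A sketch of this decoding step (illustrative names; here a term is a set of `(index, sign)` literals, since a consistent term may in principle contain \(\neg x_i\), as noted above):

```python
def term_satisfied(term, x):
    """Evaluate a conjunction of (index, sign) literals on a 0/1 tuple x."""
    return all((x[i] == 1) == sign for (i, sign) in term)

def coloring_from_terms(n, terms):
    """Color vertex i by the first term satisfied by its vertex example v^(i)."""
    coloring = []
    for i in range(n):
        v = tuple(0 if k == i else 1 for k in range(n))
        coloring.append(next(c for c, t in enumerate(terms) if term_satisfied(t, v)))
    return coloring

# Triangle example, using the terms from the "if" direction written as literal sets.
terms = [{(1, True), (2, True)}, {(0, True), (2, True)}, {(0, True), (1, True)}]
print(coloring_from_terms(3, terms))   # -> [0, 1, 2], a proper 3-coloring
```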
6.3.4 NP‑completeness conclusion
- Membership in NP: guess the three terms and verify they match all labels in polynomial time.
- NP‑hardness: shown by the reduction above.
Hence \(\mathrm{Consis}_{\text{3‑term DNF}}\) is NP‑complete. Consequently, an efficient PAC‑learning algorithm for 3‑term DNF that outputs 3‑term DNF hypotheses would yield a randomized polynomial‑time algorithm for an NP‑complete problem, i.e. \(\mathrm{NP} \subseteq \mathrm{RP}\), which is considered unlikely.