Unboundedness, learning, and POS

🕑 6 min • 👤 Thomas Graf • 📆 February 26, 2020 in Discussions • 🏷 learnability, poverty of stimulus, lattices

Ignas Rudaitis left a comment under my unboundedness post that touches on an important issue: the interaction of unboundedness and the poverty of the stimulus (POS). My reply there had to be on the short side, so I figured I’d fill in the gaps with a short follow-up post.

Learnability : POS = Complexity theory : chess

Ignas worries that if we do not assume unboundedness, then a lot of POS arguments lose their bite because unboundedness is what makes the learning problem hard. I do not agree with the latter, at least not in the literal sense. I think the relation between learnability results and actual language acquisition is a lot more subtle, and it is comparable to the relation between complexity theory and algorithm design.

The most famous results in complexity theory are about asymptotic worst-case complexity. That’s a fancy way of saying: how hard is it to solve a problem if everything that can go wrong does go wrong and we do not get to put an a priori bound on the size of the problem? To give a flowery analogy, complexity theorists don’t study the complexity of chess, they study the complexity of chess on an arbitrarily large chess board. But since we always play chess on an 8-by-8 board, results about arbitrary \(n\)-by-\(n\) boards seem pretty pointless. Except that, in practice, they’re not. The interesting thing is that many of the problems that complexity theory tells us are hard in the unbounded case aren’t really any easier in the bounded case. Hard problems tend to strike both hard and early. Boundedness doesn’t fix the issue. What you have to do is pinpoint the true source of complexity, and unboundedness is a useful assumption in doing so.

Something similar is going on with learnability results. Learnability is all about figuring out the structure of the hypothesis space and how that can be exploited to find the target language with as little input as possible. Unboundedness can be useful in that enterprise. But the insights that are gained this way are largely independent of unboundedness because the learning challenges strike hard and early.

An example from learning SL-2

Let’s see how this general point works out in a concrete case, the learning of strictly 2-local languages (SL-2). As you might remember from an earlier post, a language is strictly local iff it can be described by a strictly local grammar, which is a finite set of forbidden substrings. An SL language is SL-2 iff all forbidden substrings are of length 2. Suppose that our alphabet only contains the symbol \(a\), and that we use \(\$\) to indicate the edge of a string. Then there are 4 distinct forbidden substrings of length 2:

  • \(aa\): don’t have \(a\) next to \(a\)
  • \(\$a\): don’t start with \(a\)
  • \(a\$\): don’t end with \(a\)
  • \(\$\$\): don’t have a string without any symbols (the empty string)

There are \(2^4 = 16\) distinct grammars that we can build from those 4 forbidden bigrams. For instance, \(\{ \$\$ \}\) permits all strings over \(a\) except the empty string. The larger grammar \(\{ \$\$, aa \}\), on the other hand, generates the string \(a\) and nothing else (\(\$\$\) rules out the empty string, and \(aa\) rules out all longer strings). Since there are only finitely many distinct SL-2 grammars, we know that the space of SL-2 languages is finite, which means that the class of SL-2 languages can be learned in the limit from positive text. But the standard learning algorithm for finite language classes isn’t all that efficient. We can do better if we pay attention to the structure of the space.
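In case you want to poke at this yourself, here’s a minimal Python sketch (mine, not something from the original discussion) that enumerates all 16 grammars and samples the strings each one permits; the representation of grammars as sets of forbidden bigrams follows the text, but the helper names `bigrams` and `generates` are just labels I picked for the illustration.

```python
from itertools import combinations

BIGRAMS = ["aa", "$a", "a$", "$$"]   # the four possible forbidden bigrams

def bigrams(string):
    """All bigrams of a string, with edge markers added."""
    padded = "$" + string + "$"
    return {padded[i:i + 2] for i in range(len(padded) - 1)}

def generates(grammar, string):
    """A string is licensed iff none of its bigrams is forbidden."""
    return bigrams(string).isdisjoint(grammar)

# all 16 subsets of the four bigrams, i.e. all SL-2 grammars over {a}
grammars = [frozenset(combo) for n in range(5)
            for combo in combinations(BIGRAMS, n)]

# sample each grammar's language with strings of length 0 through 4;
# {$$, aa}, for instance, comes out as permitting only the string a
for grammar in grammars:
    sample = [s if s else "ε" for s in ("a" * n for n in range(5))
              if generates(grammar, s)]
    print(sorted(grammar), "→", sample)
```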

To this end, we take all 16 SL-2 grammars and order them by the subset relation.
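Ordering by the subset relation is easy to check with the sketch from above; the following lines (again just my own illustration) group the grammars by size, which corresponds to the levels of the lattice in the figure below, and count the edges of the resulting Hasse diagram.

```python
# continuing the sketch above: group the 16 grammars by size; size corresponds
# to the level a grammar occupies in the powerset lattice
levels = {}
for grammar in grammars:
    levels.setdefault(len(grammar), []).append(grammar)
for size in sorted(levels, reverse=True):
    print(f"level {size}: {len(levels[size])} grammar(s)")   # 1, 4, 6, 4, 1

# the subset ordering: grammar h sits immediately above grammar g
# iff h is g plus exactly one additional forbidden bigram
edges = [(g, h) for g in grammars for h in grammars
         if g < h and len(h - g) == 1]
print(len(edges), "covering edges in the lattice")           # 32
```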

Behold the space of SL-2 grammars

Why, that’s a nice powerset lattice we’ve got there. The learner can exploit this structure as follows:

  1. Start with the grammar at the top of the lattice as the initial conjecture. That is to say, assume that everything is forbidden.
  2. If you see an input \(i\), extract all bigrams from \(i\) and remove them from the currently conjectured grammar. The intuition: bigrams we see in the input cannot be illicit and thus must not be in the set of forbidden bigrams.
  3. Continue doing this until you converge on a specific point in the lattice.

You might say this is a really obvious learning algorithm: just discard forbidden bigrams that you see in the input. What’s so great about that? Well, nothing. The interesting tidbit is the lattice structure that the learner operates over. This structure tells us that the learner can converge on the target grammar really, really fast. The number of steps the learner has to take is bounded by the number of “levels” in the lattice. That’s because removing a bigram will always move us down by at least one level in the lattice, and we can only go so far before we hit rock bottom. Instead of 16 grammars, we have to test at most 4 grammars after the initial hypothesis that everything is forbidden. The lattice structure of the hypothesis space allows us to rule out tons of potential grammars with just one piece of data.
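To make this concrete, here’s a rough sketch of that learner, building on the code above; the function name `learn_sl2` and the toy input texts are my own, chosen so you can watch the conjecture move down the lattice.

```python
def learn_sl2(text):
    """Lattice learner: start at the top of the lattice (everything is
    forbidden) and discard every bigram that is attested in the input."""
    conjecture = set(BIGRAMS)              # initial hypothesis
    for string in text:
        conjecture -= bigrams(string)      # attested bigrams can't be forbidden
        print(f"after {string!r}: {sorted(conjecture)}")
    return conjecture

# target language {a}: a single datum takes the learner down two levels
learn_sl2(["a"])            # ends at {'$$', 'aa'}

# target language a+ (all non-empty strings of a's): two data points suffice
learn_sl2(["a", "aaa"])     # ends at {'$$'}
```

Note how a single datum can move the conjecture down several levels at once; the bound of at most 4 steps really is a worst case.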

This combinatorial fact gets a lot more impressive as we move beyond bigrams. The space of SL-\(n\) grammars over the alphabet \(a\) has \(2^{2^n}\) grammars, but the learner has to test at most \(2^n\). With 5-grams, that’s already a huge difference — at most 32 grammars have to be tested in the space of 4,294,967,296 grammars. Talk about exploiting the structure of the learning space!
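If you want to double-check those numbers, the arithmetic fits in two lines (assuming, as in the text, that there are \(2^n\) possible \(n\)-grams over this one-letter alphabet):

```python
# size of the SL-n hypothesis space vs. the maximal number of grammars the
# lattice learner has to test (one per level), assuming 2^n possible n-grams
for n in range(2, 6):
    print(n, 2 ** (2 ** n), 2 ** n)   # n = 5: 4294967296 vs. 32
```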

(Un)Boundedness doesn’t matter… again

Crucially, the lattice structure of the space is completely independent of unboundedness. Suppose that we take all SL-2 languages and limit their strings to a length of 10. This does not change anything about the structure of the hypothesis space. To wit, here are two diagrams, one showing the mapping from SL-2 grammars to SL-2 string languages, the other one the mapping from SL-2 grammars to SL-2 string languages with strings of length at most 10. To avoid clutter, I omit the arrows for grammars that generate the empty language (which is deviant as a natural language anyway).

The function \(f\) maps SL-2 grammars to the corresponding SL-2 string languages
The function \(g\) maps SL-2 grammars to the corresponding SL-2 string languages, restricted to strings up to length 10

Do you see a difference? Because I certainly don’t. The boundedness has no impact on the hypothesis space, only on the relation between grammars and generated languages.
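For the skeptics, here’s one more check we can run with the earlier sketch. Over a one-letter alphabet a language is fully determined by the set of string lengths it permits, so we can represent each grammar’s language that way: once without the bound (sampling lengths up to 15 is more than enough, since an SL-2 grammar over \(a\) only distinguishes lengths 0, 1, and “two or more”) and once capped at 10. The names `f` and `g` mirror the diagrams; the rest is my own scaffolding.

```python
def language(grammar, max_len):
    """The set of permitted string lengths, up to max_len."""
    return frozenset(n for n in range(max_len + 1)
                     if generates(grammar, "a" * n))

# f approximates the unbounded mapping (lengths up to 15 are more than enough
# to tell any two SL-2 languages over {a} apart); g enforces the bound of 10
f = {gr: language(gr, 15) for gr in grammars}
g = {gr: language(gr, 10) for gr in grammars}

def partition(mapping):
    """Group together the grammars that are mapped to the same language."""
    blocks = {}
    for grammar, lang in mapping.items():
        blocks.setdefault(lang, set()).add(grammar)
    return {frozenset(block) for block in blocks.values()}

# both mappings collapse the 16 grammars into exactly the same classes,
# one of which is the class of grammars generating the empty language
assert partition(f) == partition(g)
print(len(partition(f)), "distinct languages either way")    # prints 6
```

The bound changes which language a grammar is mapped to, but the lattice of grammars itself stays exactly the same.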

We could have gone with a different hypothesis space, of course. For instance, we could have used the full space of SL-10 languages, in which case the boundedness can be directly encoded in the SL grammar. But then the space is much larger: there are \(2^{2^{10}} = 2^{1024} = \text{enormously many}\) grammars, and even the efficient lattice learner may take up to 1024 steps to find the target language. Beyond that, we also miss crucial generalizations (e.g. that no language should allow \(aa\) while forbidding \(aaa\)). And that’s the POS argument right there: the SL-10 hypothesis space furnishes grammars/languages that we really do not want if the target space is SL-2 with an upper bound on string length. And of course the same problem would hold if we lifted the upper bound on string length: the lattice of SL-10 grammars would still be a bad hypothesis space for learning SL-2 languages. Bounded, unbounded, once again it does not matter.

As I said, a red herring

Whether you assume that natural language is bounded or not, the fact remains that even within the bounded portion there is a lot of systematicity that linguistic proposals have to capture. This systematicity reflects a very constrained and richly structured hypothesis space, and exploiting this structure makes the learning problem much easier. This is what drives POS arguments. Whether the mapping from the hypothesis space to the generated languages incorporates (un)boundedness does not change this basic fact. If anything, shifting aspects like boundedness into the hypothesis space just messes things up. That’s what we saw with the step from SL-2 to SL-10 above. It’s FSAs all over again, missing generalizations and furnishing hypotheses that really shouldn’t be in that space.

POS arguments are driven by the structure of the space and the need to capture specific means of generalization. Boundedness is immaterial for that; it simply does not change the picture. A red herring indeed.