<h1>MR movement: Freezing effects &amp; monotonicity</h1>
<p>Thomas Graf, 2020-05-19</p>
<p>As you might know, I love reanalyzing linguistic phenomena in terms of monotonicity (see <a href="https://outde.xyz/2019-05-31/omnivorous-number-and-kiowa-inverse-marking-monotonicity-trumps-features.html">this earlier post</a>, <a href="http://dx.doi.org/10.15398/jlm.v7i2.211">my JLM paper</a>, and <a href="https://github.com/somoradi/somoradi/blob/master/nels49_Moradi.pdf">this NELS paper by my student Sophie Moradi</a>). I’m now in the middle of writing another paper on this topic, and it currently includes a section on freezing effects. You see, freezing effects are obviously just bog-standard monotonicity, and I’m shocked that nobody else has pointed that out before. But perhaps the reason nobody’s pointed that out before is simple: my understanding of freezing effects does not match the facts. In the middle of writing the paper, I realized that I don’t know just how much freezing effects limit movement. So I figured I’d reveal my ignorance to the world and hopefully crowd source some sorely needed insight. </p>
<h1 id="freezing-effects-primer">Freezing effects primer</h1>
<p>Freezing is the idea that once a phrase starts moving, it becomes opaque to extraction. Below you have a prototypical example of a sentence that violates the freezing condition — to keep things readable, I’m using copies instead of traces, but that’s just a descriptive device.</p>
<ol class="example" type="1">
<li>* [<sub>CP</sub> [which car] did [<sub>TP</sub> [the driver of <del>which car</del>] T [<sub><em>v</em>P</sub> <del>the driver of which car</del> <em>v</em> cause a scandal]]]</li>
</ol>
<p>Here the subject DP <em>the driver of which car</em> undergoes movement from the base subject position Spec,<em>v</em>P to the surface subject position in Spec,TP. As a result, the DP effectively turns into an island, which makes it impossible to move the wh-phrase <em>which car</em> from within the subject into Spec,CP. That’s the essence of freezing, and it can be summarized in the form of a catchy slogan:</p>
<ol start="2" class="example" type="1">
<li><strong>Freezing in a nutshell</strong><br />
Once you’ve escaped, nothing escapes from you.</li>
</ol>
<p>Freezing is the Citizen Kane of movement: a free-spirited phrase that is eager to move finally achieves success but is corrupted by it and now uses its power to keep down all the other free-spirited phrases in its domain that would like to move.</p>
<p>Freezing has a well-known loophole: since a phrase P isn’t opaque to extraction until it starts moving, other movers can escape from P as long as they do so before P moves. This still allows for instances of remnant movement as in the German example below.</p>
<ol start="3" class="example" type="1">
<li>[<sub>CP</sub> [<sub>VP</sub> <del>das Buch</del> gelesen] hat [<sub>TP</sub> das Buch der Hans T [<sub><em>v</em>P</sub> <del>der Hans</del> <em>v</em> <del>[<sub>VP</sub> das Buch gelesen]</del>]]]<br />
[<sub>CP</sub> [<sub>VP</sub> <del>the book</del> read] has [<sub>TP</sub> the book the Hans T [<sub><em>v</em>P</sub> <del>the Hans</del> <em>v</em> <del>[<sub>VP</sub> the book read]</del>]]]<br />
‘Hans <strong>read</strong> the book.’</li>
</ol>
<p>Yeah, unless you’re already familiar with the analysis, this example is a lot harder to make sense of. Let’s switch out the German for English glosses, just to make things a bit easier. Then the sentence starts out with the structure [<sub><em>v</em>P</sub> the Hans <em>v</em> [<sub>VP</sub> the book read]], where <em>the book</em> is the object of the verb <em>read</em> and <em>the Hans</em> is the subject. At this point, <em>the Hans</em> undergoes the usual subject movement to Spec,TP. Then the object <em>the book</em> moves out of the VP into some part of what’s called the <em>Mittelfeld</em>, which may be some kind of TP-specifier position. Both movement steps are allowed because neither phrase was extracted from a moving phrase. Now, finally, the whole VP moves to Spec,CP. This, too, is a licit step: freezing effects do not say that you cannot move once something has moved out of you, they say that nothing can move out of you once you start moving. And that’s definitely not the case here; nothing moves out of the VP once the VP starts moving. So the whole VP gets to move to the left edge of the sentence without any issues. Since the object had already moved out of the VP before, only the head of the VP is visible at the left edge of the surface string, giving us a sentence where it looks like just the V-head underwent movement.</p>
<p>If you’re still confused, here are the bare phrase structure trees for (1) and (3).</p>
<figure>
<img src="https://outde.xyz/img/thomas/monotonicity_freezing/bpstree_eng.svg" alt="Bare phrase structure tree for (1)" /><figcaption>Bare phrase structure tree for (1)</figcaption>
</figure>
<figure>
<img src="https://outde.xyz/img/thomas/monotonicity_freezing/bpstree_ger.svg" alt="Bare phrase structure tree for (3)" /><figcaption>Bare phrase structure tree for (3)</figcaption>
</figure>
<h1 id="connection-to-monotonicity">Connection to monotonicity</h1>
<p>For the two examples above, there is a straightforward account in terms of monotonicity. Remember that monotonicity is an order preservation principle (<a href="https://outde.xyz/2019-05-31/omnivorous-number-and-kiowa-inverse-marking-monotonicity-trumps-features.html">check this earlier post for details</a>). Given two structures <span class="math inline">\(A\)</span> and <span class="math inline">\(B\)</span> with orders <span class="math inline">\(\leq_A\)</span> and <span class="math inline">\(\leq_B\)</span>, a function <span class="math inline">\(f\)</span> from <span class="math inline">\(A\)</span> to <span class="math inline">\(B\)</span> is monotonically increasing iff <span class="math inline">\(x \leq_A y\)</span> implies <span class="math inline">\(f(x) \leq_B f(y)\)</span>. For our purposes, it will be sufficient to think of monotonicity as a generalized ban against crossing branches.</p>
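<p>Since the definition is so compact, here is a brute-force sketch of it in Python (entirely mine, not from any of the linked papers): a function between two finite ordered sets is monotone iff no pair of comparable elements has its order flipped by the mapping.</p>

```python
# A sketch: testing monotonicity of a map between two finite orders.
# Each order is given as a "less than or equal" predicate.

def is_monotone(domain, leq_a, leq_b, f):
    """True iff x <=_A y implies f(x) <=_B f(y) for all x, y in domain."""
    return all(
        leq_b(f(x), f(y))
        for x in domain
        for y in domain
        if leq_a(x, y)
    )

# Toy example with the usual order on integers: doubling preserves
# the order, negation reverses it.
nums = [0, 1, 2, 3]
leq = lambda x, y: x <= y
print(is_monotone(nums, leq, leq, lambda x: 2 * x))  # True
print(is_monotone(nums, leq, leq, lambda x: -x))     # False
```

<p>Because the orders are passed in as predicates, the same check works unchanged for partial orders, which is what tree-based orders will be.</p>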
<p>We can apply the notion of monotonicity directly to the dependency tree representation provided by Minimalist grammars (MGs). In this format, the phrase structure trees above are represented as the trees below, except that I have simplified things a bit by omitting all features and instead indicating movement dependencies via arrows.</p>
<figure>
<img src="https://outde.xyz/img/thomas/monotonicity_freezing/deptree_eng.svg" alt="MG dependency tree for (1)" /><figcaption>MG dependency tree for (1)</figcaption>
</figure>
<figure>
<img src="https://outde.xyz/img/thomas/monotonicity_freezing/deptree_ger.svg" alt="MG dependency tree for (3)" /><figcaption>MG dependency tree for (3)</figcaption>
</figure>
<p>Each dependency tree defines a partial order over the lexical items in the sentence. Intuitively, this partial order encodes syntactic prominence in terms of head-argument relations, or in Minimalist terms, (external) Merge. That is to say, if X is the daughter of Y, then Y is more prominent than X, and so is the mother of Y, and the mother of the mother of Y, and so on. Okay, so our first order for monotonicity comes straight from the MG dependency trees. Strictly speaking there are some extra steps to be taken for mathematical reasons, but I’ll ignore those here to keep things simple. So MG dependency trees will be our way of getting a partial order that I call the <strong>Merge order</strong>.</p>
<p>For our second order we construct a truncated version of the dependency trees that encodes prominence with respect to movement (internal Merge). The construction is a bit more complicated, but putting aside some edge cases it’s enough to take the dependency tree and remove all lexical items that don’t provide the landing site for some mover. This gives us the reduced structures below. I’ll call orders of this kind <strong>Move orders</strong>.<a href="#fn1" class="footnote-ref" id="fnref1"><sup>1</sup></a></p>
<figure>
<img src="https://outde.xyz/img/thomas/monotonicity_freezing/movetree_eng.svg" alt="Move order for (1)" /><figcaption>Move order for (1)</figcaption>
</figure>
<figure>
<img src="https://outde.xyz/img/thomas/monotonicity_freezing/movetree_ger.svg" alt="Move order for (3)" /><figcaption>Move order for (3)</figcaption>
</figure>
<p>Now we define a mapping <em>f</em> from the Move order to the Merge order such that each node M in the Move order is mapped to the node N in the Merge order iff M provides the final landing site for N. Again it helps to look at this in terms of pictures. As you can see, <em>f</em> essentially encodes the reverse of the arrows I added to the original dependency trees.</p>
<figure>
<img src="https://outde.xyz/img/thomas/monotonicity_freezing/mapping_eng.svg" alt="Mapping for (1)" /><figcaption>Mapping for (1)</figcaption>
</figure>
<figure>
<img src="https://outde.xyz/img/thomas/monotonicity_freezing/mapping_ger.svg" alt="Mapping for (3)" /><figcaption>Mapping for (3)</figcaption>
</figure>
<p>Notice how the lines for the verb and the object cross in the illicit English sentence, but not in the well-formed German one (the crossing of lines with the subject is just an artefact of how we draw relations in two-dimensional space). So perhaps crossing branches aren’t okay, and since monotonicity is essentially a ban against crossing branches, that would suggest that the problem with the English sentence is that it does not obey monotonicity. Freezing effects, then, amount to the requirement that a sentence’s Move order must preserve its Merge order. The only permitted form of movement is <strong>m</strong>onotonicity <strong>r</strong>especting movement, or simply MR movement.</p>
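<p>For concreteness, here is the crossing check as code. This is my own reconstruction, so the node names and tree shapes are assumptions based on my reading of the figures, and I only track the two movers whose lines matter, ignoring the subject artefact. The Merge order is ancestry in the dependency tree, the Move order is just C above T, and MR movement demands that the mapping <em>f</em> not flip that order.</p>

```python
# A sketch of the MR movement check (my reconstruction, not the post's
# own figures). parents[x] = the node immediately above x in the MG
# dependency tree; the Merge order is ancestry in that tree.

def dominates(parents, upper, lower):
    """True iff `upper` is an ancestor of `lower` (or equal to it)."""
    while lower != upper:
        if lower not in parents:
            return False
        lower = parents[lower]
    return True

def mr_ok(parents, landing):
    """landing[h] = the mover whose final landing site the head h provides.
    Since T <= C in the Move order, monotonicity demands f(T) <= f(C),
    i.e. f(C) must dominate f(T) in the Merge order."""
    return dominates(parents, landing['C'], landing['T'])

# Ill-formed English example (1): 'which' sits inside the moving subject.
eng = {'T': 'C', 'v': 'T', 'driver': 'v', 'which': 'driver'}
print(mr_ok(eng, {'C': 'which', 'T': 'driver'}))   # False: lines cross

# Well-formed German example (3): the object originates inside the VP.
ger = {'T': 'C', 'v': 'T', 'gelesen': 'v', 'Buch': 'gelesen'}
print(mr_ok(ger, {'C': 'gelesen', 'T': 'Buch'}))   # True
```

<p>So the check boils down to one dominance test per pair of movers: whatever lands higher must have started out higher.</p>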
<h1 id="why-this-might-not-work">Why this might not work</h1>
<p>Alright, that’s a nifty story, but it might not actually work. MR movement is both more limited and more permissive than freezing effects. And I’m not sure if that’s a problem.</p>
<p>Let’s first look at why MR movement is more restrictive. Freezing effects tell us that once N has been extracted from M, it is free to move to wherever it pleases. MR movement, on the other hand, can never move N to a position that’s higher than the final landing site of M. Does this ever happen? I’m not sure. German certainly furnishes cases that look like that.</p>
<ol start="4" class="example" type="1">
<li>[<sub>CP</sub> [<sub>DP</sub> das Buch] hat [<sub>TP</sub> [<sub>VP</sub> <del>das Buch</del> gelesen] <del>[<sub>DP</sub> das Buch]</del> der Hans T [<sub><em>v</em>P</sub> <del>der Hans</del> <em>v</em> <del>[<sub>VP</sub> das Buch gelesen]</del>]]]<br />
[<sub>CP</sub> [<sub>DP</sub> the book] has [<sub>TP</sub> [<sub>VP</sub> <del>the book</del> read] <del>[<sub>DP</sub> the book]</del> the Hans T [<sub><em>v</em>P</sub> <del>the Hans</del> <em>v</em> <del>[<sub>VP</sub> the book read]</del>]]]<br />
‘The book, Hans read.’</li>
</ol>
<p>But to be frank, German is a bad example to begin with because scrambling can do all kinds of stuff that won’t fly for standard movement. I can’t think of cases for other languages, but I’m also pretty bad at remembering data points, so that’s not saying much. So, yes, MR movement might be too restrictive if it is stated with respect to the final landing site.</p>
<p>One way to fix that is to redefine the Move order so that it keeps track of the first landing sites instead of the final ones. But for some reason I find that more ad hoc. It should either be all landing sites or the last one; there is no reason why the first one should enjoy some privileged status. But that’s neither here nor there, so I don’t know, maybe? Nah, I’d rather stick to my guns and reanalyze data that conflicts with MR movement.</p>
<p>But then there’s also the fact that MR movement is less restrictive. Once again it’s because I chose to focus on the final landing site instead of the first one. This means that MR movement can extract N from M after M has already started movement, provided that M eventually winds up in a higher position than N. Again I’m not sure if that’s a problem. Whenever such a case arises, one could also make it fit with freezing movement by simply positing an additional movement step that extracts N at the very beginning before M starts to move. Testing for the presence or absence of this initial movement step would be hard, so I’m not sure how things would pan out empirically. Again I’m inclined to stick with MR movement simply because it provides a different perspective on freezing effects. Maybe MR movement works, maybe it doesn’t, but either result would provide useful insights into the nature of freezing effects.</p>
<h1 id="the-crowd-sourcing-part">The crowd sourcing part</h1>
<p>Overall, freezing effects can be regarded as an instance of monotonicity, just not in the way I prefer. I define the Move order in terms of the final landing site, but to get an exact match for the standard definition of freezing one has to use the initial landing site. That’s still noteworthy as it allows us to reduce freezing to the more general principle of monotonicity, and I have argued many times that monotonicity really has a fundamental role to play in language.</p>
<p>But I’d really like to push for the MR movement perspective instead. I just find it more pleasing, and I like that it differs from the standard view of freezing on some edge cases. So what do you think? Does MR movement have a shot, or is there robust evidence against it?</p>
<section class="footnotes">
<hr />
<ol>
<li id="fn1"><p>While the Move orders in these examples are linear orders, more complex examples would produce partial orders. An example of that is <em>John slept and Mary snored</em>.<a href="#fnref1" class="footnote-back">↩</a></p></li>
</ol>
</section>
<h1>Martian substructures</h1>
<p>Thomas Graf, 2020-05-06</p>
<p>Sometimes students get hung up on the difference between <strong>substring</strong> and <strong>subsequence</strong>. But the works of Edgar Rice Burroughs have given me an idea for an exercise that might just be silly enough to permanently edge itself into students’ memory. </p>
<p>Enter John Carter, Jeddak of Jeddaks, Warlord of Mars, depicted here with his wife, Dejah Thoris, Princess of Helium, daughter of Mors Kajak, who is Jed of Helium and son to Tardos Mors, Jeddak of Helium.</p>
<figure>
<img src="https://vignette.wikia.nocookie.net/barsoom/images/2/26/Frazetta_PoM.jpg/revision/latest?cb=20090528221427" alt="John Carter and Dejah Thoris, as portrayed by Frank Frazetta" /><figcaption>John Carter and Dejah Thoris, as portrayed by Frank Frazetta</figcaption>
</figure>
<p>And here we have two of John Carter’s most loyal friends. First, Tars Tarkas, Jeddak of Thark.</p>
<figure>
<img src="https://vignette.wikia.nocookie.net/barsoom/images/a/ac/Tars_Tarkas_and_John_Carter.jpg/revision/latest?cb=20101128213452" alt="Tars Tarkas helps John Carter in his fight against the Plant Men of the Valley Dor, at the end of the River Iss; art by Michael Whelan" /><figcaption>Tars Tarkas helps John Carter in his fight against the Plant Men of the Valley Dor, at the end of the River Iss; art by Michael Whelan</figcaption>
</figure>
<p>And next Kantos Kan, Jedwar of the Heliumetic navy.</p>
<figure>
<img src="https://vignette.wikia.nocookie.net/barsoom/images/a/a2/Kantos-Kan.jpg/revision/latest?cb=20120713232546" alt="Kantos Kan, in a Barsoomian painting of unmatched verisimilitude" /><figcaption>Kantos Kan, in a Barsoomian painting of unmatched verisimilitude</figcaption>
</figure>
<p>Kantos Kan has the rare distinction of having a last name that is a <strong>substring</strong> of the first name. After all, <strong>Kan</strong><em>tos</em> = <strong>Kan</strong> + <em>tos</em>.</p>
<p>That’s not the case for Tars Tarkas. While the first name and the last name share a common <strong>prefix</strong> <em>Tar</em>, there are no strings that we can put before and after one of them to get the other. But Tars is a <strong>subsequence</strong> of Tarkas, as the latter is <strong>Tar</strong> + <em>ka</em> + <strong>s</strong>. We can build Tarkas from Tars by splicing in additional material.</p>
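<p>The two notions are easy to pin down in code, too. Here is a short Python sketch (mine, with made-up function names) that doubles as an answer checker for the exercise:</p>

```python
# Substring vs. subsequence, Barsoom edition.

def is_substring(x, y):
    """x is a Kantos Kan string of y: y = u + x + v for some strings u, v."""
    return x in y

def is_subsequence(x, y):
    """x is a Tars Tarkas sequence of y: x can be read off y from left to
    right while skipping arbitrary symbols in between."""
    rest = iter(y)
    return all(symbol in rest for symbol in x)  # `in` consumes the iterator

print(is_substring("Kan", "Kantos"))     # True:  Kantos = Kan + tos
print(is_substring("Tars", "Tarkas"))    # False
print(is_subsequence("Tars", "Tarkas"))  # True:  Tar + ka + s
print(is_subsequence("Kan", "Kantos"))   # True:  every substring is also
                                         #        a subsequence
```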
<p>So remember, substrings are <strong>Kantos Kan strings</strong>, and subsequences are <strong>Tars Tarkas sequences</strong>. Okay, your turn. For each one of the following pairs, say whether one is</p>
<ul>
<li>a Kantos Kan string of the other,</li>
<li>a Tars Tarkas sequence of the other,</li>
<li>neither.</li>
</ul>
<ol type="1">
<li><em>Jed</em> and <em>Jeddak</em></li>
<li><em>Tars</em> and <em>Thark</em></li>
<li><em>Jeddak</em> and <em>Jedwar</em></li>
<li><em>Thoris</em> and <em>Tardos Mors</em></li>
<li><em>Tars</em> and <em>Tardos Mors</em></li>
<li><em>Dejah Thoris</em> and <em>Dor</em></li>
</ol>
<p><strong>Bonus exercise</strong>: Explain why I didn’t allow a fourth option “both” in the exercise.</p>
<p><strong>Bonus bonus exercise</strong>: Design similar exercises for other entries from <a href="https://goodman-games.com/blog/2018/03/26/what-is-appendix-n/">Appendix N</a>.</p>
<h1>Categorical statements about gradience</h1>
<p>Thomas Graf, 2020-04-28</p>
<p>Omer has a <a href="https://omer.lingsite.org/blogpost-on-so-called-degrees-of-grammaticality/">great post on gradience in syntax</a>. I left a comment there that briefly touches on why gradience isn’t really that big of a deal thanks to <strong>monoids</strong> and <strong>semirings</strong>. But in a vacuum that remark might not make a lot of sense, so here’s some more background. </p>
<h2 id="gradience-in-the-broad-sense">Gradience in the broad sense</h2>
<p>My central claim is that linguists’ worries about gradience are overblown because there isn’t that much of a difference between categorical systems, which only distinguish between well-formed and ill-formed, and gradient systems, which have more shades of gray than that. In particular, the difference doesn’t matter for those aspects of grammar that linguists really care about. A grammar with only a categorical distinction isn’t irredeemably impoverished, and if your formalism gets the linguistic fundamentals wrong adding gradience won’t fix that for you.</p>
<p>Brief note: In practice, gradient systems are usually probabilistic, but there’s no need for that. The familiar system of rating sentences as well-formed, <code>?</code>, <code>??</code>, <code>?*</code>, and <code>*</code> would also be gradient. This is an important fact that’s frequently glossed over. I really wish researchers wouldn’t always jump right to probabilistic systems when they want to make something gradient. Sure, probabilities are nice because they are easy to extract from the available data, but that doesn’t mean that this is the right notion of gradience.</p>
<p>That said, this post will frequently use probabilistic grammars to illustrate more general points about gradience. The take-home message, though, applies equally to all gradient systems, whether they’re probabilistic or not.</p>
<h2 id="a-formula-for-categorical-grammars">A formula for categorical grammars</h2>
<p>Let’s start with a very simple example in the form of a <a href="https://outde.xyz/2019-08-19/the-subregular-locality-zoo-sl-and-tsl.html">strictly local grammar</a>. SL grammars are usually negative, which means that they list all the <em>n</em>-grams that must not occur in a string. But for the purposes of this post, it is preferable to convert the negative grammar into an equivalent positive grammar, which lists all the <em>n</em>-grams that may occur in a string. For example, the positive SL-2 grammar <span class="math inline">\(G\)</span> below generates the language <span class="math inline">\((ab)^+\)</span>, which contains the strings <span class="math inline">\(\mathit{ab}\)</span>, <span class="math inline">\(\mathit{abab}\)</span>, <span class="math inline">\(\mathit{ababab}\)</span>, and so on.</p>
<ol class="example" type="1">
<li><strong>Positive SL-2 grammar for <span class="math inline">\(\mathbf{(ab)^+}\)</span></strong>
<ol type="1">
<li><span class="math inline">\(\mathit{\$a}\)</span>: the string may start with <span class="math inline">\(a\)</span></li>
<li><span class="math inline">\(\mathit{ab}\)</span>: <span class="math inline">\(a\)</span> may be followed by <span class="math inline">\(b\)</span></li>
<li><span class="math inline">\(\mathit{ba}\)</span>: <span class="math inline">\(b\)</span> may be followed by <span class="math inline">\(a\)</span></li>
<li><span class="math inline">\(\mathit{b\$}\)</span>: the string may end with <span class="math inline">\(b\)</span></li>
</ol></li>
</ol>
<p>Now let’s consider how one actually decides whether a given string is well-formed with respect to this grammar. There are many equivalent ways of thinking about this, but right now we want one that emphasizes the algebraic nature of grammars.</p>
<p>Suppose we are given the string <span class="math inline">\(\mathit{abab}\)</span>. As always with an SL grammar, we first add edge markers to it, giving us <span class="math inline">\(\mathit{\$abab\$}\)</span>. That’s just a mathematical trick to clearly distinguish the first and last symbol of the string. The SL grammar decides the well-formedness of the string <span class="math inline">\(\mathit{\$abab\$}\)</span> based on whether the bigrams that occur in it are well-formed. Those bigrams are (including repetitions)</p>
<ol type="1">
<li><span class="math inline">\(\mathit{\$a}\)</span>,</li>
<li><span class="math inline">\(\mathit{ab}\)</span>,</li>
<li><span class="math inline">\(\mathit{ba}\)</span>,</li>
<li><span class="math inline">\(\mathit{ab}\)</span>,</li>
<li><span class="math inline">\(\mathit{b\$}\)</span>.</li>
</ol>
<p>We can write this as a single formula that doesn’t make a lick of sense at this point:</p>
<p><span class="math display">\[G(\mathit{\$abab\$}) := f(\$a) \otimes f(ab) \otimes f(ba) \otimes f(ab) \otimes f(b\$)\]</span></p>
<p>It sure looks fancy, but I haven’t really done anything substantial here. Let’s break this formula down into its components:</p>
<ul>
<li><span class="math inline">\(G(\mathit{\$abab\$})\)</span> is the value that the grammar <span class="math inline">\(G\)</span> assigns to the string <span class="math inline">\(\mathit{\$abab\$}\)</span>. Since <span class="math inline">\(G\)</span> is categorical, this can be <span class="math inline">\(1\)</span> for <em>well-formed</em> or <span class="math inline">\(0\)</span> for <em>ill-formed</em>.</li>
<li><span class="math inline">\(:=\)</span> means “is defined as”.</li>
<li><span class="math inline">\(f\)</span> is some mystery function that maps each bigram to some value.</li>
<li><span class="math inline">\(\otimes\)</span> is some mystery operation that combines the values produced by <span class="math inline">\(f\)</span>.</li>
</ul>
<p>The formula expresses in mathematical terms the most fundamental rule of SL grammars: the value that <span class="math inline">\(G\)</span> assigns to <span class="math inline">\(\mathit{\$abab\$}\)</span> depends on the bigrams that occur in the string. Each bigram in the string is mapped to some value, and then all these values are combined into an aggregate value for the string. The only reason the formula looks weird is because I haven’t told you what <span class="math inline">\(f\)</span> and <span class="math inline">\(\otimes\)</span> are.</p>
<p>The cool thing is, <span class="math inline">\(f\)</span> and <span class="math inline">\(\otimes\)</span> can be lots of things. That’s exactly what will allow us to unify categorical and gradient grammars. But let’s not get ahead of ourselves, let’s just focus on <span class="math inline">\(f\)</span> and <span class="math inline">\(\otimes\)</span> for our categorical example grammar <span class="math inline">\(G\)</span>.</p>
<p>We start with <span class="math inline">\(f\)</span>. This function maps a bigram <span class="math inline">\(b\)</span> to <span class="math inline">\(1\)</span> if it is a licit bigram according to our grammar <span class="math inline">\(G\)</span>. If <span class="math inline">\(b\)</span> is not a licit bigram, <span class="math inline">\(f\)</span> maps it to <span class="math inline">\(0\)</span>.</p>
<p><span class="math display">\[
f(b) :=
\begin{cases}
1 & \text{if } b \text{ is a licit bigram of } G\\
0 & \text{otherwise}
\end{cases}
\]</span></p>
<p>Let’s go back to the formula above and fill in the corresponding values according to <span class="math inline">\(f\)</span> and <span class="math inline">\(G\)</span>.</p>
<p><span class="math display">\[
\begin{align*}
G(\mathit{\$abab\$}) := & f(\$a) \otimes f(ab) \otimes f(ba) \otimes f(ab) \otimes f(b\$)\\
= & 1 \otimes 1 \otimes 1 \otimes 1 \otimes 1\\
\end{align*}
\]</span></p>
<p>Compare this to the formula for the illicit string <span class="math inline">\(\mathit{\$abba\$}\)</span>.</p>
<p><span class="math display">\[
\begin{align*}
G(\mathit{\$abba\$}) := & f(\$a) \otimes f(ab) \otimes f(bb) \otimes f(ba) \otimes f(a\$)\\
= & 1 \otimes 1 \otimes 0 \otimes 1 \otimes 0\\
\end{align*}
\]</span></p>
<p>Notice how we get <span class="math inline">\(1\)</span> or <span class="math inline">\(0\)</span> depending on whether the bigram is licit according to grammar <span class="math inline">\(G\)</span>.</p>
<p>This only leaves us with <span class="math inline">\(\otimes\)</span>. The job of this operation is to combine the values produced by <span class="math inline">\(f\)</span> such that we get <span class="math inline">\(1\)</span> if the string is well-formed, and <span class="math inline">\(0\)</span> otherwise. A string is well-formed iff it does not contain even one illicit bigram, or equivalently, iff there isn’t a single bigram that was mapped to <span class="math inline">\(0\)</span> by <span class="math inline">\(f\)</span>. If there is even one <span class="math inline">\(0\)</span>, the whole aggregate value must be <span class="math inline">\(0\)</span>. We can replace <span class="math inline">\(\otimes\)</span> with any operation that satisfies this property — multiplication, for instance, will do just fine.</p>
<p><span class="math display">\[
\begin{align*}
G(\mathit{\$abab\$}) := & f(\$a) \otimes f(ab) \otimes f(ba) \otimes f(ab) \otimes f(b\$)\\
= & 1 \otimes 1 \otimes 1 \otimes 1 \otimes 1\\
= & 1 \times 1 \times 1 \times 1 \times 1\\
= & 1\\
\end{align*}
\]</span> <span class="math display">\[
\begin{align*}
G(\mathit{\$abba\$}) := & f(\$a) \otimes f(ab) \otimes f(bb) \otimes f(ba) \otimes f(a\$)\\
= & 1 \otimes 1 \otimes 0 \otimes 1 \otimes 0\\
= & 1 \times 1 \times 0 \times 1 \times 0\\
= & 0\\
\end{align*}
\]</span></p>
<p>Tada, the well-formed string gets a 1, the ill-formed string a 0, just as intended. Any string that contains at least one illicit bigram will be mapped to 0 because whenever you multiply by 0, you get 0. The only way for a string to get mapped to 1 is if it consists only of well-formed bigrams. This is exactly the intuition we started out with: the well-formedness of a string is contingent on the well-formedness of its parts; in this case, bigrams.</p>
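<p>Here is the entire computation as a small Python sketch (the code and its function names are mine; the grammar and the values are exactly the ones above):</p>

```python
# Categorical SL-2 grammar as f plus a combination operation,
# with multiplication playing the role of the ⊗ operation.

def bigrams(string):
    """Add the edge markers and return all adjacent symbol pairs."""
    marked = "$" + string + "$"
    return [marked[i:i + 2] for i in range(len(marked) - 1)]

G = {"$a", "ab", "ba", "b$"}  # the positive SL-2 grammar from above

def f(bigram):
    """1 if the bigram is licit according to G, 0 otherwise."""
    return 1 if bigram in G else 0

def value(string):
    """Combine the bigram values with multiplication."""
    result = 1
    for b in bigrams(string):
        result *= f(b)
    return result

print(value("abab"))  # 1: every bigram is licit
print(value("abba"))  # 0: the bigrams bb and a$ are mapped to 0
```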
<h2 id="a-formula-for-gradient-grammars">A formula for gradient grammars</h2>
<p>While it’s certainly refreshing to think of a grammar as a device for multiplying <span class="math inline">\(1\)</span>s and <span class="math inline">\(0\)</span>s, there is a deeper purpose to this view. Here’s the crucial twist: the formula above also works for gradient SL grammars, we just have to change <span class="math inline">\(f\)</span> and <span class="math inline">\(\otimes\)</span>. If we use probabilities, we can even keep <span class="math inline">\(\otimes\)</span> the same. The math works exactly the same for categorical and probabilistic grammars.</p>
<p>First, let’s turn our categorical example grammar into a probabilistic one by assigning each bigram a probability. I’ll use arbitrary numbers here, in the real world those probabilities would usually come from a corpus.</p>
<ol start="2" class="example" type="1">
<li><strong>Probabilistic SL-2 grammar for <span class="math inline">\(\mathbf{(ab)^+}\)</span></strong>
<ol type="1">
<li><span class="math inline">\(\mathit{\$a}\)</span>: the probability that a string starts with <span class="math inline">\(a\)</span> is 100%</li>
<li><span class="math inline">\(\mathit{ab}\)</span>: the probability that <span class="math inline">\(a\)</span> is followed by <span class="math inline">\(b\)</span> is 100%</li>
<li><span class="math inline">\(\mathit{ba}\)</span>: the probability that <span class="math inline">\(b\)</span> is followed by <span class="math inline">\(a\)</span> is 75%</li>
<li><span class="math inline">\(\mathit{b\$}\)</span>: the probability that <span class="math inline">\(b\)</span> is not followed by anything is 25%</li>
</ol></li>
</ol>
<p>Now that the grammar is probabilistic, we also have to change our formula. Except that we don’t! We keep everything the way it is and only interpret <span class="math inline">\(f\)</span> differently. The function <span class="math inline">\(f\)</span> no longer tells us whether a bigram is licit, it instead gives us the probability of the bigram according to <span class="math inline">\(G\)</span>. The probability for bigrams that aren’t listed in the grammar is set to <span class="math inline">\(0\)</span>.</p>
<p><span class="math display">\[
\begin{align*}
G(\mathit{\$abab\$}) := & f(\$a) \otimes f(ab) \otimes f(ba) \otimes f(ab) \otimes f(b\$)\\
= & 1 \otimes 1 \otimes .75 \otimes 1 \otimes .25\\
= & 1 \times 1 \times .75 \times 1 \times .25\\
= & .1875\\
\end{align*}
\]</span> <span class="math display">\[
\begin{align*}
G(\mathit{\$abba\$}) := & f(\$a) \otimes f(ab) \otimes f(bb) \otimes f(ba) \otimes f(a\$)\\
= & 1 \otimes 1 \otimes 0 \otimes .75 \otimes 0\\
= & 1 \times 1 \times 0 \times .75 \times 0\\
= & 0\\
\end{align*}
\]</span></p>
<p>Compare that to the formula we had for the categorical grammar — it’s exactly the same mechanism! Nothing here has changed except the values. The value of the whole is still computed from the values of the same parts.</p>
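<p>To make this concrete, here is a small Python sketch of the computation (my own toy code, not from any paper or toolkit): <code>f</code> is a dictionary from bigrams to probabilities, unlisted bigrams default to <code>0</code>, and <code>⊗</code> is ordinary multiplication.</p>

```python
# Probabilistic SL-2 grammar for (ab)*: f maps bigrams to probabilities,
# and the value of a string is the product of its bigram values.
# "$" marks the string edges.

f = {"$a": 1.0, "ab": 1.0, "ba": 0.75, "b$": 0.25}

def bigrams(string):
    """Decompose an edge-marked string into its bigrams."""
    padded = "$" + string + "$"
    return [padded[i:i + 2] for i in range(len(padded) - 1)]

def G(string):
    """Multiply the values of all bigrams; unlisted bigrams get 0."""
    value = 1.0
    for bg in bigrams(string):
        value *= f.get(bg, 0.0)
    return value

print(G("abab"))  # 0.1875
print(G("abba"))  # 0.0, since f(bb) = 0
```

<p>Setting every listed probability to <code>1</code> turns this back into the categorical grammar: well-formed strings get value <code>1</code>, everything else <code>0</code>.</p>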
<h2 id="a-trivalent-sl-grammar">A trivalent SL grammar</h2>
<p>What if we want to do a trivalent system, with well-formed, borderline, and ill-formed? Let’s modify our categorical grammar so that it marginally allows <span class="math inline">\(\mathit{bb}\)</span>.</p>
<ol start="3" class="example" type="1">
<li><strong>Trivalent SL-2 grammar for <span class="math inline">\(\mathbf{(ab)^*}\)</span></strong>
<ol type="1">
<li><span class="math inline">\(\mathit{\$a}\)</span>: the string may start with <span class="math inline">\(a\)</span></li>
<li><span class="math inline">\(\mathit{ab}\)</span>: <span class="math inline">\(a\)</span> may be followed by <span class="math inline">\(b\)</span></li>
<li><span class="math inline">\(\mathit{ba}\)</span>: <span class="math inline">\(b\)</span> may be followed by <span class="math inline">\(a\)</span></li>
<li><span class="math inline">\(\mathit{b\$}\)</span>: the string may end with <span class="math inline">\(b\)</span></li>
<li><span class="math inline">\(\mathit{bb}\)</span>: <span class="math inline">\(b\)</span> may be marginally followed by <span class="math inline">\(b\)</span></li>
</ol></li>
</ol>
<p>The corresponding formula once again will stay the same. But instead of <span class="math inline">\(0\)</span> and <span class="math inline">\(1\)</span>, we will use three values:</p>
<ul>
<li><span class="math inline">\(1\)</span>: well-formed</li>
<li><span class="math inline">\(?\)</span>: borderline</li>
<li><span class="math inline">\(*\)</span>: ill-formed</li>
</ul>
<p>Instead of multiplication, <span class="math inline">\(\otimes\)</span> is now an operation <span class="math inline">\(\mathrm{min}\)</span> that always returns the least licit value, as specified in the table below.</p>
<table>
<thead>
<tr class="header">
<th style="text-align: right;"><span class="math inline">\(\mathrm{min}\)</span></th>
<th style="text-align: center;"><span class="math inline">\(\mathbf{1}\)</span></th>
<th style="text-align: center;"><span class="math inline">\(\mathbf{?}\)</span></th>
<th style="text-align: center;"><span class="math inline">\(\mathbf{*}\)</span></th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td style="text-align: right;"><span class="math inline">\(\mathbf{1}\)</span></td>
<td style="text-align: center;"><span class="math inline">\(1\)</span></td>
<td style="text-align: center;"><span class="math inline">\(?\)</span></td>
<td style="text-align: center;"><span class="math inline">\(*\)</span></td>
</tr>
<tr class="even">
<td style="text-align: right;"><span class="math inline">\(\mathbf{?}\)</span></td>
<td style="text-align: center;"><span class="math inline">\(?\)</span></td>
<td style="text-align: center;"><span class="math inline">\(?\)</span></td>
<td style="text-align: center;"><span class="math inline">\(*\)</span></td>
</tr>
<tr class="odd">
<td style="text-align: right;"><span class="math inline">\(\mathbf{*}\)</span></td>
<td style="text-align: center;"><span class="math inline">\(*\)</span></td>
<td style="text-align: center;"><span class="math inline">\(*\)</span></td>
<td style="text-align: center;"><span class="math inline">\(*\)</span></td>
</tr>
</tbody>
</table>
<p>And here are the corresponding formulas for our familiar example strings <span class="math inline">\(\mathit{\$abab\$}\)</span> and <span class="math inline">\(\mathit{\$abba\$}\)</span>.</p>
<p><span class="math display">\[
\begin{align*}
G(\mathit{\$abab\$}) := & f(\$a) \otimes f(ab) \otimes f(ba) \otimes f(ab) \otimes f(b\$)\\
= & 1 \otimes 1 \otimes 1 \otimes 1 \otimes 1\\
= & 1 \mathrel{\mathrm{min}} 1 \mathrel{\mathrm{min}} 1 \mathrel{\mathrm{min}} 1 \mathrel{\mathrm{min}} 1\\
= & 1\\
\end{align*}
\]</span> <span class="math display">\[
\begin{align*}
G(\mathit{\$abba\$}) := & f(\$a) \otimes f(ab) \otimes f(bb) \otimes f(ba) \otimes f(a\$)\\
= & 1 \otimes 1 \otimes ? \otimes 1 \otimes *\\
= & 1 \mathrel{\mathrm{min}} 1 \mathrel{\mathrm{min}} ? \mathrel{\mathrm{min}} 1 \mathrel{\mathrm{min}} *\\
= & *\\
\end{align*}
\]</span></p>
<p>Note how the second string is still considered ill-formed. While the presence of the bigram <span class="math inline">\(\mathit{bb}\)</span> degrades it to borderline status, the presence of the illicit bigram <span class="math inline">\(\mathit{a\$}\)</span> means that we cannot assign a higher value than <span class="math inline">\(*\)</span>.</p>
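<p>Here is the same sketch adapted to the trivalent system (again just my own illustration): the three values are ordered <code>1 &gt; ? &gt; *</code>, and <code>⊗</code> returns the least licit one.</p>

```python
# Trivalent SL-2 grammar: "1" well-formed, "?" borderline, "*" ill-formed.
# otimes is min with respect to the licitness order 1 > ? > *.

f = {"$a": "1", "ab": "1", "ba": "1", "b$": "1", "bb": "?"}
rank = {"1": 2, "?": 1, "*": 0}  # higher rank = more licit

def bigrams(string):
    padded = "$" + string + "$"
    return [padded[i:i + 2] for i in range(len(padded) - 1)]

def G(string):
    """Return the least licit bigram value; unlisted bigrams count as '*'."""
    values = [f.get(bg, "*") for bg in bigrams(string)]
    return min(values, key=lambda v: rank[v])

print(G("abab"))  # 1
print(G("abba"))  # *, because the final bigram a$ is not listed
print(G("abb"))   # ?, the only blemish is the borderline bb
```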
<h2 id="beyond-acceptability">Beyond acceptability</h2>
<p>We can even use this formula to calculate aspects of the string that have nothing at all to do with well-formedness or acceptability. Suppose that <span class="math inline">\(f\)</span> once again maps each bigram to <span class="math inline">\(1\)</span> or <span class="math inline">\(0\)</span> depending on whether it is licit according to <span class="math inline">\(G\)</span>. Next, we instantiate <span class="math inline">\(\otimes\)</span> as addition. Then we have a formula that calculates the number of licit bigrams in the string.</p>
<p><span class="math display">\[
\begin{align*}
G(\mathit{\$abab\$}) := & f(\$a) \otimes f(ab) \otimes f(ba) \otimes f(ab) \otimes f(b\$)\\
= & 1 \otimes 1 \otimes 1 \otimes 1 \otimes 1\\
= & 1 + 1 + 1 + 1 + 1\\
= & 5\\
\end{align*}
\]</span> <span class="math display">\[
\begin{align*}
G(\mathit{\$abba\$}) := & f(\$a) \otimes f(ab) \otimes f(bb) \otimes f(ba) \otimes f(a\$)\\
= & 1 \otimes 1 \otimes 0 \otimes 1 \otimes 0\\
= & 1 + 1 + 0 + 1 + 0\\
= & 3\\
\end{align*}
\]</span></p>
<p>Or maybe <span class="math inline">\(f\)</span> replaces each bigram <span class="math inline">\(g\)</span> with the singleton set <span class="math inline">\(\{g\}\)</span>. And <span class="math inline">\(\otimes\)</span> will be <span class="math inline">\(\cup\)</span>, the set union operation. Then the formula maps each string to the set of bigrams that occur in it.</p>
<p><span class="math display">\[
\begin{align*}
G(\mathit{\$abab\$}) := & f(\$a) \otimes f(ab) \otimes f(ba) \otimes f(ab) \otimes f(b\$)\\
= & \{\$a\} \otimes \{ab\} \otimes \{ba\} \otimes \{ab\} \otimes \{b\$\}\\
= & \{\$a\} \cup \{ab\} \cup \{ba\} \cup \{ab\} \cup \{b\$\}\\
= & \{\$a, ab, ba, b\$\}\\
\end{align*}
\]</span> <span class="math display">\[
\begin{align*}
G(\mathit{\$abba\$}) := & f(\$a) \otimes f(ab) \otimes f(bb) \otimes f(ba) \otimes f(a\$)\\
= & \{\$a\} \otimes \{ab\} \otimes \{bb\} \otimes \{ba\} \otimes \{a\$\}\\
= & \{\$a\} \cup \{ab\} \cup \{bb\} \cup \{ba\} \cup \{a\$\}\\
= & \{\$a, ab, bb, ba, a\$\}\\
\end{align*}
\]</span></p>
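<p>Both of these instantiations fit the same mold, as a quick sketch shows (toy code, same caveats as before):</p>

```python
# Two more instantiations over the same bigram decomposition:
# (a) f maps a bigram to 1 if licit and 0 otherwise, otimes is +,
#     which yields the number of licit bigrams;
# (b) f maps a bigram g to the singleton {g}, otimes is set union,
#     which yields the set of bigrams occurring in the string.

licit = {"$a", "ab", "ba", "b$"}

def bigrams(string):
    padded = "$" + string + "$"
    return [padded[i:i + 2] for i in range(len(padded) - 1)]

def count_licit(string):
    return sum(1 if bg in licit else 0 for bg in bigrams(string))

def bigram_set(string):
    out = set()
    for bg in bigrams(string):
        out |= {bg}  # union of singleton sets
    return out

print(count_licit("abab"))         # 5
print(count_licit("abba"))         # 3
print(sorted(bigram_set("abab")))  # ['$a', 'ab', 'b$', 'ba']
```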
<p>Is there a point to these instantiations of <span class="math inline">\(f\)</span> and <span class="math inline">\(\otimes\)</span>? They can be useful for certain computational tasks, but from a linguistic perspective there really isn’t much point to them. But, you know what, I’d say the same is true for all the other instantiations we’ve seen so far. If you’re a linguist, you shouldn’t worry at all about how <span class="math inline">\(f\)</span> and <span class="math inline">\(\otimes\)</span> are instantiated.</p>
<h2 id="grammars-combine-they-dont-calculate">Grammars combine, they don’t calculate</h2>
<p>The general upshot is this: a grammar is a mechanism for determining the values of the whole from values of its parts. The difference between grammars is what parts they look at and how they relate them to each other.</p>
<p>A TSL grammar, for instance, would have a different formula. In a TSL grammar, we ignore irrelevant symbols in the string. So if we have a grammar that cares about <span class="math inline">\(a\)</span> but not <span class="math inline">\(b\)</span>, the corresponding formula for the string <span class="math inline">\(\mathit{abba}\)</span> would be <span class="math inline">\(f(\$a) \otimes f(aa) \otimes f(a\$)\)</span>. This is only a minor change because TSL grammars are very similar to SL grammars. The formula for, say, a finite-state automaton would differ by quite a bit more. That’s what linguistic analysis is all about. Linguistics is about determining the <strong>shape of the formula</strong>!</p>
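<p>In code, the TSL change really is minor (a hypothetical sketch with a tier that contains only <code>a</code>): project the tier first by deleting all irrelevant symbols, then compute bigrams exactly as before.</p>

```python
# TSL sketch: delete non-tier symbols, then take bigrams as usual.
tier = {"a"}

def tier_bigrams(string):
    """Bigrams of the tier projection of the string."""
    projected = "".join(s for s in string if s in tier)
    padded = "$" + projected + "$"
    return [padded[i:i + 2] for i in range(len(padded) - 1)]

print(tier_bigrams("abba"))  # ['$a', 'aa', 'a$']
```

<p>The three bigrams are exactly the parts in the formula <span class="math inline">\(f(\$a) \otimes f(aa) \otimes f(a\$)\)</span> above.</p>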
<p>But that’s not what the categorical VS gradience divide is about. That only kicks in once you have determined the overall shape of the formula and need to define <span class="math inline">\(f\)</span> and <span class="math inline">\(\otimes\)</span>. And that choice simply isn’t very crucial from a linguistic perspective.</p>
<p>There’s many different choices for <span class="math inline">\(f\)</span> and <span class="math inline">\(\otimes\)</span> depending on what you want to do. But the choices that are useful for a linguist will always be limited in such a way that they form a particular kind of algebraic structure that’s called a <strong>monoid</strong>. I won’t bug you with <a href="https://en.wikipedia.org/wiki/Monoid">the mathematical details of monoids</a>. Whether you prefer a categorical system or a gradient system, rest assured there’s a suitable monoid for that. And that’s all that matters. That’s why linguists shouldn’t worry about the categorical VS gradience divide — linguistic insights are about the overall shape of the formula, not about calculating the result.</p>
<h2 id="from-string-to-trees-semirings">From string to trees: semirings</h2>
<p>Okay, there’s one minor complication that I’d like to cover just to cross all <em>t</em>s and dot all <em>i</em>s. If you’re already worn out, just skip ahead to the wrap-up.</p>
<p>Beyond the pleasant valleys of string land lies the thicket of tree land. In tree land, things can get a bit more complicated depending on what your question is. Not always, though. It really depends on what kind of value you’re trying to compute.</p>
<p>If you just want to know whether a specific tree is well-formed, nothing really changes. Take your standard phrase structure grammar. A rewrite rule of the form <code>S -> NP VP</code> is a tree bigram where the mother is <code>S</code> and the daughters are <code>NP</code> and <code>VP</code>. Just like we can break down a string into its string bigrams, we can break down a tree into its tree bigrams. And the value of the whole tree according to a phrase structure grammar is computed by combining the values of its tree bigrams. With more expressive formalisms like MGs, things are once again more complicated, just like a finite-state automaton uses a more complicated formula in string land than the one for SL grammars above. But the general principle remains the same: once you have a formula for how the parts interact, you can plug in the operators you want. As before, we can switch between gradient and categorical systems by tweaking the values of <span class="math inline">\(f\)</span> and <span class="math inline">\(\otimes\)</span>, under the condition that this still gets us a monoid.</p>
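<p>A small sketch of the tree case (toy code; trees are encoded as <code>(label, [children])</code> pairs, which is my own ad hoc encoding, not anybody’s official format):</p>

```python
# Tree bigrams: a rewrite rule like S -> NP VP corresponds to the
# tree bigram (mother, (daughter labels)).

def tree_bigrams(tree):
    label, children = tree
    if not children:
        return set()
    local = {(label, tuple(child[0] for child in children))}
    for child in children:
        local |= tree_bigrams(child)
    return local

# The tree [S [NP D N] [VP V]]
tree = ("S", [("NP", [("D", []), ("N", [])]), ("VP", [("V", [])])])
rules = {("S", ("NP", "VP")), ("NP", ("D", "N")), ("VP", ("V",))}

def G(tree):
    """Categorical value of a tree: 1 iff every tree bigram is licit."""
    value = 1
    for bg in tree_bigrams(tree):
        value *= 1 if bg in rules else 0
    return value

print(G(tree))  # 1
```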
<p>I think this is actually enough for syntax. But perhaps you want to talk about the value of a string, rather than a tree. This is a more complex value because one string can correspond to multiple trees. For instance, in probabilistic syntax the probability of the string</p>
<ol start="4" class="example" type="1">
<li>I eat sushi with edible chopsticks.</li>
</ol>
<p>is the sum of the probabilities of two distinct trees:</p>
<ol start="5" class="example" type="1">
<li>[I eat [sushi with edible chopsticks]]</li>
<li>[I [[eat sushi] [with edible chopsticks]]]</li>
</ol>
<p>So <span class="math inline">\(\otimes\)</span> by itself is not enough, there is yet another operation. For probabilistic grammars it’s <span class="math inline">\(+\)</span>, but we may again replace it with a more general mystery operation <span class="math inline">\(\oplus\)</span>. The job of <span class="math inline">\(\oplus\)</span> is to combine all the values computed by <span class="math inline">\(\otimes\)</span>. Like <span class="math inline">\(\otimes\)</span>, <span class="math inline">\(\oplus\)</span> has to yield a monoid of some kind, and the combination of <span class="math inline">\(\oplus\)</span> and <span class="math inline">\(\otimes\)</span> has to form a <strong>semiring</strong>. Again I’ll completely <a href="https://en.wikipedia.org/wiki/Semiring">gloss over the math</a>. Let’s focus only on the essential point: once again the split between categorical systems and gradient systems is not very large because either way we end up with a semiring. The nature of the grammar stays the same, only the system for computing compound values uses different functions and operators.</p>
<p>You might be wondering what a categorical grammar looks like from the semiring perspective. What is the mysterious operation <span class="math inline">\(\oplus\)</span> in that case? It can’t be addition because <span class="math inline">\(1 + 1\)</span> would give us <span class="math inline">\(2\)</span>, which isn’t a possible value in a categorical system. No, with categorical systems, <span class="math inline">\(\oplus\)</span> behaves like logical <em>or</em>: it returns 1 if there is at least one 1. Suppose, then, that we want to know if some string <em>s</em> is well-formed according to some categorical grammar <span class="math inline">\(G\)</span>. Here is how this would work in a very simplified manner:</p>
<ol type="1">
<li>We look at all possible trees that yield the string <em>s</em>, even if those trees are ill-formed according to <span class="math inline">\(G\)</span>.</li>
<li>We use <span class="math inline">\(\otimes\)</span> to compute the compound value for each tree. As before, <span class="math inline">\(\otimes\)</span> is multiplication (but it could also be logical <em>and</em>, if you find that more pleasing). Well-formed trees will evaluate to <span class="math inline">\(1\)</span>, ill-formed ones to <span class="math inline">\(0\)</span>.</li>
<li>We then use <span class="math inline">\(\oplus\)</span>, i.e. logical <em>or</em>, to combine all those compound values into a single value for the string <em>s</em>. Then <em>s</em> will get the value <span class="math inline">\(1\)</span>, and hence be deemed well-formed, iff there is at least one well-formed tree that yields <em>s</em>.</li>
</ol>
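<p>A toy illustration of the two semirings (my own simplification: each “tree” is represented just by the list of its part values, so the tree-decomposition step is skipped):</p>

```python
# The value of a string is the oplus-combination of the otimes-values
# of all trees that yield it.

from functools import reduce

def string_value(trees, otimes, oplus, one, zero):
    tree_vals = [reduce(otimes, t, one) for t in trees]
    return reduce(oplus, tree_vals, zero)

# Probabilistic semiring: otimes = *, oplus = +
trees = [[1, 1, 0.25], [1, 0.5, 0.25]]  # two parses with their part values
prob = string_value(trees, lambda x, y: x * y, lambda x, y: x + y, 1, 0)
print(prob)  # 0.25 + 0.125 = 0.375

# Boolean semiring: otimes = and, oplus = or. The string is well-formed
# iff at least one of its trees evaluates to 1.
bool_trees = [[1, 1, 0], [1, 1, 1]]
wf = string_value(bool_trees, lambda x, y: x and y, lambda x, y: x or y, 1, 0)
print(wf)  # 1
```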
<p>Okay, that’s not how we usually think about well-formedness. We view the grammar as a system for specifying a specific set of well-formed trees, rather than a function that maps every logically conceivable tree to some value. But as you hopefully remember from your semantics intro, there is no difference between a set and its characteristic function. The procedure above treats the grammar as the characteristic function of the set of well-formed trees. Most of the time that’s not very illuminating for linguistics, but when it comes to the split between categorical and gradient it is really useful because it reveals the monoid/semiring structure of the grammar formalism.</p>
<h2 id="wrapping-up-dont-worry-be-happy">Wrapping up: Don’t worry, be happy</h2>
<p>Monoids and semirings are a very abstract perspective on grammars, and I rushed through them in a (failed?) attempt to keep the post at a manageable length. But behind all that math is the simple idea that syntacticians, and linguists in general, don’t need to worry that a categorical grammar formalism is somehow irreconcilable with the fact that acceptability judgments are gradient. Even if we don’t factor out gradience as a performance phenomenon, even if we want to place it in the heart of grammar, that does not require us to completely retool our grammar formalisms. The change is largely mathematical in nature and doesn’t touch on the things that linguists care about. Linguists care about representations and how specific parts of those representations can interact. In the mathematical terms I used in this post, those issues are about the shape of the formula for computing <span class="math inline">\(G(o)\)</span> for some object <span class="math inline">\(o\)</span>. It is not about the specific values or operators that appear in the formula.</p>
<p>In many cases, there’s actually many different operators that give the same result. We interpreted <span class="math inline">\(\otimes\)</span> as multiplication for categorical SL grammars, but we could have also used logical <em>and</em> or the <em>min</em> function. They all produce exactly the same values. No linguist would ever worry about which one of those functions is the right choice. The choice between categorical, probabilistic, or some other kind of gradient system isn’t all that different. Worrying about the correct way of instantiating <span class="math inline">\(f\)</span>, <span class="math inline">\(\otimes\)</span>, and possibly <span class="math inline">\(\oplus\)</span> is, once again, needless.</p>
<p>That’s not to say that switching out, say, a categorical semiring for a probabilistic one is a trivial affair. It can create all kinds of problems. But those are mathematical problems, computational problems, they are not linguistic problems. It’s stuff like computing infinite sums of bounded reals. It’s decidedly not a linguistic issue. So don’t worry, be happy.</p>
<h1 id="just-your-regular-regular-expression">Just your regular regular expression</h1>
<p>2020-04-24, by Thomas Graf</p>
<p>Outdex posts can be a dull affair, always obsessed with language and computation (it’s the official blog motto, you know). Today, I will deviate from this with a post that’s obsessed with, wait for it, computation and language. Big difference. Our juicy topic will be regular expressions. And don’t you worry, we’ll get to the “and language” part. </p>
<h2 id="some-simple-boring-examples">Some simple, boring examples</h2>
<p>If you don’t know what a regular expression is, think of it as search (or search and replace) on steroids. If you work with a lot of text files — surprise, I do — regular expressions can make your life a lot easier, but they also have a nasty habit of turning into byzantine symbol salad that’s impossible to decipher. Allow me to demonstrate. Or maybe skip ahead to the next section, this one here is just a slow introductory build-up to the interesting stuff.</p>
<p>Suppose you want to change every instance of <em>regular expression</em> to the shorter <em>regex</em>. If you’re like me, you will use <code>sed</code> for this, the <strong>s</strong>tream <strong>ed</strong>itor. Here’s the relevant command.</p>
<div class="sourceCode" id="cb1"><pre class="sourceCode bash"><code class="sourceCode bash"><a class="sourceLine" id="cb1-1" title="1"><span class="ex">s/regular</span> expression/regex/g</a></code></pre></div>
<p>Okay, that’s not exactly user-friendly in these days of GUIs and colorful buttons to click on, but it’s manageable. The command breaks down into a few simple components.</p>
<ol type="1">
<li><code>s</code>: substitute</li>
<li><code>/</code>: argument separator</li>
<li><code>regular expression</code>: anything matching this regular expression should be replaced</li>
<li><code>/</code>: argument separator</li>
<li><code>regex</code>: replace the match with this string instead</li>
<li><code>/</code>: argument separator</li>
<li><code>g</code>: do a global replace; that is to say, process the whole line, don’t just stop after the first match on the line</li>
</ol>
<p>Suppose we have the input text below.</p>
<ol class="example" type="1">
<li>A Note on Regular Expressions: Since “regular expression” is a long term, regular expressions are also called regexes.</li>
</ol>
<p>This will be rewritten as follows.</p>
<ol start="2" class="example" type="1">
<li>A Note on Regular Expressions: Since “regex” is a long term, regexs are also called regexes.</li>
</ol>
<p>Note that in either case the rewriting targets every instance of <em>regular expression</em>, even if it is followed by other characters like <em>s</em>. But without <code>g</code>, only the first instance of <em>regular expression</em> would have been replaced.</p>
<ol start="3" class="example" type="1">
<li>A Note on Regular Expressions: Since “regex” is a long term, regular expressions are also called regexes.</li>
</ol>
<p>As you can see in the examples above, capitalization matters, so by default <code>regular expression</code> does not match <code>Regular Expression</code>. We can fix that by specifying alternatives (there’s better ways for handling upper and lower case, but then I wouldn’t get to demonstrate alternatives).</p>
<div class="sourceCode" id="cb2"><pre class="sourceCode bash"><code class="sourceCode bash"><a class="sourceLine" id="cb2-1" title="1"><span class="ex">s</span>/[<span class="ex">Rr</span>]egular [Ee]xpression/regex/g</a></code></pre></div>
<p>Here <code>[Rr]egular [Ee]xpression</code> will match <em>Regular Expression</em>, <em>Regular expression</em>, <em>regular Expression</em>, and <em>regular expression</em>. So now we would get this output.</p>
<ol start="4" class="example" type="1">
<li>A Note on regexs: Since “regex” is a long term, regexs are also called regexes.</li>
</ol>
<p>But these instances of <em>regexs</em> are pretty ugly. Let’s extend the match so that it can include an optional <em>s</em>.</p>
<div class="sourceCode" id="cb3"><pre class="sourceCode bash"><code class="sourceCode bash"><a class="sourceLine" id="cb3-1" title="1"><span class="ex">s</span>/[<span class="ex">Rr</span>]egular [Ee]xpressions\?/regex/g</a></code></pre></div>
<p>We use <code>\?</code> to indicate that the preceding character is optional for the match. If there is an <em>s</em>, include it in the rewriting; otherwise ignore whatever comes after the <em>n</em>. With this, we get yet another output.</p>
<ol start="5" class="example" type="1">
<li>A Note on regex: Since “regex” is a long term, regex are also called regexes.</li>
</ol>
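<p>For comparison (not part of the <code>sed</code> workflow above), here is the same substitution in Python’s <code>re</code> module, where <code>?</code> likewise makes the preceding character optional, no backslash required:</p>

```python
import re

text = ('A Note on Regular Expressions: Since "regular expression" is a '
        'long term, regular expressions are also called regexes.')

# [Rr]/[Ee] handle capitalization, s? makes the plural s optional.
# re.sub replaces every match, like sed's g flag.
print(re.sub(r"[Rr]egular [Ee]xpressions?", "regex", text))
# A Note on regex: Since "regex" is a long term, regex are also called regexes.
```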
<p>We could play this game for a few more rounds, but I think you get the gist. Now let’s look at how quickly regular expressions can get nasty.</p>
<h2 id="cranking-up-the-weird">Cranking up the weird</h2>
<p>Things have been perfectly reasonable so far. Just to mix things up a bit, here’s a regular expression I use a lot to rewrite things like <code>**foo**</code> as <code><b>foo</b></code> (don’t ask why I need to do that, it’s a quick hack while the long-term solution is still being worked on).</p>
<div class="sourceCode" id="cb4"><pre class="sourceCode bash"><code class="sourceCode bash"><a class="sourceLine" id="cb4-1" title="1"><span class="ex">s</span>/<span class="dt">\*\*\(</span>[^<span class="ex">*</span>]*<span class="dt">\)\*\*</span>/<span class="op"><</span>b<span class="op">></span>\<span class="op">1<</span>\/b<span class="op">></span>/g</a></code></pre></div>
<p>If you’re curious, here’s how you read that regex:<a href="#fn1" class="footnote-ref" id="fnref1"><sup>1</sup></a></p>
<ol type="1">
<li><code>s</code>: substitute</li>
<li><code>/</code>: argument separator</li>
<li><code>\*\*</code>: match **</li>
<li><code>\(</code>: start matching group 1</li>
<li><code>[^*]*</code>: match any string of 0 or more characters that are not *</li>
<li><code>\)</code>: close matching group 1</li>
<li><code>\*\*</code>: match **</li>
<li><code>/</code>: argument separator</li>
<li><code><b></code>: insert <b></li>
<li><code>\1</code>: insert the content of matching group 1</li>
<li><code><\/b></code>: insert </b></li>
<li><code>/</code>: argument separator</li>
<li><code>g</code>: do a global replace</li>
</ol>
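<p>For what it’s worth, the same rewrite is less escape-heavy in Python’s regex flavor, where plain parentheses form the matching group (again just an aside, not the original workflow):</p>

```python
import re

def bold_to_html(line):
    # \*\* matches a literal **; ([^*]*) captures everything up to the
    # next *; \1 in the replacement reinserts the captured group.
    return re.sub(r"\*\*([^*]*)\*\*", r"<b>\1</b>", line)

print(bold_to_html("this is **important** stuff"))
# this is <b>important</b> stuff
```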
<p>It actually makes a lot of sense if you come up with it yourself and remind yourself every 5 minutes how it works. In all other cases, it’s a Lovecraftian nightmare that will drive you mad.</p>
<p>But this is just the tip of the iceberg. True regex wizards can do stuff that is so crazy it tears apart the fabric of reality. Did you ever wonder how you can match lines of the form <code>a b c</code> such that <code>a + b = c</code>? Well, <a href="http://www.drregex.com/2018/11/how-to-match-b-c-where-abc-beast-reborn.html?m=1">somebody wrote a <code>sed</code> program for that</a>, because why wouldn’t they? Here’s a part of the very first command:</p>
<div class="sourceCode" id="cb5"><pre class="sourceCode bash"><code class="sourceCode bash"><a class="sourceLine" id="cb5-1" title="1"><span class="kw">(</span><span class="ex">?</span>=[-+]?(?:0\B<span class="kw">)</span><span class="ex">*+</span>(\d*?)<span class="kw">((</span>?:(?=\d+(?:\.\d+)?<span class="dt">\ </span>[-+]?(?:0\B)*+(\d*?)(\d(?(4)\4))(?:\.\d+)?<span class="dt">\ </span>[-+]?(?:0\B)*+(\d*?)(\d(?(6)\6))(?:\.\d+)?$)\d)++)\b)</a></code></pre></div>
<p>Don’t look at me, I have no clue what’s going on here. But it works, somehow. If you want to figure it out, it might help to use <a href="https://github.com/SoptikHa2/desed/">desed</a>, a debugger for <code>sed</code>. If you give it a try, please also try this <a href="https://tildes.net/~comp/b2k/programming_challenge_find_path_from_city_a_to_city_b_with_least_traffic_controls_inbetween#comment-2run"><code>sed</code>-based solution for a shortest path problem</a>. I’d really appreciate an in-depth explanation.</p>
<p>Regular expressions weren’t designed to handle any of that stuff. But somebody with way too much time on their hands hunkered down and pushed them to their limits. It’s insane, but it works. And that’s the point that gets me to the bit of linguistics I need as an excuse for showing off some cool regex stuff.</p>
<h2 id="regexes-in-linguistics">Regexes in linguistics</h2>
<p>The regex examples above show that there is a big difference between what a system can do and what a system can do in a manner that’s easily digestible for a human reader. And that distinction is too often glossed over in linguistics. The literature is full of claims of the form “proposal X cannot account for phenomenon Y”. And very often, that’s not true, just like it isn’t true that you can’t use regular expressions to calculate a shortest path. For instance, you don’t need copy movement to produce pronounced copies, but oh boy will the grammar look weird. What these claims actually mean is “proposal X cannot elegantly account for phenomenon Y”. And that’s a big difference.</p>
<p>Elegance is a tricky criterion. For one thing, elegance depends a lot on the specification language. What may look clunky as a (standard) regular expression may be very elegant as a formula of monadic second-order logic, even though the two are intertranslatable. And the elegance of a specification language depends a lot on how much has been abstracted away. Specification X may be better than Y as long as you only have to account for phenomena P, Q, and R, but throw in S and T and all of a sudden Y scales much better and wins. It’s all very fuzzy, very tentative, mostly based on hunches, personal taste, aesthetics.</p>
<p>That’s okay. In general, researchers should do whatever makes them more productive, and in general it is the case that elegance = simplicity = productivity. But we should acknowledge that this is a methodological criterion. Lack of elegance is not a knockout argument and does not tell us much about the cognitive reality of a proposal. Reality might in fact be messy. Even though that <code>a + b = c</code> program is just a one-liner in Python, the human brain might actually be using the <a href="http://www.drregex.com/2018/11/how-to-match-b-c-where-abc-beast-reborn.html?m=1">humongous <code>sed</code> clusterfuck</a>. That doesn’t mean our theories have to be ugly — there’s nothing wrong with being better than reality — but we should be much more cautious with the use of elegance criteria in theory comparison.</p>
<p>And if you think learning considerations provide a natural push towards elegance, may I introduce you to <a href="https://github.com/MaLeLabTs/RegexGenerator">this lovely regex generator</a> that infers the intended regex from a data sample? Yes, I only brought up learning so that I could link to that.</p>
<section class="footnotes">
<hr />
<ol>
<li id="fn1"><p>The second part isn’t a regular expression in the original sense of formal language theory because it uses backreferences, which require unbounded copying and simply aren’t regular. For the specific rewriting step I’m doing there, it would be trivial to specify a finite-state transducer, though. And the existence of backreferences is an interesting point in its own right: Even though every regex (in the formal language theory sense) can be converted to an equivalent deterministic finite-state automaton, most regex implementations actually use a context-free parsing mechanism — and once you have that, backreferences are an easy addition. Sometimes, a powerful thing can be more efficient than a very restricted thing.<a href="#fnref1" class="footnote-back">↩</a></p></li>
</ol>
</section>
<h1 id="against-math-when-sets-are-a-bad-setup">Against math: When sets are a bad setup</h1>
<p>2020-04-06, by Thomas Graf</p>
<p>Last time I gave you a piece of my mind when it comes to <a href="https://outde.xyz/2020-03-30/against-math-kuratowskis-spectre.html">the Kuratowski definition of pairs and ordered sets</a>, and why we should stay away from it in linguistics. The thing is, that was a conceptual argument, and those tend to fall flat with most researchers. Just like most mathematicians weren’t particularly fazed by Gödel’s incompleteness results because it didn’t impact their daily work, the average researcher doesn’t care about some impurities in their approach as long as it gets the job done. So this post will discuss a concrete case where a good linguistic insight got buried under mathematical rubble. </p>
<p>Our case study is a <a href="https://doi.org/10.1353/lan.2018.0037">2018 paper</a> by <a href="http://departament-filcat-linguistica.ub.edu/directori-organitzatiu/jordi-fortuny-andreu">Jordi Fortuny</a>, which refines the ideas first presented in <span class="citation" data-cites="Fortuny08">Fortuny (2008)</span> and <span class="citation" data-cites="FortunyCorominasMurtra09">Fortuny and Corominas-Murtra (2009)</span>. The paper wrestles with one of the foundational issues of syntax: the interplay of structure and linear order, and why the latter seems to play second fiddle at best in syntax. Let’s first reflect a bit on the nature of this problem before we look at Fortuny’s proposed answer.</p>
<h2 id="structure-vs-linear-order">Structure VS linear order</h2>
<p>The primacy of structure is pretty much old hat to linguists. You’ve all seen the standard auxiliary fronting example before:</p>
<ol class="example" type="1">
<li>The man who is talking is tall.</li>
<li>Is the man who is talking _ tall?</li>
<li>*Is the man who _ talking is tall?</li>
</ol>
<p>Why is there no language that defines auxiliary fronting in terms of linear precedence such that the leftmost — or alternatively the rightmost — auxiliary in the sentence is fronted? Quite generally, why doesn’t syntax allow constraints that are based entirely on linear order, such as:</p>
<ol type="1">
<li><strong>Sentence-final decasing</strong><br />
Don’t display case if you are the last word in the sentence.</li>
<li><strong>RHOL subject placement</strong><br />
The subject of a clause <em>C</em> is the rightmost DP with at least two lexical items. If no such DP exists in <em>C</em>, the subject is the leftmost DP instead.</li>
<li><strong>Linear movement blocking</strong><br />
No adjunct may linearly intervene between a mover and its trace.</li>
<li><strong>Modulo binding</strong><br />
Every reflexive must be an even number of words away from the left edge of the sentence.</li>
</ol>
<p>That’s exactly the kind of question that keeps me up at night. As you know, I like the idea that syntax and phonology are actually very similar at a computational level, so the non-existence of the constraints above is problematic because they are all modeled after phenomena from the phonological literature. How can we explain the absence of those constraints?</p>
<p>There are two answers here, and neither one is satisfying. We might reject the initial assumption that linear order doesn’t matter in syntax. That’s basically <a href="https://udel.edu/~bruening/">Benjamin Bruening</a>’s <a href="https://www.linguisticsociety.org/sites/default/files/342-388_0.pdf">story for binding</a> <span class="citation" data-cites="Bruening14">(Bruening 2014)</span>. I have a weak spot for contrarian takes, but the Bruening stance doesn’t explain why we still can’t get constraints like the ones listed above. Perhaps Bruening is right and linear order matters to some extent, but if so, that extent seems to be far more limited than one might imagine.</p>
<p>This leaves us with option 2, which is the standard story: syntactic representations have no linear order, hence syntax can’t make reference to linear order. The idea goes back to <span class="citation" data-cites="Kayne94">Kayne (1994)</span> and is also a major reason for the use of sets in Bare Phrase Structure <span class="citation" data-cites="Chomsky95">(Chomsky 1995)</span>. But it simply doesn’t work because syntax is inherently asymmetric. And this is where <span class="citation" data-cites="Fortuny18">Fortuny (2018)</span> enters the picture.</p>
<h2 id="order-from-computation">Order from computation</h2>
<p>I take Fortuny’s basic point to be one that I’m very sympathetic to: linear order emerges naturally from the asymmetry that is implicit in syntactic computation. Hence it is hopeless to stipulate linear order out of syntax; the best one can do is to systematically bound its role within syntax.</p>
<p>Here’s my take on this, which I think is close in spirit to what Fortuny is driving at, but without any reliance on sets. A (strict) linear order arises when you have a relation that is transitive, irreflexive, asymmetric, and trichotomous:</p>
<ul>
<li><strong>transitive</strong>: whatever can be reached in <span class="math inline">\(n\)</span> steps can be reached in one step<br />
(<span class="math inline">\(x \mathrel{R} y\)</span> and <span class="math inline">\(y \mathrel{R} z\)</span> implies <span class="math inline">\(x \mathrel{R} z\)</span>)</li>
<li><strong>irreflexive</strong>: nothing is related to itself<br />
(<span class="math inline">\(x \not\mathrel{R} x\)</span> for all <span class="math inline">\(x\)</span>)</li>
<li><strong>asymmetric</strong>: no two elements are mutually related<br />
(<span class="math inline">\(x \mathrel{R} y \rightarrow y \not\mathrel{R} x\)</span>)</li>
<li><strong>trichotomous</strong>: no two elements are unrelated<br />
(<span class="math inline">\(x \mathrel{R} y \vee y \mathrel{R} x \vee x = y\)</span>)</li>
</ul>
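To make those four properties concrete, here is a quick brute-force check in Python (entirely my own illustration, not anything from Fortuny’s paper): a finite relation, given as a set of pairs, is tested for all four properties at once.

```python
from itertools import product

def is_strict_linear_order(domain, R):
    """Check the four properties for a finite relation R over `domain`,
    with R given as a set of (x, y) pairs."""
    transitive = all((x, z) in R
                     for (x, y1), (y2, z) in product(R, repeat=2) if y1 == y2)
    irreflexive = all((x, x) not in R for x in domain)
    asymmetric = all((y, x) not in R for (x, y) in R)
    trichotomous = all((x, y) in R or (y, x) in R or x == y
                       for x, y in product(domain, repeat=2))
    return transitive and irreflexive and asymmetric and trichotomous

domain = {1, 2, 3}
less_than = {(x, y) for x, y in product(domain, repeat=2) if x < y}
print(is_strict_linear_order(domain, less_than))         # True
print(is_strict_linear_order(domain, {(1, 2), (2, 3)}))  # False: (1, 3) missing
```

The second call fails both transitivity and trichotomy, which is exactly the point: dropping a single pair already destroys the linear order.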
<p>If you look at syntax in terms of binary Merge (or something <a href="https://outde.xyz/2019-05-15/underappreciated-arguments-the-inverted-t-model.html">slightly</a> <a href="https://outde.xyz/2019-09-18/the-subregular-complexity-of-merge-and-move.html">more</a> <a href="https://outde.xyz/2020-03-06/trees-for-free-with-tree-free-syntax.html">abstract</a>), you already have an order that satisfies three of those properties: transitivity, irreflexivity, and asymmetry. That’s the (strict) partial order we usually call <strong>proper dominance</strong>, but you can also think of it as <strong>derivational prominence</strong> or some other, more abstract concept. Not really the point here. Either way you are already dealing with something that’s inherently asymmetric and ordered, and this asymmetry can be inherited by any relation that piggybacks on this. This recourse to dominance is exactly how linear order (<span class="math inline">\(\prec\)</span>) is usually defined in formal terms, where <span class="math inline">\(\triangleleft^*\)</span> is reflexive dominance, i.e. proper dominance (<span class="math inline">\(\triangleleft^+\)</span>) extended so that every node also dominates itself: <span class="math display">\[
x \prec y \Leftrightarrow
\exists z_x, z_y [
z_x \triangleleft^* x
\wedge
z_y \triangleleft^* y
\wedge
z_x \mathrel{S} z_y
]
\]</span> This says that precedence is inherited via dominance. If <span class="math inline">\(x\)</span> and <span class="math inline">\(y\)</span> are dominated by <span class="math inline">\(z_x\)</span> and <span class="math inline">\(z_y\)</span>, respectively (with <span class="math inline">\(z_x = x\)</span> and <span class="math inline">\(z_y = y\)</span> as limiting cases), and <span class="math inline">\(z_x\)</span> precedes <span class="math inline">\(z_y\)</span>, then <span class="math inline">\(x\)</span> precedes <span class="math inline">\(y\)</span>. But hold on a second, that’s circular: we’re defining precedence in terms of precedence. And if you look at the formula, there’s actually a completely different symbol there, <span class="math inline">\(S\)</span>, which isn’t the precedence symbol <span class="math inline">\(\prec\)</span>. So what’s <span class="math inline">\(S\)</span>? It’s the successor relation, and in contrast to precedence it’s only defined for siblings. So <span class="math inline">\(x \mathrel{S} y\)</span> is true iff <span class="math inline">\(y\)</span> is both a sibling of <span class="math inline">\(x\)</span> and its successor. Aha, so this is where it all breaks down: syntax doesn’t actually have such a successor relation because there is no linear order in syntax!</p>
<p>Nope, sorry, you can’t get off this particular hook that easily. You see, <span class="math inline">\(S\)</span> doesn’t actually need to tell us anything about linear order. The term successor applies to any asymmetric order. So <span class="math inline">\(S\)</span> just needs to be some relation that establishes an asymmetry between <span class="math inline">\(x\)</span> and <span class="math inline">\(y\)</span>. And there is a relation in syntax that does this for us: the head-argument relation. Merge tends to be presented as a symmetric operation, but it’s not. One of the guys is more important because it has a bigger influence on the behavior of the newly formed constituent. That’s the head. Instead of successor, you may interpret <span class="math inline">\(S\)</span> as some kind of <strong>superior</strong> relation, and the formula above will still work fine.</p>
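To see that this recipe really does grind out a linear order, here is a small Python sketch of my own: a toy derivation for Merge(Merge(a, b), c), with proper dominance and a stipulated superior relation (which element counts as superior is my own illustrative choice, not a claim from Fortuny’s paper). Precedence falls out by letting the witnesses range over a node and its dominators.

```python
from itertools import product

# Toy derivation for Merge(Merge(a, b), c): node "ab" stands for {a, b},
# "root" for {ab, c}. The dominance pairs and the superior relation S are
# my own illustrative stipulations (the head is superior to its sister).
nodes = {"root", "ab", "c", "a", "b"}
dominates = {("root", "ab"), ("root", "c"), ("root", "a"), ("root", "b"),
             ("ab", "a"), ("ab", "b")}          # proper dominance, closed
superior = {("a", "b"), ("ab", "c")}            # S: a over b, ab over c

def dominates_or_equal(z, w):
    return z == w or (z, w) in dominates

def precedes(x, y):
    """x precedes y iff something (reflexively) dominating x is
    S-related to something (reflexively) dominating y."""
    return any((zx, zy) in superior
               for zx, zy in product(nodes, repeat=2)
               if dominates_or_equal(zx, x) and dominates_or_equal(zy, y))

leaves = ["a", "b", "c"]
order = sorted(leaves, key=lambda x: sum(precedes(y, x) for y in leaves))
print(order)   # ['a', 'b', 'c']
```

No string order was ever stipulated; the total order over the leaves is squeezed out of dominance plus the head-sister asymmetry alone.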
<p>What this shows us is that linear order cannot be simply stipulated away. Syntax furnishes all the asymmetries that make up linear order, and thus a computation device that can keep track of these asymmetries is perfectly aware of linear order. The problem, then, has to be with the computational complexity of determining linear order from those asymmetries. That is to say, the formula for <span class="math inline">\(\prec\)</span> above is too hard to compute. Something about inheritance via proper dominance is beyond the computational means of syntax. If so, this would dovetail quite nicely with my pet idea that syntax overall has very low subregular complexity. For instance, I’ve argued together with <a href="https://aniellodesanto.github.io/">Aniello De Santo</a> that <a href="https://www.aclweb.org/anthology/W19-5702.pdf">sensing tree automata furnish an upper bound for syntax</a>, and these automata are indeed incapable of fully tracking linear order. So, yes, sign me up for the idea that linear order constraints don’t show up in syntax because linear order is too hard to compute from the combination of proper dominance and head-argument asymmetries. But that’s very different from the standard story that syntax lacks linear order because its representations don’t directly encode linear order.</p>
<p>Fortuny provides a technically different answer, but the core idea is very similar in nature to the story I just sketched. He first shows that syntax naturally produces linear orders, and then tries to explain why the impact of that is so limited. But he does it with sets, and that opens up a whole can of worms.</p>
<h2 id="fortunys-approach-in-detail">Fortuny’s approach in detail</h2>
<p>Fortuny starts out with a generalization of the Kuratowski definition from pairs to tuples (btw, upon rereading his paper I noticed that he actually cites <span class="citation" data-cites="Kuratowski21">Kuratowski (1921)</span>; kudos!). With this generalized definition, an <span class="math inline">\(n\)</span>-tuple <span class="math inline">\(\langle a_1, \ldots, a_n \rangle\)</span> is encoded as the <strong>nest</strong> <span class="math display">\[
\{ \{a_1\}, \{a_1, a_2\}, \{a_1, a_2, a_3\}, \ldots, \{a_1, a_2, a_3, \ldots, a_n\} \}
\]</span> Based on earlier work <span class="citation" data-cites="Fortuny08 FortunyCorominasMurtra09">(Fortuny 2008; Fortuny and Corominas-Murtra 2009)</span>, he then postulates that syntactic derivations produce sets of this form, rather than the standard bare phrase structure sets. Think of it as follows: suppose that the syntactic workspace currently contains only <span class="math inline">\(a\)</span>, which by itself forms the constituent <span class="math inline">\(\{ a \}\)</span>. Now if we Merge some <span class="math inline">\(b\)</span> into this constituent, we get <span class="math inline">\(\{ a, b \}\)</span> as the output. Fortuny then says that the actual constituent is the set of the sets built by Merge, i.e. <span class="math inline">\(\{ \{a\}, \{a,b\} \}\)</span>. Personally, I’d say it makes more sense to think of this as a derivation rather than a constituent, but my infatuation with derivation trees should be well-known by now. I won’t quibble over terminology and just follow Fortuny in calling these complex sets constituents. So we have an <strong>output</strong> <span class="math inline">\(\{a,b\}\)</span>, but a <strong>constituent</strong> <span class="math inline">\(\{ \{a\}, \{a,b\} \}\)</span>. If we merge a <span class="math inline">\(c\)</span> into the current output, we get <span class="math inline">\(\{ a, b, c\}\)</span>, and the constituent grows to <span class="math inline">\(\{ \{a\}, \{a,b\}, \{a,b,c\} \}\)</span>. In a nutshell, Merge just inserts a lexical item into a set, and the nested structure arises if we collect all the outputs into a single set, which Fortuny calls a constituent.</p>
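Fortuny’s Merge-as-set-insertion is easy to mimic with Python frozensets, whose equality behaves exactly like that of naive sets. This is a sketch of my own; the function and variable names are mine, not Fortuny’s.

```python
def merge(item, output, constituent):
    """Fortuny-style Merge: insert `item` into the current output and
    record the new output in the constituent (the nest of all outputs)."""
    new_output = output | {item}
    return new_output, constituent | {new_output}

# The workspace starts out containing only a.
output = frozenset({"a"})
constituent = frozenset({output})
output, constituent = merge("b", output, constituent)
output, constituent = merge("c", output, constituent)
print(sorted(sorted(s) for s in constituent))
# [['a'], ['a', 'b'], ['a', 'b', 'c']]
```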
<p>But that’s also where we run into the first problem. Well, two problems. Three, actually. First, <a href="https://outde.xyz/2020-03-30/against-math-kuratowskis-spectre.html">and at the risk of repeating myself</a>, this kind of definition only works for specific axiomatizations of sets, and that’s a lot of baggage to attach to your linguistic proposal. Second, redefining Merge in that way means that large parts of the audience will immediately check out. A major deviation from established machinery is always a tough sell, so you should avoid that if you can. And then there’s still problem three, which in a sense is the worst because it brings with it a rats tail of other problems.</p>
<p>You see, the set-theoretic representation of tuples used by Fortuny doesn’t work in full generality. Consider the following tuples and their set-theoretic representation as nests:</p>
<ol type="1">
<li><span class="math inline">\(\langle a, b \rangle = \{ \{a\}, \{a,b\} \}\)</span></li>
<li><span class="math inline">\(\langle a, b, a \rangle = \{ \{a\}, \{a,b\}, \{a,b,a\} \} = \{ \{a\}, \{a,b\}, \{a,b\} \} = \{ \{a\}, \{a,b\} \}\)</span></li>
<li><span class="math inline">\(\langle a, b, b \rangle = \{ \{a\}, \{a,b\}, \{a,b,b\} \} = \{ \{a\}, \{a,b\}, \{a,b\} \} = \{ \{a\}, \{a,b\} \}\)</span></li>
</ol>
<p>As you can see, three distinct tuples all end up with the same set-theoretic encoding. That’s not good. This means that if your syntax outputs <span class="math inline">\(\{ \{a\}, \{a,b\} \}\)</span>, you don’t actually know if it gave you the tuple <span class="math inline">\(\langle a, b \rangle\)</span>, <span class="math inline">\(\langle a, b, a \rangle\)</span>, or <span class="math inline">\(\langle a, b, b \rangle\)</span>. If your encoding can’t keep things distinct that should be distinct, it’s not a great encoding.</p>
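You can watch the collapse happen with frozensets, whose equality ignores repetitions exactly like naive sets (again a sketch of my own):

```python
def nest(*elements):
    """Kuratowski-style nest: the set of all initial segments of a tuple."""
    return frozenset(frozenset(elements[:i + 1]) for i in range(len(elements)))

ab  = nest("a", "b")
aba = nest("a", "b", "a")
abb = nest("a", "b", "b")
print(ab == aba == abb)   # True: three distinct tuples, a single encoding
print(len(aba))           # 2, not 3: the segment {a, b, a} collapses to {a, b}
```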
<p>Fortuny is aware of that, and he has a workaround. Since the problem only arises for tuples that contain identical elements, one has to ensure that there are no identical elements. To this end, he subscripts each entry with the derivational step at which it was introduced. Here’s what this would look like for the counterexamples above:</p>
<ol type="1">
<li><span class="math inline">\(\{ \{a_1\}, \{a_1,b_2\} \} = \langle a_1, b_2 \rangle\)</span></li>
<li><span class="math inline">\(\{ \{a_1\}, \{a_1,b_2\}, \{a_1,b_2,a_3\} \} = \langle a_1, b_2, a_3 \rangle\)</span></li>
<li><span class="math inline">\(\{ \{a_1\}, \{a_1,b_2\}, \{a_1,b_2,b_3\} \} = \langle a_1, b_2, b_3 \rangle\)</span></li>
</ol>
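Rendering the subscripts as (symbol, step) pairs (my own encoding of Fortuny’s indices, not his notation), the three nests really do come apart again:

```python
def nest(*elements):
    """Kuratowski-style nest: the set of all initial segments of a tuple."""
    return frozenset(frozenset(elements[:i + 1]) for i in range(len(elements)))

def subscripted(*symbols):
    """Pair every symbol with its derivational step, mimicking the subscripts."""
    return tuple((s, i + 1) for i, s in enumerate(symbols))

aba = nest(*subscripted("a", "b", "a"))   # encodes <a1, b2, a3>
abb = nest(*subscripted("a", "b", "b"))   # encodes <a1, b2, b3>
print(aba == abb)    # False: the two triples are kept apart now
print(len(aba))      # 3: one member per derivational step
```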
<p>Alright, that fixes the math problem, but it creates even bigger problems — I told you it’s a rat’s tail. Now Fortuny has added subscripts, and he has to allow for arbitrarily many of them. From a computational perspective, that’s not that great. At best it’s clunky, at worst it creates serious issues with computational power. And from a linguistic perspective, it violates the Inclusiveness condition <span class="citation" data-cites="Chomsky95a">(Chomsky 1995)</span>, according to which syntax does not enrich lexical items with any mark-up, diacritics, or other encodings of non-lexical information. I certainly am not gonna lose any sleep over somebody’s proposal violating the Inclusiveness condition, but I’d wager that this attitude isn’t shared by the main audience for a pure theory paper on Merge and linearization. The set-based view has forced Fortuny into a formalization that makes his argument, which ultimately doesn’t hinge on sets, less attractive to his target audience.</p>
<p>That said, let’s assume you’re willing to accept all those modifications and look at the payoff. You now have a system where linear order is baked directly into syntax. But Fortuny still has to tell us why linear order nonetheless doesn’t seem to matter all that much in syntax. He relates this to a crucial limitation of Merge. As you might have noticed, the nesting system gets a bit more complicated when you try to merge a complex specifier. Suppose you have already built the complex specifier <span class="math inline">\(\{d, e\}\)</span>; I omit subscripts because the notation is cluttered enough as is. Suppose furthermore that <span class="math inline">\(\{d, e\}\)</span> belongs to the constituent <span class="math inline">\(\{ \{d\}, \{d, e\} \}\)</span>. Let’s try to merge <span class="math inline">\(\{d, e\}\)</span> into <span class="math inline">\(\{ a, b, c \}\)</span>, which is part of the constituent <span class="math inline">\(\{ \{a\}, \{a,b\}, \{a,b,c\} \}\)</span>. What should be the output? Fortuny says that the whole constituent <span class="math inline">\(\{ \{d\}, \{d, e\} \}\)</span> is merged with the previous output <span class="math inline">\(\{a,b,c\}\)</span>, yielding the new output <span class="math inline">\(\{ a,b,c, \{ \{d\}, \{d,e\}\}\}\)</span>. Adding this to the previous constituent <span class="math inline">\(\{ \{a\}, \{a,b\}, \{a,b,c\} \}\)</span>, we get the new constituent <span class="math display">\[
\{ \{a\}, \{a,b\}, \{a,b,c\}, \{ a,b,c, \{\{d\}, \{d,e\}\} \} \}
\]</span> Not the most readable, but internally consistent.</p>
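The complex-specifier case can be replayed in the same style; here is a frozenset rendering of my own (the variable names are mine, and the step-by-step bookkeeping follows my reading of Fortuny’s construction):

```python
fs = frozenset

# The complex specifier's constituent {{d}, {d, e}} ...
spec_constituent = fs({fs({"d"}), fs({"d", "e"})})

# ... and the host's previous output {a, b, c} with its constituent.
output = fs({"a", "b", "c"})
constituent = fs({fs({"a"}), fs({"a", "b"}), output})

# Merge the whole specifier constituent with the previous output,
# then add the result to the growing constituent.
new_output = output | {spec_constituent}
new_constituent = constituent | {new_output}
print(len(new_constituent))   # 4: {a}, {a,b}, {a,b,c}, {a,b,c,{{d},{d,e}}}
```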
<p>Fortuny then observes that in general we do not want to allow movement from such subconstituents — the Specifier Island Constraint and the Adjunct Island Constraint strike again. Under the assumption that Move is just a variant of Merge, he defines a single application domain for Merge that does not allow this operation to target any proper part of the subconstituent <span class="math inline">\(\{\{d\}, \{d,e\}\}\)</span>. But if you take that as a general constraint on syntax, it also means that syntax cannot directly relate <span class="math inline">\(d\)</span> and <span class="math inline">\(e\)</span> to <span class="math inline">\(a\)</span>, <span class="math inline">\(b\)</span>, or <span class="math inline">\(c\)</span>. Consequently, syntax cannot define a linear order over all of <span class="math inline">\(a\)</span>, <span class="math inline">\(b\)</span>, <span class="math inline">\(c\)</span>, <span class="math inline">\(d\)</span>, and <span class="math inline">\(e\)</span>. And that’s why linear order in syntax has a very limited role, even though linear order is directly baked into syntax.</p>
<h2 id="did-the-sets-help">Did the sets help?</h2>
<p>Alright, time to take stock. If we compare Fortuny’s set-theoretic operations to the more high-level story I presented above, do the sets actually illuminate anything? I don’t think so.</p>
<p>You don’t need nests to establish that the syntactic computation naturally furnishes all the asymmetries that are needed to establish linear order. Actually, nests muddle this point because they force you into dealing with occurrences, subscripts, the Inclusiveness condition, and all that other stuff that’s completely orthogonal to the core issue. Nor do sets really help us understand why the role of linear order is limited. Fortuny stipulates a specific notion of domain based on empirical observations about Move, but that’s completely independent of sets as it’s a generalized version of the Specifier and Adjunct Island constraints. And those are all just more specific instances of the assumption that sensing tree automata are a computational upper bound on syntactic expressivity. I’d also say that Fortuny’s set-based definition of domain is much harder to make sense of than sensing tree automata. Overall, the set-based presentation is a handicap for the paper, not a boon.</p>
<p>It’s unfortunate, because Fortuny’s big picture points are right on the money imho. But they’re buried under the mathematical clutter of sets, sets, and more sets.</p>
<h2 id="references" class="unnumbered">References</h2>
<div id="refs" class="references">
<div id="ref-Bruening14">
<p>Bruening, Benjamin. 2014. Precede-and-command revisited. <em>Language</em> 90.342–388. doi:<a href="https://doi.org/10.1353/lan.2014.0037">10.1353/lan.2014.0037</a>.</p>
</div>
<div id="ref-Chomsky95">
<p>Chomsky, Noam. 1995. Bare phrase structure. <em>Government and binding theory and the Minimalist program</em>, ed. by Gert Webelhuth, 383–440. Oxford: Blackwell.</p>
</div>
<div id="ref-Chomsky95a">
<p>Chomsky, Noam. 1995. Categories and transformations. <em>The Minimalist program</em>, 219–394. Cambridge, MA: MIT Press. doi:<a href="https://doi.org/10.7551/mitpress/9780262527347.003.0004">10.7551/mitpress/9780262527347.003.0004</a>.</p>
</div>
<div id="ref-Fortuny08">
<p>Fortuny, Jordi. 2008. <em>The emergence of order in syntax</em>. Amsterdam: John Benjamins.</p>
</div>
<div id="ref-Fortuny18">
<p>Fortuny, Jordi. 2018. Structure dependence and linear order: Clarifications and foundations. <em>Language</em> 94.611–628. doi:<a href="https://doi.org/10.1353/lan.2018.0037">10.1353/lan.2018.0037</a>.</p>
</div>
<div id="ref-FortunyCorominasMurtra09">
<p>Fortuny, Jordi, and Bernat Corominas-Murtra. 2009. Some formal considerations on the generation of hierarchically structured expression. <em>Catalan Journal of Linguistics</em> 8.99–111. <a href="https://www.raco.cat/index.php/CatalanJournal/article/view/168906/221175">https://www.raco.cat/index.php/CatalanJournal/article/view/168906/221175</a>.</p>
</div>
<div id="ref-Kayne94">
<p>Kayne, Richard S. 1994. <em>The antisymmetry of syntax</em>. Cambridge, MA: MIT Press.</p>
</div>
<div id="ref-Kuratowski21">
<p>Kuratowski, Kazimierz. 1921. Sur la notion de l’ordre dans la théorie des ensembles. <em>Fundamenta Mathematicae</em> 2.161–171.</p>
</div>
</div>
Against math: Kuratowski's spectre2020-03-30T00:00:00-04:002020-03-30T00:00:00-04:00Thomas Graftag:outde.xyz,2020-03-30:/2020-03-30/against-math-kuratowskis-spectre.html<p>As some of you might know, <a href="https://thomasgraf.net/output/graf13thesis.html">my dissertation</a> starts with a quote from <em>My Little Pony</em>. By Applejack, to be precise, the only pony that I could see myself have a beer with (and I don’t even like beer). <a href="https://youtu.be/k3NkMTV9r5U">You can watch the full clip,</a> but here’s the line that I quoted:</p>
<blockquote>
<p>Don’t you use your fancy mathematics to muddy the issue.</p>
</blockquote>
<p>Truer words have never been spoken. In light of my obvious mathematical inclinations this might come as a surprise for some of you, but I don’t like using math just for the sake of math. Mathematical formalization is only worth it if it provides novel insights. </p>
<p>Some work falls short of this bar (your call whether mine does). And some work is actively worse because of its use of math. Both things have happened and are still happening in Minimalist syntax. Ever since the publication of <em>Bare Phrase Structure</em> <span class="citation" data-cites="Chomsky95">(Chomsky 1995)</span>, there has been a line of Minimalist research that wants to formalize Merge in set-theoretic terms and derive linguistic properties from mathematical set theory. This is, for lack of a better term, ass-backwards.</p>
<p>Today’s post is the start of a two-part series. It covers the general conceptual and mathematical problems with a lot of this work. The next post will discuss a concrete example of how bringing in math can actively undermine a linguistic proposal rather than strengthening it. So, without further ado, let’s talk Kuratowski.</p>
<h2 id="kuratowski-and-the-confusion-of-sets-and-set-theory">Kuratowski and the confusion of sets and set theory</h2>
<p>Quick show of hands, who has seen this before: <span class="math display">\[\{ \{a\}, \{a, b\}\} = \langle a, b \rangle\]</span> This is the <strong>Kuratowski definition</strong> of pairs in terms of sets <span class="citation" data-cites="Kuratowski21">(Kuratowski 1921)</span>. In contrast to sets, pairs have an intrinsic order, so that <span class="math inline">\(\langle a, b \rangle \neq \langle b, a \rangle\)</span> (unless <span class="math inline">\(a = b\)</span>). Instead of <span class="math inline">\(\{ \{a\}, \{a, b\} \}\)</span> one can also use <span class="math inline">\(\{ a, \{a, b\} \}\)</span>, which is called the <strong>short Kuratowski definition</strong>.</p>
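Python’s frozensets obey exactly the order- and repetition-ignoring equality of naive sets, so the definition can be poked at directly (a sketch of my own):

```python
def kpair(a, b):
    """The (long) Kuratowski encoding of the pair <a, b>."""
    return frozenset({frozenset({a}), frozenset({a, b})})

print(frozenset({1, 2}) == frozenset({2, 1}))   # True: sets ignore order
print(kpair(1, 2) == kpair(2, 1))               # False: the encoding doesn't
print(len(kpair(1, 1)))                         # 1: <a, a> collapses to {{a}}
```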
<p>I can’t think of any other mathematical tidbit that has been invoked more often in syntax (although I have yet to find a paper that actually cites <span class="citation" data-cites="Kuratowski21">Kuratowski (1921)</span>). Minimalists like this definition because it looks very similar to the set-theoretic objects <span class="citation" data-cites="Chomsky95">Chomsky (1995)</span> uses to encode syntactic structure: Merge takes two syntactic objects <span class="math inline">\(a\)</span> and <span class="math inline">\(b\)</span> and combines them into the syntactic object <span class="math inline">\(\{ a, \{a, b\} \}\)</span>. Even though the object is a set and thus unordered, we can use the (short) Kuratowski definition to establish a connection to pairs, which are ordered. And from there we can develop all kinds of ideas about linear order in syntax. Except that we actually can’t because the (short) Kuratowski definition only holds in a specific version of set theory. It’s not a theorem about the connection between sets and linear order, it’s a particular mathematical definition of pairs that works in a particular version of mathematical set theory.</p>
<h2 id="why-does-the-kuratowski-definition-work">Why does the Kuratowski definition work?</h2>
<p>Now before we go on any further, let’s demystify the Kuratowski definition. Why is this the set-theoretic definition of pairs? First of all, it’s not <strong>the</strong> set-theoretic definition of pairs, it’s <strong>one</strong> set-theoretic definition. As always in math, there’s a million ways to set things up. Wiener’s definition represents the pair <span class="math inline">\(\langle a, b \rangle\)</span> as <span class="math inline">\(\{ \{ \{a\}, \emptyset \}, \{\{b\}\} \}\)</span>. Hausdorff uses the much more intuitive <span class="math inline">\(\{ \{a, 1\}, \{b, 2\} \}\)</span>. And there’s many other alternatives. So don’t attach too much metaphysical importance to the Kuratowski definition, it’s just a definition that happens to work because it captures a specific property of pairs.</p>
<p>Pairs are characterized by an essential equivalence: <span class="math display">\[\langle a, b \rangle = \langle c, d \rangle \text{ iff } a = c\ \&\ b = d\]</span> That’s what separates pairs from sets, where the expression <span class="math inline">\(\{ a, b \}\)</span> is the same as <span class="math inline">\(\{ b, a \}\)</span> because of the lack of order. With pairs, on the other hand, <span class="math inline">\(\langle a, b \rangle \neq \langle b , a \rangle\)</span> (unless <span class="math inline">\(a = b\)</span>, in which case we would have <span class="math inline">\(\langle a, a \rangle = \langle a, a \rangle\)</span>).</p>
<p>The Kuratowski definition works because sets of the form <span class="math inline">\(\{ \{a\}, \{a, b\} \}\)</span> satisfy the same equality condition: <span class="math display">\[\{ \{a\}, \{a,b\} \} = \{ \{c\}, \{c,d\} \} \text{ iff } a = c\ \&\ b = d\]</span> The right-to-left direction is easy to see. That is to say, if <span class="math inline">\(a = c\)</span> and <span class="math inline">\(b = d\)</span>, then it is pretty much inevitable that <span class="math inline">\(\{ \{a\}, \{a,b\} \} = \{ \{c\}, \{c,d\} \}\)</span>. It’s the left-to-right direction of the <em>iff</em> that’s tricky. In order to show that <span class="math inline">\(\{ \{a\}, \{a,b\} \} = \{ \{c\}, \{c,d\} \}\)</span> entails <span class="math inline">\(a = c\)</span> and <span class="math inline">\(b = d\)</span>, we have to consider several cases.</p>
<h3 id="case-1-a-b">Case 1: <span class="math inline">\(a = b\)</span></h3>
<p>Suppose <span class="math inline">\(a = b\)</span>. Remember that sets ignore repetitions. For instance, <span class="math inline">\(\{ a, b, c, b, a, a \} = \{a, b, c\}\)</span>. If <span class="math inline">\(a = b\)</span>, then <span class="math inline">\(\{ \{a\}, \{a,b\} \} = \{ \{a\}, \{a, a\} \} = \{ \{a\}, \{a\} \} = \{ \{a\} \}\)</span>. But then <span class="math inline">\(\{ \{a\}, \{a,b\} \} = \{ \{c\}, \{c,d\} \}\)</span> is actually <span class="math inline">\(\{ \{a\} \} = \{ \{c\}, \{c,d\} \}\)</span>. This is possible only if <span class="math inline">\(\{ c \} = \{c, d\}\)</span>, which implies <span class="math inline">\(c = d\)</span>. So we actually have <span class="math inline">\(\{ \{c\}, \{c,d\} \} = \{ \{c\}, \{c,c\} \} = \{ \{c\}, \{c\} \} = \{\{c\}\}\)</span>, and hence <span class="math inline">\(a = c\)</span>. Overall, then, we have <span class="math inline">\(a = b = c = d\)</span>, which necessarily entails <span class="math inline">\(a = c\)</span> and <span class="math inline">\(b = d\)</span>.</p>
<h3 id="case-2-a-neq-b">Case 2: <span class="math inline">\(a \neq b\)</span></h3>
<p>Now suppose <span class="math inline">\(a \neq b\)</span>. Then either <span class="math inline">\(\{a\} = \{c,d\}\)</span> or <span class="math inline">\(\{a\} = \{c\}\)</span>.</p>
<ol type="1">
<li><p>Since two sets with distinct cardinality cannot be identical, the equality <span class="math inline">\(\{a\} = \{c,d\}\)</span> holds only if <span class="math inline">\(c = d\)</span>. Then <span class="math inline">\(\{ \{a\}, \{a,b\} \} = \{ \{c\}, \{c,d\} \} = \{ \{c\}, \{c\} \}\)</span>, but <span class="math inline">\(\{c\} \neq \{a, b\}\)</span> because <span class="math inline">\(a \neq b\)</span>. This is a contradiction, so it must be the case that <span class="math inline">\(\{a\} \neq \{c,d\}\)</span>.</p></li>
<li><p>Assume, then, that <span class="math inline">\(\{a\} = \{c\} \neq \{c,d\}\)</span>. Then <span class="math inline">\(c \neq d\)</span>, and <span class="math inline">\(\{a, b\} = \{c, d\}\)</span> iff <span class="math inline">\(b = d\)</span>. Overall, then, we have <span class="math inline">\(a = c\)</span> and <span class="math inline">\(b = d\)</span>, as required.</p></li>
</ol>
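If you don’t trust the case analysis, the characteristic property can also be confirmed by brute force over a small domain; the check below (my own) compares component-wise identity of the pairs against equality of the encodings for every combination of four elements.

```python
from itertools import product

def kpair(a, b):
    """The (long) Kuratowski encoding of the pair <a, b>."""
    return frozenset({frozenset({a}), frozenset({a, b})})

# Pairs are equal iff their components match position by position;
# check that the encoding respects exactly that, for all a, b, c, d.
domain = range(4)
ok = all((kpair(a, b) == kpair(c, d)) == (a == c and b == d)
         for a, b, c, d in product(domain, repeat=4))
print(ok)   # True
```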
<h2 id="when-does-the-kuratowski-definition-work">When does the Kuratowski definition work?</h2>
<p>Did you notice something in the proof above? The proof is mathematically sound, but it relies on specific assumptions of set theory. For instance, that it is impossible for both <span class="math inline">\(c \neq d\)</span> and <span class="math inline">\(\{a\} = \{c,d\}\)</span> to be true because a set with one member can never be identical to a set with two members. Alright, that’s intuitive enough, but things get worse.</p>
<p>For Minimalist syntax, we don’t really care about the long Kuratowski definition with <span class="math inline">\(\{ \{a\}, \{a,b\} \} = \langle a,b \rangle\)</span>, we want the short one with <span class="math inline">\(\{ a, \{a, b\} \} = \langle a,b \rangle\)</span> because that’s the kind of set-theoretic object that’s built by Merge. The proof above, however, runs into a problem with the short definition. Suppose <span class="math inline">\(\{a, \{a, b\}\} = \{c, \{c,d\} \}\)</span> and <span class="math inline">\(a = \{c, d\}\)</span>. Then either <span class="math inline">\(\{a, b\} = c\)</span> or <span class="math inline">\(\{a, b\} = \{c, d\}\)</span>. We have to rule out the case that <span class="math inline">\(\{a, b\} = c\)</span> — otherwise, the connection to pairs will break down as we could have really weird sets that are equivalent even though they wouldn’t be equivalent when viewed as pairs.</p>
<p>Intuitively, <span class="math inline">\(\{a, b\} = c\)</span> is easy to rule out. If <span class="math inline">\(\{a, b\} = c\)</span> and <span class="math inline">\(a = \{c, d\}\)</span>, then we get some kind of infinite loop by substituting <span class="math inline">\(\{c, d\}\)</span> for <span class="math inline">\(a\)</span> and <span class="math inline">\(\{a, b\}\)</span> for <span class="math inline">\(c\)</span>: <span class="math display">\[a = \{c, d\} = \{ \{a, b\}, d \} = \{ \{ \{c,d\}, b \}, d \} = \{ \{ \{ \{a, b\}, d\}, b\}, d\} = \ldots\]</span> Clearly that’s not okay, right? Actually, it is.</p>
<p>Ruling out such cases of infinite recursion requires the <strong>axiom of regularity</strong>. This axiom is part of the standard formalization of set theory known as ZFC, <strong>Zermelo-Fraenkel with the axiom of choice</strong>. That is actually a really weird axiomatization because it is stated in first-order logic, which means that sets and the objects contained by sets have the same type. If you still think a set is a collection of objects, you’re not thinking in ZFC terms where there is no distinction between objects and collections of objects. ZFC is about as far away from our informal understanding of sets as one can get.</p>
<p>And to add insult to injury, the axiom of regularity does precious little work for ZFC. All the important theorems for ZFC set theory hold irrespective of whether one enforces the axiom of regularity. And there is a giant cottage industry of non-standard set theories that all eschew the axiom of regularity plus a truckload of other ZFC axioms. There is no such thing as <strong>the</strong> definition of sets, there’s many competing formalizations that support completely different theorems. Many of them do not support the short Kuratowski definition. The short Kuratowski definition simply does not work unless one makes very specific commitments about the nature of sets.</p>
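<p>If you want to poke at the two definitions yourself, here’s a quick Python sketch (my own toy code, not anything from the set theory literature). Keep in mind that Python’s <code>frozenset</code> values are well-founded — you can’t build a membership cycle out of immutable objects — so this can only illustrate the well-behaved case, i.e. exactly the situation that the axiom of regularity enforces.</p>

```python
from itertools import product

def kuratowski_long(a, b):
    """Long Kuratowski pair: <a,b> = {{a}, {a,b}}."""
    return frozenset({frozenset({a}), frozenset({a, b})})

def kuratowski_short(a, b):
    """Short Kuratowski pair: <a,b> = {a, {a,b}}."""
    return frozenset({a, frozenset({a, b})})

# Over well-founded values, both encodings behave like pairs:
# two encoded pairs are equal iff their components match in order.
for enc in (kuratowski_long, kuratowski_short):
    for a, b, c, d in product(range(3), repeat=4):
        assert (enc(a, b) == enc(c, d)) == (a == c and b == d)
print("both definitions pass the pair test on well-founded values")
```

<p>The pathological case from above — <code>a = {c, d}</code> with <code>c = {a, b}</code> — simply cannot be constructed here, which is the point: the short definition works only where regularity (or something like it) already holds.</p>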
<h2 id="the-folly-of-mathing-your-syntax">The folly of mathing your syntax</h2>
<p>I think what this shows is that this kind of set-theoretic work in syntax is trying to have its cake and eat it too. On the one hand, nobody wants to say that syntax literally operates with a notion of set that corresponds to the ZFC axiomatization of set theory. That would entail a commitment to the psychological reality of its highly abstract and counter-intuitive axioms, including the axiom of regularity. And as far as cognitive commitments go, that’s pretty far out there. In general, the set-theoretic view of Merge is taken to be either a convenient metaphor or to be rooted in naive set theory. I don’t know of a single paper that uses the short Kuratowski definition and explicitly states that the sets built by Merge are assumed to obey the laws of ZFC. So that’s one side of the cake: naive notions of set, rather than mathematical set theory.</p>
<p>But on the other hand this work then drags out mathematical properties such as idempotency and the short Kuratowski definition, without acknowledging that these do not work with the intuitive notion of sets. Because, let’s face it, sets simply aren’t intuitive. The closest thing we have to sets in the real world is bags, and those still aren’t sets because they lack idempotency: a bag with two <span class="math inline">\(a\)</span>-objects is not the same as a bag with one <span class="math inline">\(a\)</span>-object. There is no such thing as an intuitive notion of sets; sets are intrinsically unintuitive.</p>
<p>And this takes me to the broader point I want to make in this series. All that mathematical quibbling about definitions, equivalences, and axiomatizations is completely pointless. Why would you ever open yourself up to criticism of that kind? None of the syntactic proposals that use the Kuratowski definition actually need it to make their point. The ideas could be stated in very different terms, and they would be none the poorer for it. Dressing up your linguistic idea in terms of sets doesn’t magically make it better or derive some linguistic property without further stipulations. Quite to the contrary: the moment you invoke the Kuratowski definition, you’re implicitly stipulating the cognitive reality of half a dozen first-order axioms. And in service of what? If your idea works, we can define it in a million ways and it doesn’t really matter what it looks like when it is hashed out in terms of sets. If your idea doesn’t work, then it doesn’t work and is bunk no matter how elegantly you derived it from set theory.</p>
<h2 id="references" class="unnumbered">References</h2>
<div id="refs" class="references">
<div id="ref-Chomsky95">
<p>Chomsky, Noam. 1995. Bare phrase structure. <em>Government and binding theory and the Minimalist program</em>, ed. by Gert Webelhuth, 383–440. Oxford: Blackwell.</p>
</div>
<div id="ref-Kuratowski21">
<p>Kuratowski, Kazimierz. 1921. Sur la notion de l’ordre dans la théorie des ensembles. <em>Fundamenta Mathematicae</em> 2.161–171.</p>
</div>
</div>
"Star-Free Regular Languages and Logic" at KWRegan's Blog2020-03-23T00:00:00-04:002020-03-23T00:00:00-04:00Jeffrey Heinztag:outde.xyz,2020-03-23:/2020-03-23/star-free-regular-languages-and-logic-at-kwregans-blog.html<p>Bill Idsardi brought this to my attention. Enjoy your reading!</p>
<p><a href="https://rjlipton.wordpress.com/2020/03/21/star-free-regular-languages-and-logic/">Star-Free Regular Languages and Logic</a></p>
<p>on the <a href="https://rjlipton.wordpress.com/">Gödel’s Lost Letter and P=NP</a> blog.</p>
Trees for free with tree-free syntax2020-03-06T00:00:00-05:002020-03-06T00:00:00-05:00Thomas Graftag:outde.xyz,2020-03-06:/2020-03-06/trees-for-free-with-tree-free-syntax.html<p>Here’s another quick follow-up to the <a href="https://outde.xyz/2020-02-20/unboundedness-is-a-red-herring.html">unboundedness argument</a>. As you might recall, that post discussed a very simple model of syntax whose only task it was to adjudicate the well-formedness of a small number of strings. Even for such a limited task, and with such a simple model, it quickly became clear that we need a more modular approach to succinctly capture the facts and state important generalizations. But once we had this more modular perspective, it no longer mattered whether syntax is actually unbounded. Assuming unboundedness, denying unboundedness, it doesn’t matter because the overall nature of the approach does not hinge on whether we incorporate an upper bound on anything. Well, something very similar also happens with another aspect of syntax that is beyond doubt in some communities and highly contentious in others: syntactic trees. </p>
<h2 id="the-first-and-only-example">The first and only example</h2>
<p>Remember that finite-state automata (FSAs) can be represented much more compactly via recursive transition networks (RTNs). As long as we put an upper bound on the number of recursion steps, every RTN can be compiled out into an FSA, although the FSA might be much larger and contain numerous redundancies. Here’s the RTN I provided for a tiny fragment of English:</p>
<figure>
<img src="https://outde.xyz/img/thomas/underappreciated_unboundedness/ftn_factored_embedding.svg" alt="An RTN with center-embedding" /><figcaption>An RTN with center-embedding</figcaption>
</figure>
<p>And indubitably you also remember how this device would generate <em>the fact that the fact surprised me surprised me</em>, which I explained with such remarkable lucidity that it should be indelibly etched into your mind:</p>
<blockquote>
<p>We start at S0 and take the NP edge, which puts us at NP0. At the same time, we put S1 on the stack to indicate that this is where we will reemerge the next time we exit an FSA at one of its final states. In the NP automaton, we move from NP0 all the way to NP3, generating <em>the fact that</em>. From NP3 we want to move to NP4, but this requires completing the S-edge. So we go to S0 and put NP4 on top of the stack. Our stack is now [NP4, S1], which means that the next final state we reach will take us to NP4 rather than S1. Anyways, we’re back in S0, and in order to go anywhere from here, we have to follow an NP-edge. Sigh. Back to NP0, and let’s put S1 on top of the stack, which is now [S1, NP4, S1]. We make our way from NP0 to NP2, outputting <em>the fact</em>. The total string generated so far is <em>the fact that the fact</em>. NP2 is a final state, and we exit the automaton here. We consult the stack and see that we have to reemerge at S1. So we go to S1 and truncate the stack to [NP4, S1]. From S1 we have to take a VP-edge to get to S2. Alright, you know the spiel: go to VP0, put S2 on top of the stack, giving us [S2, NP4, S1]. The VP-automaton is very simple, and we move straight from VP0 to VP2, outputting <em>surprised me</em> along the way. The string generated so far is <em>the fact that the fact surprised me</em>. VP2 is a final state, so we exit the VP-automaton. The stack tells us to reemerge at S2, so we do just that while popping S2 from the stack, leaving [NP4, S1]. Now we’re at S2, but that’s a final state, too, which means that we can exit the S-automaton and go… let’s query the stack… NP4! Alright, go to NP4, and remove that entry from the stack, which is now [S1]. But you guessed it, NP4 is also a final state, so we go to S1, leaving us with an empty stack. From S1 we have to do one more run through the VP-automaton to finally end up in a final state with an empty stack, at which point we can finally stop. 
The output of all that transitioning back and forth: <em>the fact that the fact surprised me surprised me</em>.</p>
</blockquote>
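<p>If you prefer code over my prose, the stack-driven walk above can be phrased as a small recognizer. This is just an illustrative Python sketch — the state names follow the walkthrough, but the <code>accepts</code> function and its depth bound are my own additions, not part of any standard RTN library.</p>

```python
# Each subnetwork maps a state to its outgoing edges (label, target).
# A label that names a subnetwork ("S", "NP", "VP") is a recursive call;
# any other label is a terminal word.
NETWORKS = {
    "S":  {"S0": [("NP", "S1")], "S1": [("VP", "S2")], "S2": []},
    "NP": {"NP0": [("the", "NP1")], "NP1": [("fact", "NP2")],
           "NP2": [("that", "NP3")], "NP3": [("S", "NP4")], "NP4": []},
    "VP": {"VP0": [("surprised", "VP1")], "VP1": [("me", "VP2")], "VP2": []},
}
START = {"S": "S0", "NP": "NP0", "VP": "VP0"}
FINAL = {"S2", "NP2", "NP4", "VP2"}

def accepts(words, max_depth=10):
    """Depth-first search over RTN configurations (state, return stack)."""
    def step(state, stack, rest, depth):
        if state in FINAL:
            # Exit the current subautomaton and reemerge at the state on
            # top of the stack; with an empty stack and no input left,
            # the string is accepted.
            if stack and step(stack[-1], stack[:-1], rest, depth):
                return True
            if not stack and not rest:
                return True
        net = next(n for n in NETWORKS if state in NETWORKS[n])
        for label, target in NETWORKS[net][state]:
            if label in NETWORKS:  # recursive call: push the return state
                if depth > 0 and step(START[label], stack + (target,), rest, depth - 1):
                    return True
            elif rest and rest[0] == label:  # terminal edge: consume a word
                if step(target, stack, rest[1:], depth):
                    return True
        return False
    return step(START["S"], (), tuple(words), max_depth)

print(accepts("the fact surprised me".split()))                             # True
print(accepts("the fact that the fact surprised me surprised me".split()))  # True
print(accepts("the fact that surprised me".split()))                        # False
```

<p>The tuple <code>stack</code> plays exactly the role of the stack in the walkthrough: every call into a subnetwork pushes the state where we will reemerge, and every exit from a final state pops it.</p>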
<p>But believe it or not, this miracle of exposition can be represented more compactly in the form of a single diagram.</p>
<figure>
<img src="https://outde.xyz/img/thomas/underappreciated_unboundedness_trees/ftn_trace.svg" alt="A graph depicting how the subautomata call each other" /><figcaption>A graph depicting how the subautomata call each other</figcaption>
</figure>
<p>Looks familiar? There, let me rearrange it a bit and add an S at the top.</p>
<figure>
<img src="https://outde.xyz/img/thomas/underappreciated_unboundedness_trees/ftn_tree.svg" alt="OMG, it’s a tree" /><figcaption>OMG, it’s a tree</figcaption>
</figure>
<p><a href="https://www.smbc-comics.com/comic/sob">Son of a gun!</a></p>
<h2 id="trees-computational-traces">Trees = Computational traces</h2>
<p>The graph that we have up there is called a <strong>computational trace</strong>. It is a record of the steps of the computation that lead to the observed output. Computational traces aren’t anything fancy or language-specific, they arise naturally wherever computation takes place.</p>
<p>Computational traces don’t necessarily exhibit tree-like structures. They can just be strings, or they can be more complex objects, e.g. directed acyclic graphs (which a linguist would call multi-dominance trees that can have multiple roots). The interesting thing is that models of syntax inevitably give rise to computational traces that are at least trees. And the reason is once again that syntax pushes us in the direction of factorization, the direction of many small systems that invoke each other. The computational nature of syntax is intrinsically tree-like.</p>
<h2 id="closing-thoughts">Closing thoughts</h2>
<p>So there you have it. Even if syntax may just generate strings, like an FSA or RTN, it nonetheless exhibits tree structure in the accompanying computations. It doesn’t hinge on unboundedness. It doesn’t hinge on the self-embedding property or recursion, either — even if the RTN were just a finite transition network, the process of moving between automata would induce tree structure. And that’s why this is an underappreciated argument: it depends on so little, it avoids all the usual hot-button issues like recursion, and yet it is used so rarely.</p>
<p>Btw, the connection between trees and computation isn’t some fancy new insight. Mark Steedman has long argued for this view of syntactic structure. Heck, trees made their way into generative syntax as a compact way of representing the derivations of context-free grammars. But they also got reified very quickly, changing from records of syntactic computation to the primary data structure. This had the unintended consequence that the connection between trees and computation has slowly fallen into oblivion, and that makes trees look a lot more stipulative to outsiders.</p>
<p>I personally believe that the reification of trees has largely been a bad thing for the field. The original insights got shortened to the dogma that a syntactic formalism that doesn’t produce trees can’t possibly be right, even though the structure of the generated object has no direct connection to the structure of the generation mechanism. The reification of trees has erased that distinction, resulting in an overly narrow space of analytical options. One of the most important developments in computational syntax in the last twenty years was to tease them apart again and study the computational traces independently of what output they produce. This has been a very productive enterprise, and the insights obtained this way suggest that this is really what syntax is about.</p>
<p>It also fits naturally with the computational view of the inverted T-model. The <a href="https://outde.xyz/2019-05-15/underappreciated-arguments-the-inverted-t-model.html">bimorphism perspective</a> puts syntax in the position of an interpolant, a means of succinctly describing a computational system of bidirectional mappings. From this perspective, syntax simply is computation; syntactic structure is computational structure.</p>
Unboundedness, learning, and POS2020-02-26T00:00:00-05:002020-02-26T00:00:00-05:00Thomas Graftag:outde.xyz,2020-02-26:/2020-02-26/unboundedness-learning-and-pos.html<p><a href="https://outde.xyz/2020-02-20/unboundedness-is-a-red-herring.html">Ignas Rudaitis left a comment under my unboundedness post</a> that touches on an important issue: the interaction of unboundedness and the poverty of the stimulus (POS). My reply there had to be on the short side, so I figured I’d fill in the gaps with a short follow-up post. </p>
<h2 id="learnability-pos-complexity-theory-chess">Learnability : POS = Complexity theory : chess</h2>
<p>Ignas worries that if we do not assume unboundedness, then a lot of POS arguments lose their bite because unboundedness is what makes the learning problem hard. I do not agree with the latter, at least not in the literal sense. I think the relation between learnability results and actual language acquisition is a lot more subtle, and it is comparable to the relation between complexity theory and algorithm design.</p>
<p>The most famous results in complexity theory are about asymptotic worst-case complexity. That’s a fancy way of saying: how hard is it to solve a problem if everything that can go wrong does go wrong and we do not get to put an <em>a priori</em> bound on the size of the problem? To give a flowery analogy, complexity theorists don’t study the complexity of chess, they study the complexity of chess on an arbitrarily large chess board. But since we always play chess on an 8-by-8 board, results about arbitrary <span class="math inline">\(n\)</span>-by-<span class="math inline">\(n\)</span> boards seem pretty pointless. Except that, in practice, they’re not. The interesting thing is that many of the problems that complexity theory tells us are hard in the unbounded case aren’t really any easier in the bounded case. Hard problems tend to strike both hard and early. Boundedness doesn’t fix the issue. What you have to do is pinpoint the true source of complexity, and unboundedness is a useful assumption in doing so.</p>
<p>Something similar is going on with learnability results. Learnability is all about figuring out the structure of the hypothesis space and how that can be exploited to find the target language with as little input as possible. Unboundedness can be useful in that enterprise. But the insights that are gained this way are largely independent of unboundedness because the learning challenges strike hard and early.</p>
<h2 id="an-example-from-learning-sl-2">An example from learning SL-2</h2>
<p>Let’s see how this general point works out in a concrete case, the learning of strictly 2-local languages (SL-2). As you <a href="https://outde.xyz/2019-08-19/the-subregular-locality-zoo-sl-and-tsl.html">might remember from an earlier post</a>, a language is strictly local iff it can be described by a strictly local grammar, which is a finite set of forbidden substrings. An SL language is SL-2 iff all forbidden substrings are of length 2. Suppose that our alphabet only contains the symbol <span class="math inline">\(a\)</span>, and that we use <span class="math inline">\(\$\)</span> to indicate the edge of a string. Then there’s 4 distinct forbidden substrings of length 2:</p>
<ul>
<li><span class="math inline">\(aa\)</span>: don’t have <span class="math inline">\(a\)</span> next to <span class="math inline">\(a\)</span></li>
<li><span class="math inline">\(\$a\)</span>: don’t start with <span class="math inline">\(a\)</span></li>
<li><span class="math inline">\(a\$\)</span>: don’t end with <span class="math inline">\(a\)</span></li>
<li><span class="math inline">\(\$\$\)</span>: don’t have a string without any symbols (the <strong>empty string</strong>)</li>
</ul>
<p>There are <span class="math inline">\(2^4 = 16\)</span> distinct grammars that we can build from those 4 forbidden bigrams. For instance, <span class="math inline">\(\{ \$\$ \}\)</span> permits all strings over <span class="math inline">\(a\)</span> except the empty string. The larger grammar <span class="math inline">\(\{ \$\$, aa \}\)</span>, on the other hand, generates the string <span class="math inline">\(a\)</span> and nothing else (<span class="math inline">\(\$\$\)</span> rules out the empty string, and <span class="math inline">\(aa\)</span> rules out all longer strings). Since there’s only finitely many distinct SL-2 grammars, we know that the space of SL-2 languages is finite, which means that the class of SL-2 languages can be learned in the limit from positive text. But the standard learning algorithm for finite language classes isn’t all that efficient. We can do better if we pay attention to the structure of the space.</p>
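<p>Both example grammars are easy to check mechanically. A quick Python sketch (the <code>generates</code> helper is my own name, nothing official):</p>

```python
def generates(grammar, string):
    """True iff the string contains none of the forbidden bigrams,
    with $ marking the string edges."""
    padded = "$" + string + "$"
    return all(padded[i:i + 2] not in grammar for i in range(len(padded) - 1))

candidates = ["a" * n for n in range(5)]  # "", "a", "aa", "aaa", "aaaa"
print([s for s in candidates if generates({"$$"}, s)])        # ['a', 'aa', 'aaa', 'aaaa']
print([s for s in candidates if generates({"$$", "aa"}, s)])  # ['a']
```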
<p>To this end, we take all 16 SL-2 grammars and order them by the subset relation.</p>
<figure>
<img src="https://outde.xyz/img/thomas/underappreciated_unboundedness_pos/sl2_lattice.svg" alt="Behold the space of SL-2 grammars" /><figcaption>Behold the space of SL-2 grammars</figcaption>
</figure>
<p>Why, that’s a nice powerset lattice we’ve got there. The learner can exploit this structure as follows:</p>
<ol type="1">
<li>Start with the grammar at the top of the lattice as the initial conjecture. That is to say, assume that everything is forbidden.</li>
<li>If you see an input <em>i</em>, extract all bigrams from <em>i</em> and remove them from the currently conjectured grammar. The intuition: bigrams we see in the input cannot be illicit and thus must not be in the set of forbidden bigrams.</li>
<li>Continue doing this until we converge onto a specific point in that lattice.</li>
</ol>
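<p>The three steps above fit in a few lines of Python. Just a sketch — <code>sl2_learner</code> is a made-up name, and the grammar is hard-coded for the unary alphabet from above.</p>

```python
def bigrams(string):
    """All bigrams of a string, with $ marking the string edges."""
    padded = "$" + string + "$"
    return {padded[i:i + 2] for i in range(len(padded) - 1)}

def sl2_learner(sample):
    """Start at the top of the lattice (everything forbidden) and
    discard every bigram attested in the input."""
    grammar = {"aa", "$a", "a$", "$$"}
    for string in sample:
        grammar -= bigrams(string)
    return grammar

# After the single input "a", the conjecture is {aa, $$}: the grammar
# that generates the string a and nothing else.
print(sl2_learner(["a"]) == {"aa", "$$"})  # True
```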
<p>You might say this is a really obvious learning algorithm: just discard forbidden bigrams that you see in the input. What’s so great about that? Well, nothing. The interesting tidbit is the lattice structure that the learner operates over. This structure tells us that the learner can converge on the target grammar really, really fast. The number of steps the learner has to take is bounded by the number of “levels” in the lattice. That’s because removing a bigram will always move us down by at least one level in the lattice, and we can only go so far before we hit rock bottom. Instead of 16 grammars, we have to test at most 4 grammars after the initial hypothesis that everything is forbidden. The lattice structure of the hypothesis space allows us to rule out tons of potential grammars with just one piece of data.</p>
<p>This combinatorial fact gets a lot more impressive as we move beyond bigrams. The space of SL-<span class="math inline">\(n\)</span> grammars over the alphabet <span class="math inline">\(a\)</span> has <span class="math inline">\(2^{2^n}\)</span> grammars, but the learner has to test at most <span class="math inline">\(2^n\)</span>. With 5-grams, that’s already a huge difference — at most 32 grammars have to be tested in the space of 4,294,967,296 grammars. Talk about exploiting the structure of the learning space!</p>
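<p>The arithmetic behind this is easy to verify:</p>

```python
# SL-n grammars over the alphabet {a}: 2^(2^n) grammars in the lattice,
# but the learner tests at most 2^n of them (one per lattice level).
for n in range(2, 6):
    print(n, 2 ** (2 ** n), 2 ** n)
```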
<h2 id="unboundedness-doesnt-matter-again">(Un)Boundedness doesn’t matter… again</h2>
<p>Crucially, the lattice structure of the space is completely independent of unboundedness. Suppose that we take all SL-2 languages and limit their strings to a length of 10. This does not change anything about the structure of the hypothesis space. To wit, here’s two diagrams, one showing the mapping from SL-2 grammars to SL-2 string languages, the other one the mapping from SL-2 grammars to SL-2 string languages with string length less than 10. To avoid clutter, I omit the arrows for grammars that generate the empty language (which is deviant as a natural language anyways).</p>
<figure>
<img src="https://outde.xyz/img/thomas/underappreciated_unboundedness_pos/sl2_unbounded.svg" alt="The function f maps SL-2 grammars to the corresponding SL-2 string languages" /><figcaption>The function <span class="math inline">\(f\)</span> maps SL-2 grammars to the corresponding SL-2 string languages</figcaption>
</figure>
<figure>
<img src="https://outde.xyz/img/thomas/underappreciated_unboundedness_pos/sl2_bounded.svg" alt="The function g maps SL-2 grammars to the corresponding SL-2 string languages, restricted to strings up to length 10" /><figcaption>The function <span class="math inline">\(g\)</span> maps SL-2 grammars to the corresponding SL-2 string languages, restricted to strings up to length 10</figcaption>
</figure>
<p>Do you see a difference? Because I certainly don’t. The boundedness has no impact on the hypothesis space, only on the relation between grammars and generated languages.</p>
<p>We could have gone with a different hypothesis space, of course. For instance, we could have used the full space of SL-10 local languages, in which case the boundedness can be directly encoded in the SL grammar. But then the space will be much larger, there will be <span class="math inline">\(2^{2^{10}} = 2^{1024} = \text{enormously many}\)</span> grammars, and even the efficient lattice space learner may take up to 1024 steps to find the target language. Beyond that, we will also miss crucial generalizations (e.g. that no language should allow <em>aa</em> while forbidding <em>aaa</em>). And that’s the POS argument right there: the SL-10 hypothesis space furnishes grammars/languages that we really do not want if the target space is SL-2 with an upper bound on string length. And of course the same problem would hold if we lift the upper bound on string length, the lattice of SL-10 grammars would still be a bad hypothesis space for learning SL-2 languages. Bounded, unbounded, once again it does not matter.</p>
<h2 id="as-i-said-a-red-herring">As I said, a red herring</h2>
<p>Whether you assume that natural language is bounded or not, the fact remains that even within the bounded portion there is a lot of systematicity that linguistic proposals have to capture. The systematicity reflects a very constrained and richly structured hypothesis space, and exploiting this structure makes the learning problem much easier. This is what drives POS arguments. Whether the mapping from the hypothesis space to the generated languages incorporates (un)boundedness does not change this basic fact. If anything, shifting aspects like boundedness into the hypothesis space just messes things up. That’s what we saw with the step from SL-2 to SL-10 above. It’s FSAs all over again, missing generalizations and furnishing hypotheses that really shouldn’t be in that space.</p>
<p>POS arguments are driven by the structure of the space, the need to capture specific means of generalization. Boundedness is immaterial for that, it simply does not change the picture. A red herring indeed.</p>
Unboundedness is a red herring2020-02-20T00:00:00-05:002020-02-20T00:00:00-05:00Thomas Graftag:outde.xyz,2020-02-20:/2020-02-20/unboundedness-is-a-red-herring.html<p>Jon’s <a href="https://outde.xyz/2020-01-12/overappreciated-arguments-marrs-three-levels.html">post on the overappreciated Marr argument</a> reminded me that it’s been a while since the last entry in the <em>Underappreciated arguments</em> series. And seeing how the competence-performance distinction showed up in the comments section of <a href="https://outde.xyz/2019-12-28/semantics-should-be-like-parsing.html">my post about why semantics should be like parsing</a>, this might be a good time to talk about one of the central tenets of this distinction: unboundedness. Unboundedness, and the corollary that natural languages are infinite, is one of the first things that we teach students in a linguistics intro, and it is one of the first things that psychologists and other non-linguists will object to. But the dirty secret is that nothing really hinges on it. </p>
<h2 id="the-standard-argument-and-the-counterarguments">The standard argument and the counterarguments</h2>
<p>You all know the standard argument for the unboundedness of certain syntactic constructions. Native speakers of English know that all of the following are well-formed:</p>
<ol class="example" type="1">
<li>John’s favorite number is 1.</li>
<li>John’s favorite numbers are 1 and 2.</li>
<li>John’s favorite numbers are 1, 2, and 3.</li>
<li>John’s favorite numbers are 1, 2, 3, and 4.</li>
</ol>
<p>We can continue this up to any natural number <em>n</em>, and the sentence will still be well-formed. But that means there is no finite cut-off point, we can always construct a sentence that’s even longer. This unboundedness of the construction (in this case, coordination) implies that the set of well-formed sentences is infinite. Of course in the real world there are outside factors that limit <em>n</em>. For instance, you can only say a finitely bounded number of words before the heat death of the universe sets in and all existence ceases to be. Or if you want something with less of a dash of cosmic horror, human memory and attention span can only handle so much before you slip up. But that’s orthogonal to the speaker’s linguistic knowledge. That’s exactly why we want a competence-performance split, and hence we have unboundedness.</p>
<p>There’s two counterarguments to the standard unboundedness argument. The first one rejects the competence-performance distinction and says that linguistic knowledge is so deeply embedded in the performance systems that we cannot factor them out without losing the core of language. Basically, this group of researchers doesn’t want to study some lofty idealization of language, they want the real deal with all the behavioral quirks that come with it. Let’s call this the <strong>plea for performance</strong>.</p>
<p>The second argument is very different in nature as it fully acknowledges the importance of competence but takes umbrage at the inductive generalization step. As far as I know <span class="citation" data-cites="ScholzPullum02">Scholz and Pullum (2002)</span> were the first to make this argument in the literature, but it might have been floating around for a long time before that. Their point is that the unboundedness argument relies on an empirically unsupported step of induction. Thought experiments like the one above do not show us that there is no finite cut-off point past which competence breaks down. At best one can show that there is no cut-off point that is so low that we can find it experimentally. Except that there’s actually plenty of such cut-off points, e.g. with center embedding, but they are deemed inadmissible due to the competence-performance distinction. At this point the standard argument becomes circular: we assume a competence-performance split because the linguistic knowledge can generalize way beyond the limits of the performance systems, and all evidence to the contrary doesn’t count because we assume a competence-performance split. The standard argument implicitly assumes as an axiom that which it is supposed to derive, which means the unboundedness assumption is unfounded speculation. Let’s call this the <strong>specter of speculation</strong>.</p>
<p>We could now take a deep plunge into the merits of the plea for performance or the specter of speculation, but in the spirit of the <em>Underappreciated arguments</em> series we’ll simply avoid all of that by presenting an alternative argument for unboundedness that sidesteps the issues of the standard argument. Spoilers: it’s all about the rich combinatorics of syntax and how those are best described. I’ll present a very concrete example, but the general thrust of the argument goes back to <span class="citation" data-cites="Savitch93">Savitch (1993)</span>, who worked it out in a more abstract, mathematical setting with theorems and proofs.</p>
<h2 id="unboundedness-is-both-red-and-a-herring">Unboundedness is both red and a herring</h2>
<h3 id="a-bounded-grammar-fragment-for-syntax">A bounded grammar fragment for syntax</h3>
<p>Even the most ardent unboundedness skeptic will usually concede that the three English sentences below are well-formed.</p>
<!-- (@) The man was surprised by the outcome. -->
<!-- (@) The woman was surprised by the rumor that the man was surprised by the outcome. -->
<!-- (@) The girl was surprised by the fact that the woman was surprised by the rumor that the man was surprised by the outcome. -->
<ol start="5" class="example" type="1">
<li>The fact surprised me.</li>
<li>The fact that the fact surprised me surprised me.</li>
<li>The fact that the fact that the fact surprised me surprised me surprised me.</li>
</ol>
<p>The last one is pushing things a bit, and with further levels of embedding things do break down for most speakers ((5) and (6) are strictly speaking sufficient for the underappreciated argument to work, so you can disregard (7) if you don’t like it for some reason). Linguists usually treat the breakdown of center embedding as a performance artefact. But who knows, maybe it’s competence — we won’t make any commitments here. All we need is the three examples above, with a maximum of two levels of embedding.</p>
<p>What kind of computational mechanism could produce (5), (6), and (7)? One of the simplest available options is a <strong>finite-state automaton</strong> (FSA). Here is the FSA that generates the first sentence, and nothing else.</p>
<figure>
<img src="https://outde.xyz/img/thomas/underappreciated_unboundedness/embedding0.svg" alt="An FSA for the fact surprised me" /><figcaption>An FSA for <em>the fact surprised me</em></figcaption>
</figure>
<p>This FSA has a unique starting point, the initial state 0 marked by <em>start</em>. And it has a unique end point, the final state 4 marked by a double circle. The FSA considers a string well-formed iff this string describes a path from an initial state to a final state. In the example above, there is only one such path: <em>the fact surprised me</em>. So this FSA considers (5) well-formed, but nothing else.</p>
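<p>For concreteness, here is a minimal sketch of that FSA in code, assuming a dict-based encoding (the integer states mirror the figure, but the encoding and the <code>accepts</code> helper are mine):</p>

```python
# A minimal sketch of the FSA above, assuming a dict-based encoding:
# each (state, word) pair maps to the next state along that edge.
transitions = {
    (0, "the"): 1,
    (1, "fact"): 2,
    (2, "surprised"): 3,
    (3, "me"): 4,
}
initial, finals = 0, {4}

def accepts(words):
    """Return True iff the word sequence traces a path from the
    initial state to a final state."""
    state = initial
    for w in words:
        if (state, w) not in transitions:
            return False
        state = transitions[(state, w)]
    return state in finals

print(accepts("the fact surprised me".split()))   # True
print(accepts("the fact surprised you".split()))  # False
```

<p>Since there is only one path through the transition table, this automaton accepts exactly one string, just like the figure.</p>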
<p>Okay, then let’s try to expand the automaton so that it also accepts (6), which displays an instance of center embedding. Note that we cannot simply add an edge labeled <em>that</em> from the final state back to the initial state, as in the figure below.</p>
<figure>
<img src="https://outde.xyz/img/thomas/underappreciated_unboundedness/embedding0_loop.svg" alt="Looping back gives us right embedding, not center embedding" /><figcaption>Looping back gives us right embedding, not center embedding</figcaption>
</figure>
<p>This automaton would produce right-embedding sentences like <em>the fact surprised me that the fact surprised me</em>, not center embedding as in (6). And since the FSA contains a cycle, right-embedding can be repeated over and over again, allowing for an unbounded number of embeddings. We do not want that here because we’re trying to construct an argument that has no commitment to unboundedness.</p>
<p>Back to the drawing board then. In order to capture one level of center embedding, we have to add new structure to the automaton, yielding the FSA below.</p>
<figure>
<img src="https://outde.xyz/img/thomas/underappreciated_unboundedness/embedding1.svg" alt="The FSA now can handle one level of center embedding" /><figcaption>The FSA now can handle one level of center embedding</figcaption>
</figure>
<p>Now there are two possible paths depending on whether we follow the <em>surprised</em>-edge or the <em>that</em>-edge after state 2. Those two paths correspond to the strings in (5) and (6). We can use the same strategy to add yet another “level” to the automaton and add (7) to the set of well-formed strings.</p>
<figure>
<img src="https://outde.xyz/img/thomas/underappreciated_unboundedness/embedding2.svg" alt="And finally two levels of center embedding" /><figcaption>And finally two levels of center embedding</figcaption>
</figure>
<p>Alright, so now we have a simple computational device that handles the three sentences above. It doesn’t even induce any tree structures, so if you’re a fan of shallow parsing or similar ideas, this model does not directly contradict those assumptions either. It really is a minimal base; you’d be hard-pressed to find something even simpler that can handle up to two levels of center-embedding. Everybody on board? Great, time for the mid-argument turn-around!</p>
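<p>For concreteness, the two-level FSA can also be written down as a transition table. This is a minimal sketch assuming a dict-based encoding; state names like <code>b0</code> (after <em>fact</em> at depth 0) and <code>c0</code> (returning from an embedded clause) are my own labels, not the figure’s numbering.</p>

```python
# Sketch of the two-level center-embedding FSA in a dict encoding.
# s/a/b states read "the fact" at depths 0-2; v states read the verb;
# c0/c1 are the states we return to after closing an embedded clause.
transitions = {
    ("s0", "the"): "a0", ("a0", "fact"): "b0",
    ("b0", "surprised"): "v0", ("v0", "me"): "F",
    ("b0", "that"): "s1",
    ("s1", "the"): "a1", ("a1", "fact"): "b1",
    ("b1", "surprised"): "v1", ("v1", "me"): "c0",
    ("c0", "surprised"): "v0",
    ("b1", "that"): "s2",
    ("s2", "the"): "a2", ("a2", "fact"): "b2",
    ("b2", "surprised"): "v2", ("v2", "me"): "c1",
    ("c1", "surprised"): "v1",
}

def accepts(words, state="s0"):
    """Follow the unique path through the FSA, if there is one."""
    for w in words:
        if (state, w) not in transitions:
            return False
        state = transitions[(state, w)]
    return state == "F"
```

<p>This accepts (5), (6), and (7), rejects right-embedding strings like <em>the fact surprised me that the fact surprised me</em>, and rejects a third level of embedding because <code>b2</code> has no <em>that</em>-edge.</p>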
<h3 id="generalization-is-the-key">Generalization is the key</h3>
<p>Clearly the three sentences above aren’t the only sentences of English. At the very least, there’s the following three variations.</p>
<ol start="8" class="example" type="1">
<li>The fact shocked me.</li>
<li>The fact that the fact shocked me shocked me.</li>
<li>The fact that the fact that the fact shocked me shocked me shocked me.</li>
</ol>
<p>We can of course extend the FSA to allow for those sentences, but note that we have to modify it in multiple places due to how the FSA handles embedding.</p>
<figure>
<img src="https://outde.xyz/img/thomas/underappreciated_unboundedness/embedding2_newverb.svg" alt="Adding 1 verb requires 5 changes" /><figcaption>Adding 1 verb requires 5 changes</figcaption>
</figure>
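<p>The duplication is easy to see in code. In a toy dict encoding of the two-level FSA (state names mine, not the figure’s), every verb needs its own edge at each embedding depth and at each return state, which here likewise comes out to five separate additions per verb:</p>

```python
# Toy illustration of why one new verb touches the flat two-level FSA
# in several places: a verb edge exists once per embedding depth and
# once per "return from embedding" state.
transitions = {
    ("s0", "the"): "a0", ("a0", "fact"): "b0", ("b0", "that"): "s1",
    ("s1", "the"): "a1", ("a1", "fact"): "b1", ("b1", "that"): "s2",
    ("s2", "the"): "a2", ("a2", "fact"): "b2",
    ("v0", "me"): "F", ("v1", "me"): "c0", ("v2", "me"): "c1",
}
# Five positions in the automaton need their own edge for every verb,
# so adding "shocked" means five separate additions.
verb_slots = [("b0", "v0"), ("b1", "v1"), ("b2", "v2"),
              ("c0", "v0"), ("c1", "v1")]
for verb in ("surprised", "shocked"):
    for src, dst in verb_slots:
        transitions[(src, verb)] = dst

def accepts(words, state="s0"):
    for w in words:
        if (state, w) not in transitions:
            return False
        state = transitions[(state, w)]
    return state == "F"
```

<p>Contrast that with a rule-like description, where a new verb would be listed exactly once.</p>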
<p>That’s not nice, and it only gets worse from here. The phrase <em>the fact</em> isn’t exactly representative of the richness of English noun phrases. All of the following are also well-formed:</p>
<ol start="11" class="example" type="1">
<li>facts</li>
<li>the facts</li>
<li>the well-known facts</li>
<li>three very well-known facts</li>
<li>these three very, very well-known facts</li>
<li>these three very, very well-known, controversial, definitely irrefutable facts</li>
</ol>
<p>And at the same time there are some combinations that do not work, e.g. using the indefinite with a mass noun (<em>a furniture</em>) or combining a sentential complement with a noun that cannot take such an argument (<em>the car that you annoyed me</em>). Let’s put those aside and try to give an FSA that handles only the very basic facts in (11)–(16). To save space, I don’t number the states, I use parts of speech instead of lexical items in this FSA, and I allow loops, which strictly speaking allows for unboundedness. But look guys, this is already a chunky FSA as is, I really don’t want to explode it even further to enforce a limit on how many adverbs or adjectives we may have:</p>
<figure>
<img src="https://outde.xyz/img/thomas/underappreciated_unboundedness/np.svg" alt="An FSA for (a fragment of) English noun phrases" /><figcaption>An FSA for (a fragment of) English noun phrases</figcaption>
</figure>
<p>And now we have to — you guessed it — insert that into the previous automaton in three distinct subject positions.</p>
<figure>
<img src="https://outde.xyz/img/thomas/underappreciated_unboundedness/embedding2_npsubjects.svg" alt="We can’t make it smaller than this, and this is a mess" /><figcaption>We can’t make it smaller than this, and this is a mess</figcaption>
</figure>
<p>Yikes! Pleasant to look at this is not. And this is already a simplification because we only expanded the options for the subject position, while objects are still limited to just <em>me</em>. And remember that we didn’t worry about things such as the mass/count distinction or the selectional restrictions of nouns. Nor did we consider that noun phrases can be embedded inside other noun phrases:</p>
<ol start="17" class="example" type="1">
<li>this fact about language in your talk</li>
<li>the destruction of the city</li>
</ol>
<p>Assuming again a cut-off point of 3 levels of embedding for noun phrases, we should actually have the FSA below.</p>
<figure>
<img src="https://outde.xyz/img/thomas/underappreciated_unboundedness/insane.svg" alt="I mean, seriously" /><figcaption>I mean, seriously</figcaption>
</figure>
<p>At this point, we should really start looking for a different way of describing those FSAs because nobody can make sense of those giant graphs, and we’re still just talking about a tiny fragment of English.</p>
<p>And while we’re at it, this description mechanism should also enforce a certain degree of uniformity across the levels of embedding. In principle, we could modify the FSA above such that adjectives are only allowed at the lowest level of embedding, or such that odd levels of embedding have the linear order Det-Num-Adj-N and even levels instead have Num-N-Adj-Det. No natural language works like this. But the FSA can make this distinction because it is always aware of which level of embedding it is at. We can exploit the fact that the rules of grammar are (largely) uniform across levels of embedding to give a more compact, factorized description of FSAs.</p>
<h3 id="factoring-fsas-into-transition-networks">Factoring FSAs into transition networks</h3>
<p>Suppose that instead of a single, all-encompassing FSA, we have a collection of FSAs that we can switch between as we see fit. For instance, the very first FSA we saw, which only generates <em>The fact surprised me</em>, could instead be described as a collection of three interacting FSAs.</p>
<figure>
<img src="https://outde.xyz/img/thomas/underappreciated_unboundedness/ftn_factored.svg" alt="A network of interacting FSAs" /><figcaption>A network of interacting FSAs</figcaption>
</figure>
<p>We start with the S-automaton. In order to move from state S0 to state S1, we have to make our way through the NP-automaton. In this automaton, we follow the path <em>the fact</em> to make our way from NP0 to NP1 and then NP2, which is a final state. At this point we are done with the NP-automaton and reemerge in position S1. From there we want to get to S2, the final state of the S-automaton, but doing so requires us to make our way through the VP-automaton. Alright, so we start at VP0 and trace a path to VP2, giving us <em>surprised me</em>. Once VP2 is reached, we reemerge at S2, and since that is a final state of the FSA we started with, we can finally stop. The path we took through these automata corresponds to the string <em>the fact surprised me</em>. So the factored description in terms of multiple automata generates the same output as the original FSA.</p>
<p>This kind of factored representation is called a <strong>finite transition network</strong> (FTN). For each FTN, we can construct an equivalent FSA by replacing edges with the automata they refer to. For the FTN above, we take the S-automaton and replace the NP-edge with the NP-automaton and the VP-edge with the VP-automaton.</p>
<figure>
<img src="https://outde.xyz/img/thomas/underappreciated_unboundedness/ftn_compiled.svg" alt="The FSA network can be compiled out into a single FSA" /><figcaption>The FSA network can be compiled out into a single FSA</figcaption>
</figure>
<p>Structurally, that is exactly the same automaton as the one we originally gave for <em>the fact surprised me</em>.</p>
<p>Factoring automata via FTNs can definitely be overkill, but it pays off for large automata because we can avoid duplication. To give just one example, it’s very easy to allow more complex objects than just <em>me</em>:</p>
<figure>
<img src="https://outde.xyz/img/thomas/underappreciated_unboundedness/ftn_factored_object.svg" alt="Complex objects come at a minimal cost" /><figcaption>Complex objects come at a minimal cost</figcaption>
</figure>
<p>All we did was add an NP-edge from VP1 to VP2, and now our objects can be just as complex as our subjects.</p>
<p>The FTN above does not handle any levels of center embedding yet. The easiest way to do this is to add an S-edge to the NP automaton.</p>
<figure>
<img src="https://outde.xyz/img/thomas/underappreciated_unboundedness/ftn_factored_embedding.svg" alt="And center embedding is also easy to add" /><figcaption>And center embedding is also easy to add</figcaption>
</figure>
<p>Our FTN is now a <strong>recursive transition network</strong> (RTN). The RTN uses a stack to keep track of how we move between the FSAs. Here’s how this works for <em>the fact that the fact surprised me surprised me</em>. Warning, this will take a bit:</p>
<p>We start at S0 and take the NP edge, which puts us at NP0. At the same time, we put S1 on the stack to indicate that this is where we will reemerge the next time we exit an FSA at one of its final states. In the NP automaton, we move from NP0 all the way to NP3, generating <em>the fact that</em>. From NP3 we want to move to NP4, but this requires completing the S-edge. So we go to S0 and put NP4 on top of the stack. Our stack is now [NP4, S1], which means that the next final state we reach will take us to NP4 rather than S1. Anyways, we’re back in S0, and in order to go anywhere from here, we have to follow an NP-edge. Sigh. Back to NP0, and let’s put S1 on top of the stack, which is now [S1, NP4, S1]. We make our way from NP0 to NP2, outputting <em>the fact</em>. The total string generated so far is <em>the fact that the fact</em>. NP2 is a final state, and we exit the automaton here. We consult the stack and see that we have to reemerge at S1. So we go to S1 and truncate the stack to [NP4, S1]. From S1 we have to take a VP-edge to get to S2. Alright, you know the spiel: go to VP0, put S2 on top of the stack, giving us [S2, NP4, S1]. The VP-automaton is very simple, and we move straight from VP0 to VP2, outputting <em>surprised me</em> along the way. The string generated so far is <em>the fact that the fact surprised me</em>. VP2 is a final state, so we exit the VP-automaton. The stack tells us to reemerge at S2, so we do just that while popping S2 from the stack, leaving [NP4, S1]. Now we’re at S2, but that’s a final state, too, which means that we can exit the S-automaton and go… let’s query the stack… NP4! Alright, go to NP4, and remove that entry from the stack, which is now [S1]. But you guessed it, NP4 is also a final state, so we go to S1, leaving us with an empty stack. From S1 we have to do one more run through the VP-automaton to finally end up in a final state with an empty stack, at which point we can finally stop. 
The output of all that transitioning back and forth: <em>the fact that the fact surprised me surprised me</em>.</p>
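<p>The walkthrough above can be simulated directly. Below is a minimal sketch of an RTN acceptor, assuming my own encoding of the three automata (the state names, the <code>CALL:</code> convention, and the <code>accepts</code> helper are mine, not the post’s notation): each machine is a set of labeled edges, and a tuple-based stack records where to reemerge after finishing an embedded machine.</p>

```python
# Sketch of the RTN above. Edges are either words or calls to another
# machine (marked "CALL:"); the stack holds (machine, state) pairs
# telling us where to reemerge when we exit a machine at a final state.
machines = {
    "S":  {("S0", "CALL:NP"): "S1", ("S1", "CALL:VP"): "S2"},
    "NP": {("NP0", "the"): "NP1", ("NP1", "fact"): "NP2",
           ("NP2", "that"): "NP3", ("NP3", "CALL:S"): "NP4"},
    "VP": {("VP0", "surprised"): "VP1", ("VP1", "me"): "VP2"},
}
finals = {"S": {"S2"}, "NP": {"NP2", "NP4"}, "VP": {"VP2"}}

def accepts(words, machine="S", state="S0", stack=()):
    """Nondeterministically thread through the network."""
    if not words and state in finals[machine] and not stack:
        return True
    # Option 1: exit the current machine at a final state, pop the stack.
    if state in finals[machine] and stack:
        (m, s), rest = stack[0], stack[1:]
        if accepts(words, m, s, rest):
            return True
    # Option 2: follow an edge out of the current state.
    for (src, label), dst in machines[machine].items():
        if src != state:
            continue
        if label.startswith("CALL:"):
            sub = label[5:]
            # Enter the called machine; remember to reemerge at dst.
            if accepts(words, sub, sub + "0", ((machine, dst),) + stack):
                return True
        elif words and words[0] == label:
            if accepts(words[1:], machine, dst, stack):
                return True
    return False
```

<p>With this in hand, the whole walkthrough is one call: <code>accepts("the fact that the fact surprised me surprised me".split())</code> returns <code>True</code>, and nothing in the code caps how deep the calls can nest.</p>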
<h3 id="hey-thats-not-bounded">Hey, that’s not bounded!</h3>
<p>I mentioned earlier that every FTN can be compiled out into an equivalent FSA. The same is not true for RTNs, and that’s because there is no limit on how deeply the automata can be nested. Generating <em>the fact that the fact that the fact that the fact surprised me surprised me surprised me surprised me</em> would have been exactly the same, mechanically. However, RTNs can still be converted to FSAs if we put an upper bound on the depth of the stack.</p>
<p>But I think you’ll agree that it doesn’t really matter for the RTN whether the stack depth is bounded. Yes, for some applications it may be convenient to convert the RTN to an FSA. But that has no impact on the shape of the RTN and dependencies between automata that it describes. It is immaterial for the generalizations. And that’s why it doesn’t matter whether language is truly unbounded or not. Maybe what we have in our heads is just a giant FSA. It simply does not matter. The combinatorics of syntax are such that the quest for a compact description will inevitably drive you towards machinery that is largely agnostic about whether unboundedness holds.</p>
<h2 id="the-underappreciated-argument-in-a-nutshell">The underappreciated argument in a nutshell</h2>
<p>Alright, this has been a long read (and those pesky automata made it an even longer write). But the bottom line is that there’s no need to commit to unboundedness as an empirical truth. It’s not gonna fly for psychologists, and it runs into all the philosophical problems of arguments by induction. We can sidestep all of that.</p>
<p>Even if we stick only with those utterances that are easily processed, the combinatorial space displays an extraordinary degree of systematicity. Succinctly capturing these combinatorics requires factorization very much along the lines of what linguists have been doing. Even if you don’t like trees, if you think syntax is all about strings, that it does not need to support semantic interpretation, that performance puts a hard bound on competence, or that linguistic theories don’t need to enjoy any degree of cognitive reality, you’ll still end up in a corner where succinctness pushes you towards machinery that could just as well be unbounded.</p>
<p>Linguistic analysis is not undermined by the issue of whether language is truly unbounded because this is completely independent of the factors that favor factorization and succinctness. The same considerations that favor transformations over CFGs in <span class="citation" data-cites="Chomsky57">Chomsky (1957)</span> also favor not committing to boundedness. Bounded, unbounded, it simply does not matter, so don’t get hung up about it.</p>
<h2 id="references" class="unnumbered">References</h2>
<div id="refs" class="references">
<div id="ref-Chomsky57">
<p>Chomsky, Noam. 1957. <em>Syntactic structures</em>. The Hague: Mouton.</p>
</div>
<div id="ref-Savitch93">
<p>Savitch, Walter J. 1993. Why it might pay to assume that languages are infinite. <em>Annals of Mathematics and Artificial Intelligence</em> 8.17–25.</p>
</div>
<div id="ref-ScholzPullum02">
<p>Scholz, Barbara C., and Geoffrey K. Pullum. 2002. Searching for arguments to support linguistic nativism. <em>The Linguistic Review</em> 19.185–223.</p>
</div>
</div>