outdex

MGs do not struggle (as much as you think) with multiple wh-movement

2021-07-23T00:00:00-04:00

In February I had a nice chat with Bob Frank and Tim Hunter regarding their SCiL paper on comparing tree-construction methods across mildly context-sensitive formalisms. Among other things, this paper reiterates the received view that MGs cannot handle unbounded multiple wh-movement. That is certainly true for standard MGs as defined in Stabler (1997), but my argument was that this is due to what may now be considered an idiosyncrasy of the definition. We can relax that definition to allow for multiple wh-movement while preserving essential formal properties of MGs. However, friendly chats aren’t a good format for explaining this in detail, so I promised them an Outdex post with some math. Well, 5 months later, I finally make good on my promise.

Why multiple wh-movement is considered problematic for MGs

In MGs, movement is feature-triggered: the landing site must have a licensor feature f⁺, and the head of the mover must carry the matching licensee feature f^-. Consider a simple case of wh-movement:

(1) Which book did Mary give Kelly.

Let’s ignore all the syntactic details like VP-internal subjects, what triggers do-support, and the positions of direct and indirect objects so that we can focus just on the wh-movement. A fairly standard MG analysis would posit that which carries wh^- and did has wh⁺. Those are movement features of the same name but opposite polarity, which triggers movement of the phrase headed by which (i.e. which book) to a specifier of did. As part of this movement step, the wh-features that triggered it are also checked and deleted. Overall pretty vanilla and close in spirit to Chomsky (1995) or Adger (2003), except that MGs use feature polarity instead of interpretability.

We can represent the whole derivation for (1) in the form of a tree, but again I’ll only indicate the features that matter for wh-movement.

This isn’t just a matter of notation, those trees will play a key role in generalizing MGs so that they can handle multiple wh-movement.

But first things first, we haven’t even established yet why MGs have problems with multiple wh-movement. The problem is actually two-fold. One is what is known as the Shortest Move Constraint (SMC), the other is the 1-to-1 matching of MG feature checking. Let’s look at a concrete example. In (2) below, we now have a variant of (1) where all DPs undergo wh-movement.

(2) Which book who whom did give.

What would the MG derivation for this have to look like? Well, since each DP moved, each one has to have a wh^- on the respective head. And since each wh^- must be checked against some wh⁺, which causes both features to be deleted, did must carry one wh⁺ for each moving DP, giving us a total of three wh⁺ on did. The corresponding derivation is shown below, with arrows indicating what is supposed to move where.

Alas, this is not a licit MG derivation because it violates the SMC. The SMC could also be called the Ghostbusters constraint: do not cross the (feature) streams! If we imagine each f^- sending a stream up through the tree until it finds a matching f⁺, then we may never have two f-streams travelling alongside each other. But this is exactly what happens in the derivation above, where three distinct wh-streams end up entangled.

Now this issue we can actually work around because who’s to say that all those wh-features really are the same wh-feature just because linguists treat them as wh-features. Maybe those aren’t wh^-, wh^-, and wh^- on which, who, and whom. Maybe they’re actually wh1^-, wh2^-, and wh3^-, giving us the lovely derivation tree below.

From a formal (substance-free) perspective, wh1^- and wh2^- are just as distinct as, say, wh^- and top^-. They are completely different features, and as a result the SMC, which only cares about crossed streams of the same feature type, is no longer violated. Yes, it is a very lame solution, but it is a solution.

Except, it’s not really. Because if there are languages with no upper bound on how many wh-phrases may move at the same time (and there might indeed be some depending on how one draws the line between competence and performance), then it’s not enough to just have wh1^-, wh2^-, and wh3^-. We also need wh4^-, wh5^-, and so on, ad infinitum. And this also means that we need infinitely many versions of did: one carrying only wh1⁺, one that carries wh1⁺ and wh2⁺, one with wh1⁺, wh2⁺, and wh3⁺, and so on. But the most fundamental assumption of MGs is that the collection of lexical types must be finite, which means the lexicon cannot contain infinitely many variants of did. We have finally hit a dead end: the combination of the SMC and one-to-one feature checking forces us to treat multiple wh-movement in a way that is incompatible with the very foundation of MGs.
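To make the blow-up concrete, here is a minimal Python sketch. The function names and the string encoding of features (`wh1+` standing in for wh1⁺) are assumptions made purely for illustration. All it shows is that every additional simultaneous mover forces one more entry for did, so no finite list of entries covers every case.

```python
def did_variant(n_movers):
    """Licensor features a hypothetical `did` needs to attract n indexed wh-movers."""
    return [f"wh{i}+" for i in range(1, n_movers + 1)]


def did_entries(max_movers):
    """All variants of `did` needed to cover 1..max_movers simultaneous wh-movers."""
    return [did_variant(n) for n in range(1, max_movers + 1)]


# One lexical entry per possible number of movers.
for entry in did_entries(3):
    print("did ::", " ".join(entry))

# The number of entries equals the bound on movers,
# so removing the bound removes any finite limit on the lexicon.
print(len(did_entries(10)))
```

Raising the argument of `did_entries` never stops producing new entries, which is exactly the clash with the finiteness of the lexicon.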

SMC, what art thou good for?

The argument above shows that MGs as defined in Stabler (1997) cannot handle unbounded multiple wh-movement. But it is important to keep in mind why MGs were defined this way, in particular with respect to the SMC. The SMC has always been a bit of an ugly wart on MGs, the one thing that every syntactician immediately calls out as a major deviation from mainstream Minimalism. Why then stick with the SMC? Because it is a conceptually simple constraint that gives us three properties:

Move is deterministic: there is no ambiguity as to what moves where (because the SMC rules out all configurations where such ambiguity could arise).
MG derivations are regular: the set of well-formed derivation trees forms a regular tree language (which is something mathematical linguists care a lot about).
Movement is regular: movement corresponds to a specific kind of regular tree transduction from derivation trees to derived structures (which, again, is something mathematical linguists care a lot about).

Instead of ensuring those three properties through independent means, one can just put in place the SMC and get them all as a corollary. But the price we pay for this simplicity is that the SMC brings about its own set of problems, with multiple wh-movement being the most prominent exponent. If we tackle the three properties independently of each other, then there is a fairly easy way to do multiple wh-movement in MGs.

Regular derivations with a decomposed SMC

The standard MG feature calculus can be reinterpreted as a collection of constraints on derivation trees. Side remark: That’s quite generally a good perspective to take as it gets us to look beyond feature notation (where MGs and Minimalism differ quite a bit) to look at the behavior of the whole feature calculus (where MGs and Minimalism are very close in spirit). Anyhoo, this tree-geometric view of feature checking is exactly what we need to get a grip on multiple wh-movement.

In tree-geometric terms, MG feature checking revolves all around what I decided to call occurrences many years ago, in my youthful exuberance. Intuitively, a Merge node or a Move node is an occurrence of a lexical item iff it checks one of the negative features on said lexical item. More precisely:

The 0-occurrence of a lexical item l is the Merge step where l gets selected, checking its category feature (in MGs, category features are negative and selector features are positive, so the 0-occurrence still involves checking of a negative feature).
The $i$-th occurrence of l is the closest node that
1. properly dominates the $(i-1)$-th occurrence of l, and
2. can check the $i$-th licensee feature on l (i.e. the $i$-th negative movement feature).

This definition assumes that if a lexical item has multiple licensee features, they are linearly ordered to indicate in which order they must be checked. There is an alternative definition for MGs with unordered licensee features, but for our discussion it won’t matter either way because we can get away with assuming that no lexical item has more than one licensee feature. So, feel free to stick with the intuitive notion, supplemented by whatever understanding you glean from the two examples below.
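For readers who prefer code to definitions, the following is a rough Python sketch of how occurrences could be computed. Several things here are assumptions made for illustration and not part of the original definitions: the dictionary encoding of the derivation tree, the node names, the convention that the selecting side of a Merge node is listed first, the `CHECKS` annotation recording which feature each Move node checks, and the exact shape of the tree for (1).

```python
# Derivation tree for (1): mother -> ordered daughters.
# Assumed convention: for Merge, the selecting (head) side comes first.
TREE = {
    "Move1": ["Merge5"],
    "Merge5": ["did", "Merge4"],
    "Merge4": ["Merge3", "Mary"],
    "Merge3": ["Merge2", "Merge1"],
    "Merge2": ["give", "Kelly"],
    "Merge1": ["which", "book"],
}
LICENSEES = {"which": ["wh"]}  # ordered licensee features per lexical item
CHECKS = {"Move1": "wh"}       # assumed annotation: feature checked by each Move node


def parents(tree):
    """Map every node to its mother."""
    return {kid: mother for mother, kids in tree.items() for kid in kids}


def slice_root(tree, item):
    """Root of the phrase projected by `item`: climb while we are the head side."""
    up = parents(tree)
    node = item
    while node in up and tree[up[node]][0] == node:
        node = up[node]
    return node


def occurrences(tree, licensees, checks, item):
    """List the 0-th, 1st, ... occurrences of `item`, lowest first."""
    up = parents(tree)
    zero = up.get(slice_root(tree, item))  # Merge step where `item` gets selected
    if zero is None:
        return []  # head of the whole tree: nothing selects it in this sketch
    occs = [zero]
    for feature in licensees.get(item, []):
        # Closest node properly dominating the previous occurrence
        # that can check the next licensee feature.
        node = up.get(occs[-1])
        while node is not None and checks.get(node) != feature:
            node = up.get(node)
        if node is None:
            break  # no checker available: the feature stays unchecked
        occs.append(node)
    return occs


for word in ["which", "book", "give", "Kelly", "Mary", "did"]:
    print(word, occurrences(TREE, LICENSEES, CHECKS, word))
```

For which, the result is the Merge node that selects which book followed by the single Move node. Every other lexical item has just its 0-occurrence, except did, which heads the whole tree and is never selected.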

The feature checking requirements of Merge and Move boil down to two simple constraints:

Full checking: Every lexical item with exactly n negative features has exactly n occurrences.
SMC: Every interior node is an occurrence for exactly one lexical item.

Yes, this looks different from how I described the SMC above, but it produces exactly the same tree language. Consider the two derivations we saw earlier, one for simple wh-movement, the other for multiple wh-movement. In the derivation with a single wh-movement step, both Full checking and SMC are satisfied: each lexical item’s number of occurrences matches the number of negative features it carries, and every interior node is an occurrence of exactly one lexical item.

In the illicit derivation with multiple wh-movement, however, only Full checking still holds whereas SMC is violated in multiple ways. First, the lowest Move node is an occurrence for three separate lexical items because it is the closest Move node that properly dominates the 0-occurrences of those lexical items and can check wh^-. But because the lowest Move node sucks up all those extra occurrences, we also have the opposite problem: the two higher Move nodes aren’t an occurrence for any lexical item at all, and as a result SMC is violated. So SMC establishes a delicate balance where no node may have too many or too few occurrences. The occurrence-based SMC looks very different from the Ghostbusters constraint, but the two are equivalent with respect to their derivational extension.

Alright, now that we can operate with the tree-geometric SMC constraint instead of the Ghostbusters constraint, we’re only one modification away from handling multiple wh-movement. The SMC condition is actually a shorthand for two separate constraints.

SMC (at least): Every interior node is an occurrence for at least one lexical item.
SMC (at most): Every interior node is an occurrence for at most one lexical item.
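The split can also be written as a formula. The notation $\text{occ}(n)$, for the set of lexical items that node $n$ is an occurrence of, is an assumption introduced only for this illustration, with $n$ ranging over interior nodes:

\[ \underbrace{\forall n\, [\, |\text{occ}(n)| = 1 \,]}_{\text{SMC}} \Leftrightarrow \underbrace{\forall n\, [\, |\text{occ}(n)| \geq 1 \,]}_{\text{SMC (at least)}} \wedge \underbrace{\forall n\, [\, |\text{occ}(n)| \leq 1 \,]}_{\text{SMC (at most)}} \]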

Suppose, then, that we drop SMC (at most), leaving us only with the requirement that every Merge or Move node is an occurrence of at least one lexical item — in other words, that it checks at least one negative feature. In one important sense, this doesn’t change much: the derivation tree language will still be regular, which is one of the three things the original SMC does for MGs. The weakened version gets the job done just as well in this respect. But with respect to multiple wh-movement, it completely changes things.

With the weakened SMC, we still can’t have the multiple wh-movement derivation we had above because that one contains two Move nodes without any occurrences. But what we can have is the configuration below:

This is almost exactly the same except that did carries only one wh⁺ and hence there’s only one Move node, which handles all three wh-movers at once. The derivation is now well-formed because every interior node is an occurrence of at least one lexical item. That a single node serves as multiple occurrences isn’t a problem anymore because we dropped SMC (at most). Intuitively, you may think of this as the wh⁺ on did acting as a persistent licensor feature that can check any number of wh^- below it. We could even parameterize it so that SMC (at most) is still active by default and only certain features on certain lexical items are exempt from it. That way, we can allow for multiple wh-movement without allowing for, say, multiple subject movement, although of course we would still need a theory of linguistic substance that explains to us why the latter never occurs even though it would be a simple parametric change. But the goal here isn’t to give an exhaustive analysis of multiple wh-movement, I merely want to show that multiple wh-movement isn’t the big hurdle for MGs that it is commonly believed to be.
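Here is a hedged Python sketch of Full checking and the two halves of the SMC, run on the derivation for (2) with three stacked Move nodes and with a single Move node. The tree shape, the node names, the `checks` annotation on Move nodes, and the `persistent` parameter are all assumptions made for illustration. The parameterization is also simplified, since it exempts a feature everywhere rather than on specific lexical items. The helpers are included so that the block runs on its own.

```python
def parents(tree):
    return {kid: mother for mother, kids in tree.items() for kid in kids}


def slice_root(tree, item):
    """Root of the phrase projected by `item` (head side is listed first)."""
    up = parents(tree)
    node = item
    while node in up and tree[up[node]][0] == node:
        node = up[node]
    return node


def occurrences(tree, licensees, checks, item):
    up = parents(tree)
    zero = up.get(slice_root(tree, item))
    if zero is None:
        return []
    occs = [zero]
    for feature in licensees.get(item, []):
        node = up.get(occs[-1])
        while node is not None and checks.get(node) != feature:
            node = up.get(node)
        if node is None:
            break
        occs.append(node)
    return occs


def check(tree, licensees, checks, persistent=()):
    """Evaluate Full checking, SMC (at least), and SMC (at most)."""
    up = parents(tree)
    leaves = [n for kids in tree.values() for n in kids if n not in tree]
    served = {node: [] for node in tree}  # interior node -> items it serves
    full = True
    for item in leaves:
        occs = occurrences(tree, licensees, checks, item)
        # The head of the whole tree is never selected, so its category
        # feature is not counted (a simplification of this sketch).
        heads_root = slice_root(tree, item) not in up
        expected = len(licensees.get(item, [])) + (0 if heads_root else 1)
        full = full and len(occs) == expected
        for node in occs:
            served[node].append(item)
    return {
        "full": full,
        "at_least": all(len(items) >= 1 for items in served.values()),
        # Nodes checking a persistent feature are exempt from the upper bound.
        "at_most": all(len(items) <= 1 or checks.get(node) in persistent
                       for node, items in served.items()),
    }


def multiple_wh(n_moves):
    """Derivation for (2) with `n_moves` stacked wh-Move nodes on top."""
    tree = {
        "Merge5": ["did", "Merge4"],
        "Merge4": ["Merge3", "who"],
        "Merge3": ["Merge2", "Merge1"],
        "Merge2": ["give", "whom"],
        "Merge1": ["which", "book"],
    }
    checks, top = {}, "Merge5"
    for i in range(1, n_moves + 1):
        tree[f"Move{i}"], checks[f"Move{i}"], top = [top], "wh", f"Move{i}"
    return tree, checks


LICENSEES = {"which": ["wh"], "who": ["wh"], "whom": ["wh"]}
three_moves = multiple_wh(3)
one_move = multiple_wh(1)

print(check(*three_moves[:1], LICENSEES, three_moves[1]))
print(check(one_move[0], LICENSEES, one_move[1], persistent={"wh"}))
```

With three Move nodes, only Full checking holds. With a single Move node and wh treated as persistent, all three checks succeed, and removing wh from `persistent` makes the upper bound fail again.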

A logical transduction for multiple wh-movement

We now have a variant of MGs that can strategically relax the SMC for some types of movement so that we can have derivations with multiple wh-movement, which we model as a persistent wh⁺ checking all wh^- below it (or in tree-geometric terms, one Move node serving as an occurrence for all wh-movers below it). This takes care of the derivational challenges posed by multiple wh-movement, and in my not-so-humble opinion that is the key issue. But of course it would be nice to know that this derivation can still be mapped to a derived structure. The cool thing is, this doesn’t even need any modifications on our part.

One particularly elegant view of movement is in terms of first-order transductions, which you might remember from this post. The idea here is that we define the relations that hold in the derived structure in terms of relations that hold in the derivation tree. More concretely, we may treat movement as a first-order formula that reinterprets the derivation tree as a multi-dominance tree where the root of a phrase is connected to all the occurrences of its head. Using $\triangleleft$ for immediate dominance in the derivation tree, we can define the immediate dominance relation $\blacktriangleleft$ in the output structure as follows:

\[ x \blacktriangleleft y \Leftrightarrow x \triangleleft y \vee \exists l [\text{occurrence}(x,l) \wedge \text{derivational-root}(y,l)] \]

Or in plain English:

Node x is a mother of node y in the derived multi-dominance tree iff one of the following holds:
1. x is a mother of y in the derivation tree,
2. x is an occurrence of some node l, and y is the root of the phrase projected by l.

This presupposes that we have first-order definitions of the predicates occurrence and derivational-root, which isn’t too hard; you can check the two Graf (2012) papers in the references for the precise formulas.
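As a procedural stand-in for the formula, the Python sketch below computes the derived immediate dominance relation as a set of arcs. It is not a first-order definition, and the dictionary encoding, node names, tree shape, and `CHECKS` annotation are assumptions made for illustration. The example tree is a derivation for (2) with a single Move node that serves all three wh-movers.

```python
TREE = {
    "Move1": ["Merge5"],
    "Merge5": ["did", "Merge4"],
    "Merge4": ["Merge3", "who"],
    "Merge3": ["Merge2", "Merge1"],
    "Merge2": ["give", "whom"],
    "Merge1": ["which", "book"],
}
LICENSEES = {"which": ["wh"], "who": ["wh"], "whom": ["wh"]}
CHECKS = {"Move1": "wh"}  # assumed annotation: feature checked by the Move node


def parents(tree):
    return {kid: mother for mother, kids in tree.items() for kid in kids}


def derivational_root(tree, item):
    """Root of the phrase projected by `item` (head side is listed first)."""
    up = parents(tree)
    node = item
    while node in up and tree[up[node]][0] == node:
        node = up[node]
    return node


def occurrences(tree, licensees, checks, item):
    up = parents(tree)
    zero = up.get(derivational_root(tree, item))
    if zero is None:
        return []
    occs = [zero]
    for feature in licensees.get(item, []):
        node = up.get(occs[-1])
        while node is not None and checks.get(node) != feature:
            node = up.get(node)
        if node is None:
            break
        occs.append(node)
    return occs


def derived_dominance(tree, licensees, checks):
    """Arcs (x, y) such that x is a mother of y in the multi-dominance tree."""
    # First disjunct: x is a mother of y in the derivation tree.
    arcs = {(mother, kid) for mother, kids in tree.items() for kid in kids}
    # Second disjunct: x is an occurrence of l, y the root of l's phrase.
    leaves = [n for kids in tree.values() for n in kids if n not in tree]
    for item in leaves:
        phrase = derivational_root(tree, item)
        for occ in occurrences(tree, licensees, checks, item):
            arcs.add((occ, phrase))
    return arcs


ARCS = derived_dominance(TREE, LICENSEES, CHECKS)
print(sorted(kid for mother, kid in ARCS if mother == "Move1"))
```

The 0-occurrences add nothing new because they coincide with arcs already in the derivation tree. The only additions are the arcs from the Move node to the three moved phrases, so that node ends up with four daughters.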

With the formula above, the derivation trees for simple and multiple wh-movement are mapped to the multi-dominance trees below. You might notice that they look basically the same as the original derivation trees, except that movement arrows are now interpreted as immediate dominance arcs.

The cool thing is that the first-order formula is independent of which version of the SMC is in place, it works equally well with both. In fact, this is just the usual first-order transduction for standard MGs, with not a single thing changed about it. Since the transduction hasn’t changed at all, movement is still deterministic and regular: there is no ambiguity as to how we should connect movers to their landing sites, and doing so requires no new machinery. The derivation tree languages are also still regular, as we saw above, so all three formal properties of MGs have been preserved while opening up the framework enough to handle multiple wh-movement. Mission accomplished, multiple wh-movement handled (well, there’s one more section coming, but let me gloat a bit for now).

Linearization and c-command

The multi-dominance tree above probably looks weird to you because we have a node with more than two daughters. That doesn’t usually happen in Minimalism, and it does mean that multiple wh-movement still creates two problems for us: c-command and, by extension, linearization. As is, all the movers c-command each other, which means that

they cannot be linearized according to the LCA, and
licensing between them should be unrestricted, e.g. for Principle A.

Now in an ideal world, it would turn out that multiple wh-movement affects c-command only with respect to phrases that are not part of the same cluster of multiple wh-movements. In that case, we could basically patch our definition of c-command so that we ignore the mutual c-command relations (and only the mutual ones!) that are brought about by multiple wh-movement. For our example derivation with multiple wh-movement, that would mean that we consider the wh-movement of which book, who, and whom with respect to other phrases so that they precede all the other lexical material, but then fall back to the base c-command linearizations between the three, giving us who which book whom did give. If we instead wanted which book who whom did give, then which book would first have to move to a position from where it c-commands the other wh-phrases, and only then does it partake in the multiple wh-movement. From a mathematical perspective, this kind of “partially c-command transparent” movement would be an easy and maximally general addition to the definition of c-command that is well within the realms of first-order logic and works with any arbitrary number of wh-movers.

But that doesn’t seem to be how things actually work. Omer Preminger mentioned that a while ago on this blog, citing work by Norvin Richards (if the link doesn’t take you right to his comment: it’s the one at the very top). ~~Multiple wh-movement seems to behave more like scrambling in that you can have any arbitrary order (modulo a few restrictions), and the c-command relations match that observed order.~~ {{{Edit: This characterization is overly simplified, and placing it right after the reference to Omer’s comment suggests that it is his mischaracterization when it is in fact me that’s being sloppy. Here’s the full quote:

For Bulgarian: leftmost one must be the one whose highest-position-prior-to-wh-movement is highest. Order amongst the others is free.

I interpret this as a kind of two-class system, where we get to single out a finite number of wh-movers for the front positions, and the remaining wh-phrases, of which there can be an unbounded number, may show up in any arbitrary order.}}} So we need to be able to get any random permutation of any arbitrary number of wh-movers, and we can’t get that by starting with a fixed hierarchy and moving stuff around. We could make the transduction non-deterministic, but that’s a major step from a formal perspective, and more importantly, I think it’s the less insightful route.

Keep in mind, with multiple wh-movement we’re generalizing from a fairly small number of movers to an unbounded number, and whenever you do that, there is a risk that your generalization is off target. Suppose, for instance, that multiple wh-movement works more like this: you get to single out three wh-phrases that go to the front, everything else is ordered by a complex metric that considers base c-command, morphological case, animacy, and prosody. Does any of the available data rule out this option? Probably not, because you’d need at least five movers to have a potential counterexample, and that’s already more than you find in the wild. If we assume that only one wh-phrase can be singled out, like in Bulgarian, then examples with two or three multiple wh-movers become more informative, and then we could test if something like this multi-variable metric fits the data.

To the best of my knowledge — which is admittedly very limited in that empirical domain — nobody has done that. I don’t blame them. From a Minimalist perspective, it is the multi-variable metric that would be weird and convoluted, whereas a purely movement-based approach is enough for the observed data because there’s no SMC to muck things up. It is only from the computational perspective that you suddenly wonder whether there are any substance-based confounds in the data and how one could tease those out given the limits of performance. It is a tricky issue, but I think it is one worth pursuing because it really changes what the key issues are about multiple wh-movement:

Multiple wh-movement is not special. In a formal sense, it is standard movement that is special because it requires us to rule out cases where a Move node is an occurrence for multiple lexical items. We don’t need elaborate accounts to explain the existence of multiple wh-movement.
Multiple wh-movement is very special. In full generality, it constitutes an unrestricted system for creating c-command relations and linearizations. This makes it similar to scrambling, which has also proven a tough nut to crack. Just as with scrambling, there might be computationally meaningful restrictions that we’ve missed because we haven’t looked at the data through this lens.

The tl;dr for a very long post

In sum, the issue of multiple wh-movement is far from settled. The received view that MGs can’t handle multiple wh-movement only holds with respect to a very strong SMC, which isn’t all that integral to the MG framework. With a relaxed SMC, MGs can handle multiple wh-movement as long as there are additional restrictions on how c-command and linearization work for the wh-phrases. The big empirical question is whether such restrictions can be found, and that would make for a wonderful joint project for theoretical and computational linguists.

References

Adger, David. 2003. Core syntax: A Minimalist approach. Oxford: Oxford University Press.

Chomsky, Noam. 1995. Categories and transformations. The Minimalist program, 219–394. Cambridge, MA: MIT Press. doi:10.7551/mitpress/9780262527347.003.0004.

Graf, Thomas. 2012. Locality and the complexity of Minimalist derivation tree languages. Formal Grammar 2010/2011, ed. by Philippe de Groot and Mark-Jan Nederhof. Lecture notes in computer science. Heidelberg: Springer. doi:10.1007/978-3-642-32024-8_14.

Graf, Thomas. 2012. Movement-generalized Minimalist grammars. LACL 2012, ed. by Denis Béchet and Alexander J. Dikovsky. Lecture notes in computer science. doi:10.1007/978-3-642-31262-5_4.

Stabler, Edward P. 1997. Derivational Minimalism. Logical aspects of computational linguistics, ed. by Christian Retoré, 1328:68–95. Lecture notes in computer science. Berlin: Springer. doi:10.1007/BFb0052152.

Mellow musings on peer review

2021-06-07T00:00:00-04:00

My last post (yes, ages ago) reflected on two issues that come up quite a bit on Martin Haspelmath’s blog, and in both cases I did not really agree with his conclusions. But there is a third issue he mentions a lot: peer review, be it for conferences, journals, or grants. Haspelmath has an impressive number of posts on the topic. The tldr is that reviews are a waste of time for reviewers, do not improve the final paper or proposal (e.g. because authors have to tack on extraneous stuff to please reviewers), and incentivize flashy presentation over substance. Moreover, reviewers are frequently forced into the role of gatekeepers who have to defend the fair maidens of Publisher Island and Conference Valley from the ravaging hordes of sub-par submissions. I do not want to directly argue for or against these points, I’m sure the prestigious outdex readership can make up its own mind. But I will say that my own experience has been a lot more positive, largely because of the field I’m in. So the following are some reflections on what I think mathematical linguistics as a field gets right with peer review (and there’s also a tiny bit about how it can sometimes go wrong).

Peer-review in mathematical linguistics is pleasantly mellow

I have published many computational papers, and the few times I got a nasty or unhelpful review was when the paper was submitted to a theoretical linguistics venue. The reviews from mathematical linguists, on the other hand, are among the most professional and level-headed I’ve ever seen. There’s four reasons for that, and I’m gonna talk about each one in painstaking detail.

Nah, not really, we can mostly ignore the first two because they don’t provide much of a learning opportunity for theoretical linguistics. First, mathematical linguistics is a small field, and that instills a very personal sense of community. Second, mathematical work is easier to evaluate objectively: are the definitions internally consistent, are the proofs correct, does the paper provide enough motivation for the work done, can you read it without going stark-raving mad. Suggestions for improvement are straight-forward, too: tighten up the writing here, change the notation a bit there, rework proof 3 to account for this special case. It’s all very cut-and-dry and, crucially, devoid of emotion or dogma. Math provides a framework where every paper could have plenty useful in it even if its priors do not match yours. You just won’t be inclined to, say, dismiss a TAG paper because you know for a fact that CCG is the only true formalism. It’s all very much in the spirit of letting a thousand flowers bloom because we know how to pick parts from one flower and crossbreed them with another one.

That’s all nice and dandy, and I wouldn’t wanna have it any other way, but it isn’t really something that theoretical linguistics can hope to emulate. We can’t really kick out 90% of all linguists to make the field feel a bit cozier (it’s still pretty cozy compared to most other fields). And as I’ve said many times before, many linguists believe that there is one right theory, usually their own, in which case any deviation from that is problematic. And that creates a very different reviewing dynamic. But there are two other factors that make mathematical linguistics reviewing different, and those can be emulated in other fields: the length of submissions, and the streamlined review process.

Submission length

Mathematical linguistics largely avoids 2-page abstracts or papers with 30+ pages, two formats that are very popular in other subfields and, coincidentally, a total pain to review. One is too short, the other too long, although journal papers at least aren’t as pointless as abstracts. Abstracts are a complete waste of everybody’s time:

The authors have to squeeze tons of information into two pages or less, and then spend hours tweaking and polishing. All that effort is single-use because abstracts cannot be easily retooled into more useful formats, like a poster, slides, or a paper. An abstract can only produce more abstracts.
The reviewers, on the other hand, have to read the damn thing at least three times to make sense of it. Basically, they have to spend time and energy decompressing all the compressing the authors had to do because for some reason the abstract had to be two pages instead of 4. And even after all that decompressing they still don’t have much to go on for feedback. Maybe some missing references or some problematic data points, but who knows if those references are missing because of ignorance or space constraints, and who knows whether a data point is actually problematic or just too complicated to be covered in two pages.

Quite simply, abstracts are too short to be easily written, read, or evaluated. The authors learn little more from the reviews than how to rewrite the abstract, and the reviewers learn even less from the abstract itself.

Btw, this also means that abstracts are an atrocious gatekeeping mechanism for conferences. If you know how to bullshit, an abstract is the perfect format for bullshitting your way past the reviewers, whereas the best research project won’t get a pass if you haven’t figured out how to condense it to two pages. This goes doubly so for work that’s not part of the mainstream, for if the reviewer doesn’t already know where you’re coming from, an abstract doesn’t give you enough space to get them ready for the ride. Net gain for authors, reviewers, and the community at large: 0 at best, negative (countable) infinity at worst.

In mathematical linguistics we don’t submit abstracts, we submit papers. But those are short papers, 8 to 16 pages depending on the venue. That’s long enough to avoid all the problems with abstracts, yet without going too far in the other direction. It’s not some 50 page monster where just opening the PDF makes your reviewer heart sink. You can actually read such a short paper two or three times in a reasonable amount of time. There is plenty of substance to sink your teeth into. There is a chance for you to meaningfully improve a piece of research, a piece of academic writing that may be consumed by others many years from now. And eventually those short papers can be compiled into a long paper that makes it into a journal — a long paper that has seen tons of feedback along the way and rests on a very solid foundation, which again makes the reviewing stage a lot more pleasant.

Single round reviewing

Now all of that by itself would already be great, but the real kicker is that if you are asked to review one of those short 8 or 16 page papers, you only review it once. Just as with abstracts, there is no revise-and-resubmit.

That one review you write is your final word, and it isn’t even binding. You rate the paper on several criteria, and you provide detailed written feedback, and then the ball is in the editor’s corner. They decide whether the paper makes the cut or not, and if it does, it is the authors’ judgment call how much they want to change. The editors always include some boilerplate request to keep the reviewer’s remarks in mind when producing the camera-ready version, but there are no mechanisms in place to enforce that exactly because the quality requirements of the venue have already been met. If the paper had issues that absolutely required revisions, it would have been rejected. And hence it is up to the authors to use the feedback they got as they see fit. There is no need to please the reviewers, no need for lengthy responses where you justify why and how you incorporated some remarks while ignoring others. You made the cut, you’re given an opportunity to make the paper as good as you can based on the feedback you got, and that’s it.

What can be copied

To sum up, the secret reviewing sauce in mathematical linguistics has four ingredients:

It’s a small field, so it’s all very intimate and sociable.
Reviewing math-heavy papers, albeit challenging on a technical level, is fairly straight-forward. It’s more craft than art.
Conferences are paper-reviewed, not abstract-reviewed.
Reviewing is a one-round affair for conference papers. Since most papers are conference papers, not journal papers, most reviews are one-round affairs.

I think the last two points actually do most of the heavy lifting, and those are exactly the ones that need not be limited to mathematical linguistics. Of course that would take quite some convincing. Reviewers are hard to come by as is, and apparently most linguists believe that reviewing 3 short papers would take a lot more time than reviewing 3 abstracts. So if anyone wanted to bring the wonders of mathematical linguistics reviewing to other subfields, it’d be an uphill battle. And I’m not saying that this is the only right way of doing things. All I’m saying is that if you share Martin Haspelmath’s concerns about peer review and, quite frankly, find it a drag no matter which end of it you’re on, there are existing systems that can be emulated.

Where there’s light, there’s a tiny bit of shadow

Now I’m not gonna pretend that absolutely everything is all sweet and gooey over here in the cotton candy land of mathematical linguistics. If you have to review a paper that’s way outside your area of expertise, that’s more painful than a comparable abstract review because you’re looking at 8 pages of gibberish instead of 2, and you have to write more than two sentences for your review. And that’s not a hypothetical: in the neighbouring crazy funky party town of computational linguistics, where the recent NLP boom means that there’s way more submissions than qualified reviewers, you will probably have to review a paper on a topic you know embarrassingly little about. In order to fix that, the ACL and other conferences have moved to increasingly elaborate, multi-stage review systems that try to make the reviewers vet each other’s reviews in order to filter out obviously sub-par reviews. It’s all very clunky and bureaucratic and, as far as I can tell, only improves the situation insofar as the organizers get to feel like they’re doing something to improve the situation. And don’t even get me started on the ACL’s rolling review initiative (well, alright, you can get me started in the comments section). So there’s no perfect solution, every system has both strengths and weaknesses. But, speaking for myself, I greatly prefer how mathematical linguistics handles things.

Next time: some actual language and computation on this blog devoted to language and computation.

Discovering Martin Haspelmath's blog

2021-04-28T00:00:00-04:00

Unbecoming as it may be for a blogging linguist, I am not particularly familiar with the overall blogosphere in linguistics. As a devoted Twitter & Facebook hermit, I am perpetually out of the loop, and I like it that way. So it is only recently that I have become aware of Martin Haspelmath’s long-running blog, thanks to a post by David Adger. There’s tons of posts, but based on the limited sample I’ve read so far, it seems that most revolve around one of three issues: terminology, innateness, and peer-review. I think the latter is actually the most interesting, but for the sake of ~~completeness~~ self-indulgence, I’ll add my $0.02 regarding the first two, leaving peer-review for a separate post.

Terminology

Haspelmath frequently talks about how X should actually be called Y or Z. By his own admission, he keeps “insisting on careful use of terminology in linguistics”. For example, one terminological quibble that keeps coming up is that typology should be called comparative linguistics. Problems with his specific terminological proposals have been pointed out in the comments there, and I don’t have much to add in that respect. I’ll just take a moment to link to xkcd and smbc, which succinctly point out the real problem with such terminological alterations: they only make things worse.

But since this is the outdex, where absolutely everything has to circle back to computation, let’s look at a recent case of such terminological confusion. Some of you might have heard of sequential functions, which are a subtype of finite-state transductions. It doesn’t matter here how they work or what they do. The important thing is that a sequential function is a finite-state transduction that satisfies a number of conditions. If we weaken one of those conditions, we get the more powerful class of subsequential functions. They are subsequential because the conditions they meet form a subset of the conditions met by sequential functions. But of course this means that the class of subsequential functions is a superset of the class of sequential functions, rather than a subset. Can you see where this is going? Some researchers felt that this is unintuitive and have started calling subsequential functions sequential, and sequential functions subsequential. The end result of that well-intended change in terminology is that now I never know what type of function people are actually talking about.

Yes, yes, you could argue that it’s just temporary growing pains and the end state will be better for everyone involved, like the shift from Python 2 to Python 3. Except that if we go through all those growing pains, perhaps it should be for terminology that isn’t just as broken: If we ever define a class that lies properly between subsequential and sequential (in that new terminology where the latter are more powerful), we will have a class that, although a subset of the sequential functions, contains some functions that are not subsequential. Wonderful!

Bottom line: terminology is messy, cannot be regulated in a top-down manner, and meddling with it makes things worse.

The grammar blueprint

Continuing the previous point, I am not a fan of the term grammar blueprint that Haspelmath uses for UG. To me, a blueprint is the very opposite of a large space of options with tremendous variation within. It makes me think of programming cookbooks like Go Programming Blueprints, which focus less on the programming language itself and more on prepackaged solutions to common problems. But whatever, I know what Haspelmath is referring to, he can call things on his blog any way he wants, and this isn’t really what I find confusing about his take on UG.

Haspelmath argues that the post-(Chomsky 2005) idea of a minimal UG undermines the universalist methodology of Minimalism, where it is still assumed that languages are largely the same (e.g. same functional projections) and that insights can be meaningfully transferred between languages. This is what the debate with Adger is about, and Haspelmath also has earlier exchanges with other linguists that touch on this, e.g. Gillian Ramchand and Elena Anagnostopoulou. There’s a few other posts that I think are informed by this view, e.g. Haspelmath’s evaluation of Laura Kalin’s work on DOM, which got an in-depth reply on Philosophy of Linguistics (another blog I was unaware of). Overall, a lot of ink has been spilled, and it is not clear to me why:

Haspelmath says that current Minimalist work makes sense under a rich-UG view, but not under a small-UG view. But with respect to the work being done, the two are interchangeable. Here is the reasoning chain under a rich-UG view:

Assumption 1 (rich): Language is an innate ability.
Assumption 2 (rich): Innateness includes structural projections, categories, and so on.
Corollary (rich): Insights from one language can be transferred to other languages.

And here is what it looks like under a small-UG view:

Assumption 1 (small): Language is an innate ability.
Assumption 2 (small): Insights from one language can be transferred to other languages.

The only thing that has changed is that Assumption 2 (rich) is gone and Corollary (rich) has been upgraded to the status of an assumption. But for anything downstream in the reasoning chain, it doesn’t matter whether transferability is an assumption or a corollary of an assumption. As far as the methodology is concerned, it makes no difference.

Now if you want to get into the weeds of whether Assumption 2 (rich) is more plausible than Assumption 2 (small), i.e. ontological commitments, knock yourself out. I will point out, though, that

scientific assumptions have to be useful, not plausible, and
Assumption 2 (small) has more support than Assumption 2 (rich) in the sense that if we consider the space of axioms that have one of those assumptions as a corollary, the space for Assumption 2 (small) is larger than that for Assumption 2 (rich).

But I really shouldn’t have brought up those ancillary points, because the only thing that matters methodologically is that none of this matters. The cognitive issues may be what motivates the program, but the analytical work is largely independent of that. There’s a reason why virtually all Minimalist textbooks include some spiel about the learning problem, syntax as a cognitive science, language as a window into mental computation, yada yada yada, yet barely any of them contain a chapter on parsing, learnability, or computation (one notable exception being Sportiche, Koopman, and Stabler’s An Introduction to Syntactic Analysis and Theory, and the list of authors should give you a clue how the computational chapter made it in). I’ve complained about that before, but that’s the way things are. And because that’s the way things are, a Minimalist’s analytical work does not hinge on any particular conception of UG.

References

Chomsky, Noam. 2005. Three factors in language design. Linguistic Inquiry 36.1–22. doi:10.1162/0024389052993655. http://dx.doi.org/10.1162/0024389052993655.

Handbook chapter on Minimalism and computational linguistics

2021-03-30T00:00:00-04:00

Aah, the soothing sound of crickets. In case you’ve been wondering about the recent radio silence at this prestigious online soapbox, my todo list finally caught up with me and I had to spend the last few weeks writing up/revising some papers that were way overdue. It was a matter of life and death — the editors were already contemplating Satanic blood sacrifices, and while I enjoy a good Black Mass as much as the next guy, I’d rather not be its subject matter. In this post I’d like to talk a bit about one of those papers, a chapter on Minimalist grammars in an upcoming handbook on Minimalism. Though I have to admit that it’s mostly a ruse to get some of you to give it a read and leave some feedback in the comments section.

Going short or going long?

The handbook is actually slated to contain two chapters on MGs. The first one, written by Greg Kobele, presents MGs as a specific incarnation of the Minimalist framework, with an emphasis on the analysis of empirical phenomena. My chapter, on the other hand, looks at the computational properties of MGs and how those relate to linguistic issues. So, mostly big-picture stuff, no specific data. In addition, my chapter was originally planned as a vignette, i.e. a short chapter of approximately 15 pages. My thinking was that Greg’s chapter would provide enough of a foundation that I could move at a brisker pace. And since the average Minimalist probably does not want to slog through 30 pages of computational discussion, keeping it short and sweet would increase readership.

But that strategy did not quite work out: the paper isn’t so short after all, and it might be too terse to qualify as “sweet”. This makes me wonder if the paper needs to be decompressed. The question is, do I remove some topics and keep the length the same, or do I expand the presentation and accept that it will be a longer paper? And if I extend it, do I want to add some other topics that wound up on the cutting room floor? It’s a handbook chapter after all, and those should be comprehensive references. Then again, it’s a handbook on Minimalism, not MGs, so one could say it doesn’t need to be that comprehensive. So that’s the first point I’m not sure about, short or long, and if the latter, just longer, or longer with extra content?

Cut content

There’s a few things that I decided to cut down or remove completely that I would love to put back in. Foremost among them is multiple wh-movement because that is an area where computational considerations yield eminently empirical questions:

Is multiple wh-movement actually unbounded, or is there a principled finite bound on how many wh-phrases can be fronted (e.g., do the moved wh-phrases have to differ in morphological case)?
If there is no upper bound on how many wh-phrases are fronted, what is the evidence that those are individual wh-movement steps, rather than, say, a big cluster of wh-phrases undergoing a single instance of wh-movement?
What determines the order of the fronted wh-phrases? C-command modulo movement? Case? And how much variation is allowed?

The answers to those questions have a huge impact on what multiple wh-movement looks like from a computational perspective.

Also missing from the chapter is scrambling. Syntacticians can’t quite agree what to do with scrambling, and neither can computational linguists. It’s a challenging phenomenon from either perspective. But again there are interesting takes on it. Joshi, Becker, and Rambow (2000), for instance, argue that TAG puts a principled cut-off point on scrambling, so that rather than attributing the unacceptability of complex scrambling constructions to performance, we can simply treat it as a hard limit of syntax. This makes concrete empirical predictions, which, to the best of my knowledge, have not been systematically tested so far. Readers of a handbook on Minimalism probably should be aware of this, it’s an excellent topic for a collaboration between theoretical and computational linguists.

And then there’s learning, which I did not say anything about in this chapter, mostly because it’s a huge can of worms with very little work that directly interacts with Minimalism. Yes, Alex Clark, Ryo Yoshinaka and others have done amazing work on learning mildly context-sensitive languages, but learnability results are a subtle issue that is difficult to present in an accessible manner without grossly distorting the work. And I honestly do not understand them well enough to contemplate their implications for the Minimalist mainstream. There’s also some early work on learning types of MGs that exhibit a very limited kind of lexical ambiguity. But the result always struck me as a bit too artificial, and though I might well be wrong about that, talking about it still requires going into the whole learnability VS language acquisition thing, which I’d like to avoid.

Literature

I really tried to be exhaustive when it comes to references. Ideally, this and Greg’s paper will jointly serve as a new, up-to-date entry point into the MG literature, so it’s important to closely track what’s out there and capture the full breadth of MG work. I feel fairly confident that I’ve got that part covered: almost one third of the paper is references. But this also means that every accidental omission weighs heavily, and while I have already noticed a few, I’m sure there’s more.

Feedback?

If you’re curious, check out the manuscript. All feedback is welcome. Email is fine, but the comments section is also ready for your perusal.

References

Joshi, Aravind, Tilman Becker, and Owen Rambow. 2000. Complexity of scrambling: A new twist to the competence-performance distinction. Tree adjoining grammars: Formalisms, linguistic analysis and processing, ed. by Anne Abeillé and Owen Rambow, 167–181. Stanford: CSLI.

The 2021 Outdex bingo

2021-01-05T00:00:00-05:00

2020 hasn’t been particularly kind to most folks, although it did work out really well for my department here at Stony Brook (more on that in some other post, perhaps). 2021 still has that “new car” smell, but it might need some help to stay fresh for the full 52 weeks. This is why I present you with my revolutionary invention: Outdex bingo.

An outdex bingo card

It works pretty much like regular bingo:

Get yourself an Outdex bingo card (if everybody uses the one above, it’ll be pretty boring).
Whenever you read an Outdex post in 2021, make sure to mark off all terms on your bingo card that appear in the post.
Keep doing that until you have filled in a row, column, or diagonal. The BINGO! field in the middle is treated as a wildcard.
Peruse the prestigious Outdex comments system to announce your bingo victory. Your reward will be determined by a D20 roll on a secret table.

But where, I hear you ask, can you get yourself an Outdex bingo card? Why, you can compile it yourself, or ask a slightly tech-savvy friend to do it for you.

How to compile

You will need Python 3.6 or newer, as well as a working LaTeX installation with a recent version of tikz.

All the files you need are in a separate folder in the Outdex repository: https://github.com/outde-xyz/website/tree/master/content/img/thomas/outdex_bingo/
Download three files from this folder:
- taglist
- outdex_bingo.py
- bingo_template.tex
In a shell, run python3 outdex_bingo.py.
In the same folder, there is now a file bingo.tex, which you can compile to a PDF in the usual fashion.

If that sounds like a pain, I’m sure there’s some newbie-friendly way to set this up in Overleaf, but I won’t look into it because I don’t even have an Overleaf account. If somebody wants to do that, though, I’ll be happy to help any way I can.

How it works

First I wrote word_counts.py as a small script to get an idea for what tags and terms commonly show up in Outdex posts. I then put 50 of those in the text file taglist. The Python script outdex_bingo.py randomly picks terms from this list and converts the selection into a tikz matrix. The matrix code is then combined with bingo_template.tex to produce bingo.tex for a specific bingo card.
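For readers who would like to see the idea before opening the repository, here is a minimal sketch of that pipeline. It is not the actual outdex_bingo.py: the function names, the parameters, and the `%%MATRIX%%` placeholder are all made up for illustration, and the real template may be filled in quite differently.

```python
import random


def make_card(terms, size=5, wildcard=True, seed=None):
    """Pick random terms and arrange them in a size x size grid (list of rows)."""
    if wildcard and size % 2 == 0:
        raise ValueError("a central wildcard needs an odd card size")
    needed = size * size - (1 if wildcard else 0)
    if len(terms) < needed:
        raise ValueError("not enough terms for a card of this size")
    rng = random.Random(seed)           # a seed makes a card reproducible
    picked = rng.sample(terms, needed)  # sampling without replacement: no duplicates
    if wildcard:
        picked.insert(needed // 2, "BINGO!")  # needed // 2 is the central cell
    return [picked[r * size:(r + 1) * size] for r in range(size)]


def to_tikz_matrix(card):
    """Render the grid as matrix rows: cells joined by &, rows ended by a double backslash."""
    return "\n".join(" & ".join(row) + r" \\" for row in card)


def fill_template(template, matrix_code, placeholder="%%MATRIX%%"):
    """Insert the matrix code into a template string (the placeholder is hypothetical)."""
    return template.replace(placeholder, matrix_code)


if __name__ == "__main__":
    # Stand-in terms; the real list would be read from the taglist file.
    demo_terms = [f"term {i}" for i in range(50)]
    print(to_tikz_matrix(make_card(demo_terms, size=3, seed=1)))
```

Sampling without replacement is what keeps a term from showing up twice on the same card, and the size and wildcard parameters correspond to the kind of knobs mentioned below.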

So if you want to mix things up a bit, there’s three places where you can make modifications:

You can add or remove entries in taglist to change what terms can appear on your bingo cards.
You can run outdex_bingo.py with different parameters to get smaller or larger bingo cards, or bingo cards without a wildcard in the middle.
You can change the tikz styles in bingo_template.tex to get a different layout for your bingo cards.

The code

If there’s interest, I can write a follow-up post that explains a little bit of what’s going on in the Python script and the tex-template. Neither one does anything fancy, but if you’re just starting out with Python and/or LaTeX you might find some useful techniques in there.

Oh, and if you check the linked folder in the github repo, you’ll also see a file outdex_bingo.tex. That was my first attempt, for which I tried to skip Python and do everything directly in LaTeX, based on code from some stackexchange posts. However, I couldn’t quite get it to work as it sometimes produces only a partial bingo card with empty cells. No idea what the problem is, maybe somebody with better LaTeX skills could take a look.

Three types of generalizations

2020-12-14T00:00:00-05:00

My post on defossilization clearly wasn’t esoteric enough, so I’m upping the ante by turning to one of the most esoteric and ephemeral issues in linguistic theory. Yes, we’re gonna talk about generalizations and what their role ought to be in how we do linguistics. Since it’s a long post even by Outdex standards, I’ll give you a tldr: I think there’s at least three types of generalization, and we shouldn’t lump them together. In particular, not every generalization has a payoff in the grammar.

We want generalizations

Let’s get to the obvious point right away: generalizations are essential for linguistics. There’s two reasons for that. One is purely utilitarian: generalizations make broad empirical predictions and keep the theory simple, both of which support scientific inquiry. The other reason is the empirical truism that language requires generalization. The infant has to generalize from a finite data sample to an infinite one, native speakers have to generalize linguistic laws to nonce forms, and so on. If the object of study necessarily involves generalizations, then our theory of the object shouldn’t be missing them. But where in our theory should the generalizations come from?

Learner VS grammar

Traditionally, linguists encode generalizations directly in the grammar. For instance, word-final devoicing isn’t a collection of segment-based rewrite rules like \[ \textit{z} \Rightarrow \textit{s} \mid \_ \$ \] and \[ \textit{v} \Rightarrow \textit{f} \mid \_ \$. \] Instead, we have a single feature-based rule \[ [+ \textit{voice}] \Rightarrow [- \textit{voice}] \mid \_ \$ \] or something along those lines. That’s easier to make sense of, and it explains why word-final devoicing tends to target a natural class of segments, rather than an arbitrary list of some voiced segments.

But this doesn’t actually need to be in the grammar, it can be handled in the learning algorithm. Suppose that the learner actually infers grammars that use segment-based rewrite rules, but in identifying the correct target grammar it relies on notions of natural classes that are similar to what we have in the feature-based rule. Then the learner would be using a kind of meta-reasoning that is not directly encoded in the grammar. In a very literal sense, you could think of this as the learner identifying a feature-based rule and then compiling it out to a bundle of segment-based rules, but that’s probably too literal because learners can reason in much more abstract ways that consider the structure of the entire hypothesis space and aren’t tied to a specific way of representing rules. Either way the result is a system that still operates according to the relevant generalization, but the generalization is no longer encoded in the grammar.
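To make the overly literal version concrete, here is a toy Python sketch of "compiling out" a feature-based devoicing rule into segment-based rules. The eight-segment inventory and its three features are drastic simplifications that I made up for illustration, and nothing here is meant as a model of an actual learner.

```python
# Toy inventory: each segment is a bundle of (heavily simplified) features.
SEGMENTS = {
    "z": {"voice": True,  "place": "alveolar", "manner": "fricative"},
    "s": {"voice": False, "place": "alveolar", "manner": "fricative"},
    "v": {"voice": True,  "place": "labial",   "manner": "fricative"},
    "f": {"voice": False, "place": "labial",   "manner": "fricative"},
    "d": {"voice": True,  "place": "alveolar", "manner": "stop"},
    "t": {"voice": False, "place": "alveolar", "manner": "stop"},
    "b": {"voice": True,  "place": "labial",   "manner": "stop"},
    "p": {"voice": False, "place": "labial",   "manner": "stop"},
}


def compile_rule(feature, old, new, inventory):
    """Expand a feature-changing rule into a dict of segment-to-segment rewrites."""
    rules = {}
    for seg, feats in inventory.items():
        if feats[feature] != old:
            continue  # the rule does not target this segment
        target = {**feats, feature: new}
        # Find the segment that differs only in the changed feature, if any.
        for other, other_feats in inventory.items():
            if other_feats == target:
                rules[seg] = other
    return rules


def apply_word_finally(rules, word):
    """Apply the segment-based rules to the last segment of a word only."""
    if word and word[-1] in rules:
        return word[:-1] + rules[word[-1]]
    return word


devoicing = compile_rule("voice", True, False, SEGMENTS)
print(devoicing)
print(apply_word_finally(devoicing, "haz"))
```

The resulting dictionary mentions no features at all, yet it rewrites exactly the natural class of voiced obstruents because of how it was built. The natural class lives in `compile_rule`, not in its output.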

Wug tests can be rethought along the same lines. The grammar doesn’t need to encode how to handle new forms. Instead, the grammar could be a generalization-free bundle of rules that require specific diacritics to trigger. For instance, nouns would be split into different subgroups noun-z, noun-s, noun-zero, noun-ablaut and so on, depending on what kind of plural they take. When confronted with a nonce noun, the learning algorithm’s job is to assign it the correct subtype (and perhaps add it to the lexicon for future use). The way the learner would figure out the right subtype might look very similar to how linguists encode this in the grammar. If so, then we’d still have our familiar generalization, but it would be in the learner, not the grammar. Again, zero generalization in the grammar, but the whole system still obeys the relevant generalization.

Grammars without generalization are simpler

Okay, at this point you’re probably wondering what the freaking point is, we’re just shifting around generalizations between two abstract entities — learner and grammar — that might not be cognitively distinct to begin with. And you’re essentially right. But follow along for a bit more, this little thought experiment will teach us something important about generalizations: they’re costly.

Let’s look at total reduplication. People usually claim that total reduplication cannot be handled by finite-state automata. But that claim only holds with respect to a larger framework of assumptions. For any given word, total reduplication can be made part of its representation. Here’s a lexical entry for cat that also allows for the total reduplicant cat cat:

Reduplication in a finite-state automaton

Do this with all words that can undergo total reduplication, and you have a grammar that generates all the correct surface forms, including reduplicated ones. Hmm, so total reduplication can be handled with finite-state machines after all.

Now of course there’s a huge array of arguments for why this is a dumb way to do it:

This doesn’t work for nonce forms.
We could have just as well written down a representation that builds palindromes.
We could limit reduplication to words that contain a prime number of consonants.
… and so on.

But these arguments are all about generalization, and, at the risk of repeating myself, that can be handled in the learner. The learner could have a specific reduplication template for creating those representations. Feed in a word, get out an FSA-representation that also allows for reduplication. Then the concerns above disappear:

This template can be applied to a nonce form just like a total reduplication rule would be in the grammar-based view of standard linguistics.
The template could be limited so that it cannot build palindromes.
The template might not be able to detect if the number of consonants is prime.
… and so on.

And this leaves us with a system that contains the relevant generalizations yet uses finite-state machinery for total reduplication.
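Here is a rough Python sketch of such a template, assuming words are plain strings of symbols and the reduplicated form is simply the word written twice. The dictionary encoding of the automaton and both function names are my own choices for illustration, not a standard format.

```python
def reduplication_fsa(word):
    """Build a chain-shaped automaton that accepts exactly `word` and `word + word`."""
    n = len(word)
    # State i means "i symbols read so far"; after n symbols the word starts over.
    transitions = {(i, word[i % n]): i + 1 for i in range(2 * n)}
    return {"start": 0, "accepting": {n, 2 * n}, "transitions": transitions}


def accepts(fsa, string):
    """Run the automaton on the string; a missing transition means rejection."""
    state = fsa["start"]
    for symbol in string:
        key = (state, symbol)
        if key not in fsa["transitions"]:
            return False
        state = fsa["transitions"][key]
    return state in fsa["accepting"]


cat = reduplication_fsa("cat")
print(accepts(cat, "cat"), accepts(cat, "catcat"), accepts(cat, "cattac"))

# The template applies to nonce forms just as well.
wug = reduplication_fsa("wug")
print(accepts(wug, "wugwug"))
```

Each automaton is a finite chain of states with no copying mechanism whatsoever. The copying happens once, inside `reduplication_fsa`, and the fact that this function cannot produce a palindrome automaton is a property of the template rather than of any individual lexical entry.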

By the way, the same can be done in syntax. Maybe the grammar is just a finite-state device that is built from a mildly context-sensitive system, with a fixed cut-off point on unbounded constructions. Maybe the cut-off point is usually set to a low value like 3, but when you really hunker down and spend a lot of resources on understanding a more complex sentence, the learner produces a new finite-state grammar on the fly with a higher cutoff point.
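As a toy illustration of the cut-off idea, the sketch below uses the context-free pattern a^n b^n as a stand-in for an unbounded nested dependency (a real mildly context-sensitive system would be a lot more involved) and compiles it into a finite-state automaton that only goes up to depth k. The state encoding and function names are hypothetical.

```python
def compile_with_cutoff(k):
    """Build an automaton for the finite language of a^n b^n with 1 <= n <= k."""
    transitions = {}
    for i in range(k + 1):
        if i < k:
            # Still below the cut-off: one more a may be read.
            transitions[(("a", i), "a")] = ("a", i + 1)
        if i >= 1:
            # The first b leaves i - 1 further b's to be read.
            transitions[(("a", i), "b")] = ("b", i - 1)
    for j in range(1, k):
        transitions[(("b", j), "b")] = ("b", j - 1)
    return {"start": ("a", 0), "accepting": {("b", 0)}, "transitions": transitions}


def accepts(fsa, string):
    """Run the automaton on the string; a missing transition means rejection."""
    state = fsa["start"]
    for symbol in string:
        key = (state, symbol)
        if key not in fsa["transitions"]:
            return False
        state = fsa["transitions"][key]
    return state in fsa["accepting"]


everyday = compile_with_cutoff(3)
print(accepts(everyday, "aaabbb"), accepts(everyday, "aaaabbbb"))

# "Hunkering down" amounts to recompiling with a higher cut-off.
careful = compile_with_cutoff(4)
print(accepts(careful, "aaaabbbb"))
```

The number of states grows with k, so the compiled grammar trades size for a weaker machine type. Whatever generalizes across different values of k sits in `compile_with_cutoff`, not in any of the automata it returns.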

In both cases, these compiled out grammars are less powerful than the template they’re created from. And that’s because they do not have to carry the burden of generalization. What this tells us is that generalization is costly. And when something is costly, you shouldn’t use it unless it is worth the cost. So is generalization worth that cost?

Three types of generalization

I think there’s at least three types of generalization.

M[eta]-generalizations
Those are generalizations that researchers are aware of, but that aren’t explicitly encoded in the cognitive machinery. For instance, tendencies about how context and memory load affect processing aren’t explicitly encoded in the human parser, they’re an emergent property of its behavior. The same goes for things like Zipf’s law, which is a statement about the relation between word types and word token frequencies; that’s an interesting generalization, but it’s not a law of the grammar.
F[rugal]-generalizations
Frugal generalizations make the overall system less taxing on cognitive resources. For instance, the finite-state machinery you get from the learner’s context-free template might be so large that it’s easier to just use the original grammar for parsing. Similarly, a mildly context-sensitive formalism may provide a more succinct grammatical description than a CFG, and since parsing performance for sentences with less than ~50 words is largely dependent on grammar size, optimizing grammar size is preferable to optimizing grammar complexity. Or a specific phonological phenomenon may be actually SL-5 in a given language, but if you generalize to strings of unbounded length you get a much more succinct TSL-2 description.
L[earner]-generalizations
These are generalizations that pertain to the whole class of natural languages. Most linguistic generalizations are of this form. Going back to the example of word-final devoicing, the reason linguists prefer the feature-based rewrite rule is because of what it says about the space of possible devoicing processes. If you only consider a single fixed language, there is no difference between the feature-based rule and the segment-based one.

Side-note: Depending on your ontological commitments, everything may be a meta-generalization. If you think language is a chaotic cacophony of neurons firing according to the laws of physics, then everything is an emergent property. But even then it is methodologically useful to distinguish between M-generalizations, F-generalizations, and L-generalizations.

Now I haven’t quite figured out how far it makes sense to push this, but it seems to me that these three types of generalizations should be approached in very different ways. F-generalizations are prime candidates for generalizations that should be encoded directly in the grammar because they are useful for the stuff that builds on the grammar, i.e. parsing and production. L-generalizations have the super-creative name they do because I think they can reasonably be outsourced to the learner. And M-generalizations have no business being in either the grammar or the learner.

L-generalizations and complexity

I don’t think many readers of this prestigious blog would complain about keeping M-generalizations out of the grammar, nor would there be an uproar against keeping F-generalizations in the grammar. The tricky one is L-generalizations, in particular if we put on our generative capacity hat (if you don’t own one, you may borrow mine for a few minutes).

When mathematical linguists make claims of the form “phenomenon X is at least in complexity class C”, those claims often build on hidden L-generalizations. A good example of that is the claim in Jardine (2016) that unbounded tone plateauing is not weakly deterministic (what exactly this means is completely irrelevant for my point, so don’t worry about the fancy terminology). As a blanket claim about how unbounded tone plateauing works, the statement is correct. But once you look at the individual languages that exhibit unbounded tone plateauing, each one displays a confound that makes the phenomenon weakly deterministic in this language. The confounds differ across languages, but each one invariably displays such a confound. Maybe that’s a coincidence. Or maybe it’s something fundamental: maybe L-generalizations are allowed to exhibit a certain level of complexity, but in the end it must all compile out to a system whose complexity does not exceed some much lower complexity threshold. Basically this: “Hey learner, grammar here! If you want to get all fancy-schmancy with your high-brow pie in the sky stuff, that’s cool, different strokes for different folks. But in the end please break it down to simple principles I can actually enforce.”

I’ve been idly toying with the idea that something along those lines may be going on with reduplication. Reduplication tends to be subject to additional constraints and restrictions, it’s rarely a straightforward affair. There’s restrictions on what kind of stems can get reduplicated, what can be targeted by partial or total reduplication, and so on. I won’t pretend to have a good command of the empirical landscape on reduplication, but the impression I got from several discussions with Hossep Dolatian is that it’s all pretty convoluted. That’s in stark contrast to the current models of reduplication, which are very elegant. Too elegant, perhaps. It makes me wonder why reduplication isn’t a much more free-wheeling operation, something every language throws around left and right in a very principled manner without many exceptions or additional restrictions. Well, the culprit could be a hidden L-generalization. If you look at all the cross-linguistic reduplication data as a single, natural phenomenon that requires a single, unified model, then you’re dealing with something fairly complex. But maybe things are different if we consider each language in isolation. Maybe each language has enough confounds that allow us to come up with something less complex, albeit more convoluted.

I have no idea if that’s a plausible scenario. But it’s a scenario that I find highly fascinating, largely because it would put a new spin on overgeneration. Once you have a computational model of an attested phenomenon, you invariably also have the power to handle other phenomena that are not attested. Why, then, don’t those occur? It could be that the attested phenomenon differs from those unattested phenomena in that it can be reduced to something much simpler once we cut out the L-generalization and look only at the language in isolation. Mathematical linguistics isn’t well-equipped to pursue this idea at this point. This perspective requires keeping track of all idiosyncratic properties of the language, which means a lot of moving pieces (some of which we may be unaware of because of missing data). The more moving pieces, the harder it is to construct a proof: you quickly reach a point where you’re better off running simulations, and that has its own giant bundle of drawbacks.

Grammar VS grammar

As if this topic weren’t already esoteric enough, it’s further complicated by the ambiguity of the term grammar. From a computational perspective, a grammar is a finite description of a possibly infinite set of objects, which could be strings, trees, graphs, input-output mappings (in the case of synchronous grammars), and so on. It’s what you feed into your parsing schema to get a parsing system, it’s what you use to reason about the set of objects, and so on. This is how I’ve been using the term grammar in this post so far.

The learning algorithm interacts with this notion of grammar but is different. For instance, strictly local grammars are useful for understanding how a learner could identify a strictly local string language in the limit from positive text, but at the end of the day the knowledge is in the learner, not the grammar. If the grammars are segment-based instead of feature-based, that doesn’t mean that the learner has to be segment-based, it can use higher order reasoning to maneuver the hypothesis space. There is no direct coupling between the description format of the grammar and the reasoning of the learner.
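For concreteness, here is a minimal Python sketch of the simplest learner of this kind: it collects every k-factor (substring of length k, with edge markers) that it sees in the positive data, and the resulting set is the strictly local grammar. I am assuming that strings are sequences of single-character symbols and that `#` never occurs in the data; the function names are mine.

```python
def k_factors(string, k=2, edge="#"):
    """Return the set of all length-k substrings of the edge-padded string."""
    padded = edge * (k - 1) + string + edge * (k - 1)
    return {padded[i:i + k] for i in range(len(padded) - k + 1)}


def learn_sl(sample, k=2):
    """The learner: the grammar is the union of all k-factors seen so far."""
    grammar = set()
    for string in sample:
        grammar |= k_factors(string, k)
    return grammar


def generates(grammar, string, k=2):
    """A string is well-formed iff every one of its k-factors is permitted."""
    return k_factors(string, k) <= grammar


grammar = learn_sl(["ab", "abab"])
print(sorted(grammar))
print(generates(grammar, "ababab"), generates(grammar, "aab"))
```

The grammar is nothing but a flat set of permitted factors. The step from two observed strings to infinitely many unobserved ones comes entirely from how `learn_sl` and `generates` put that set to use, and a smarter learner could add feature-based reasoning on top without changing the format of the grammar at all.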

Most theoretical linguists have a richer notion of grammar, something that is more of a general description language, or almost like an API to all aspects of language. Syntacticians don’t write out a learning algorithm or a parsing algorithm, they enrich the grammar with generalizations that are meant to aid learning, processing, and so on. The grammar is the universal locus of generalization. SPE is a prime example of that as its description language was also supposed to provide a learning/naturalness metric for phonology. That’s a-okay in my book, from a methodological perspective the kitchen-sink approach can sometimes be easier to work with than having things scattered across many different components, so it’s good to have both options on the table.

But when we mathematical linguists engage in the process of analyzing the computational complexity of grammars and linguistic phenomena, we should take care to disentangle the different types of generalizations that theoretical linguists put in their grammars. If putting an L-generalization into the grammar changes the complexity picture, that is not the same thing as an F-generalization increasing complexity. One can be jettisoned for more efficient processing, the other can’t. If you’re making a claim about the complexity of a cross-linguistic phenomenon, that does not mean that there is a single language where the phenomenon is actually that complex. And if your complexity claim hinges on an M-generalization, then it really tells us very little about language as a cognitive object.

A sloppy wrap-up

Let me wrap up with a point of clarification: I’m not saying that only F-generalizations are fair game and L-generalizations should be excised from mathematical linguistics. Nor that we can ignore complexity in the learner — heck, there might not be a cognitive difference between the grammar and the learner, just like there might be no difference between the grammar and the parser. The complexity of building a grammar (the learner’s job) is just as important as grammar complexity, but the two are not equally important for each task. In the total reduplication example above, something has to build those finite-state automata, and the complexity of that process is what people have in mind when they say total reduplication is not finite-state. But that is a one-time cost, so if you want an efficient processing system that can handle reduplication, you can pay that cost once in a precompilation step and after that you won’t have to pay it again until a new entry needs to be added to the lexicon. The cost of L-generalizations doesn’t need to be paid all the time, you can take care of it once and then forget about it.

Similarly, we should not mistake L-generalizations for claims about specific processes in specific languages. Language is a biological system. Biology tends to be messy, and it tends to implement the same idea in many different ways. Suppose a linguist from the future were to show up on your doorstep and tell you that there actually is no such thing as reduplication, and in fact it’s a cluster of language-specific processes that are all distinct yet look the same at a sufficient level of abstraction. Would you be shocked? I wouldn’t. It’s a useful piece of information that does not undercut the idea of reduplication as something worth studying at various levels of generalization. We just have to make sure we know which level each theorem operates at.

References

Jardine, Adam. 2016. Computationally, tone is different. Phonology 33.247–283. doi:10.1017/S0952675716000129. https://doi.org/10.1017/S0952675716000129.

Representations as fossilized computation

2020-11-30T00:00:00-05:00

Okay, show of hands, who still remembers my post on logical transductions from over a month ago? Everyone? Wonderful, then let’s dive into an issue that I’ve been thinking about for a while now. In the post on logical transductions, we saw that the process of rewriting one structure as another can itself be encoded as a structure. Something that we intuitively think of in dynamic terms as a process has been converted into a static representation, like a piece of fossilized computation. Once we look at representations as fossilized computations, the question becomes: what kind of computational fossils are linguistic representations?

The problem with category features

Let’s start with a concrete example that I think I’ve got figured out. The example will make it clearer why it is important to study representations as fossilized computations.

Some of you might be familiar with my Cassandra-like theatrics when it comes to category features. Category features allow you to trick c-selection into doing all kinds of linguistically unsavory things for you. Suppose we have a language where the only category features are O and E. Spoiler: those are short for odd and even, and that’s the problem. Now suppose that the language obeys the following rules:

Every head has exactly one category feature.
If a head takes no arguments, its category feature is O.
If a head takes exactly one argument, then
1. the head’s category feature is E if the argument has category feature O;
2. the head’s category feature is O if the argument has category feature E.
If a head takes exactly two arguments, then
1. the head’s category feature is E if the two arguments have distinct category features;
2. the head’s category feature is O if the two arguments have the same category feature.
All sentences are OPs.

These rules can be easily lexicalized in a manner that’s fairly innocuous from a linguistic perspective. Just like we may say that show is an N that selects nothing, or a V that selects two D-heads or a D-head and a P-head, we may say that show is either an O or an E and that its subcategorization frame depends on the choice of category. So a priori there isn’t anything linguistically suspect about this. Yet the resulting system is one that’s highly unnatural: if all those rules are followed, then we can only build syntactic structures that contain an odd number of lexical items.

Here’s two simple dependency trees to illustrate how this works. Each head’s category is listed in parentheses. The left tree is an OP, whereas the right one is an EP and thus illicit. That’s not something we want our formalism to do because, well, languages don’t do it.

Only trees with an odd number of lexical items are OPs

We’re abusing category features as an information buffer that stores how many lexical items a subtree contains. C-selection then carries out a simple even/odd calculus. For instance, if a head H has two arguments, one of which is O and one is E, then the number of lexical items in the subtree rooted in H must be O + E + H = odd + even + 1 = even = E. Everything here is strictly local or lexicalized. We’re doing nothing special with the grammatical machinery, yet the result is unlike anything we find in language.
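If you want to convince yourself that the rules really do count, here is a minimal Python sketch. The `Head` class, the function names, and the two toy trees are made up for illustration; the only thing taken from the rules above is the case distinction inside `category`, which looks at nothing but the categories of a head's arguments.

```python
from dataclasses import dataclass, field

@dataclass
class Head:
    # Hypothetical node type for a dependency tree: a word plus its arguments.
    word: str
    args: list = field(default_factory=list)

def category(head):
    """Assign O or E by the rules above, using only the arguments' categories."""
    cats = [category(arg) for arg in head.args]
    if len(cats) == 0:
        return "O"
    if len(cats) == 1:
        return "E" if cats[0] == "O" else "O"
    if len(cats) == 2:
        return "E" if cats[0] != cats[1] else "O"
    raise ValueError("the rules only cover heads with at most two arguments")

def size(head):
    """Number of lexical items in the subtree rooted in head."""
    return 1 + sum(size(arg) for arg in head.args)

# Two toy trees (words are placeholders, not the ones from the figure).
three = Head("a", [Head("b"), Head("c")])
four = Head("a", [Head("b", [Head("c")]), Head("d")])

for tree in (three, four):
    print(size(tree), category(tree))
```

The category that comes out at the root is O exactly when `size` returns an odd number, even though `category` never counts anything.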

This is just the tip of the iceberg. A huge class of constraints can be made local in this manner, including even transderivational constraints. Virtually all constraints in the syntactic literature can be pushed into the category system. More troublingly, this also holds for insane variations of natural constraints, e.g. a constraint that’s exactly like the Adjunct Island Constraint except that it is enforced iff 1) the size of the mover is less than 17 and either 2.1) there are at least 4 instances of movement in the whole tree, or 2.2) the number of violated island constraints is a multiple of 2 if the tree contains a deverbal noun, and a multiple of 5 otherwise. Given this kind of insanity, it’s no longer too shocking that we can also retool c-selection as a means for feature percolation, which in turn allows us to replace many instances of movement with a base merge mechanism that’s no longer subject to island constraints. It’s overgeneration galore.

Most linguists I show this to agree that it isn’t desirable, but they also have an easy solution: the category system is only allowed to contain the familiar categories V, N, A, P, T, C, and so on. I am, to put it politely, not a fan. First of all, substantive universals aren’t all that satisfying, in particular if they’re just a list with no internal structure or regularities. Second, I don’t believe that this actually works. Once you get down to the nitty-gritty, the number of categories you need really blows up. Not all adjectives are alike, for instance:

the ugly president
the alleged president
The president is ugly.
*The president is alleged.

There’s many subtle differences of this kind, and since these differences can interact you end up with combinatorial explosion, which means the number of categories quickly gets very large. Now you might say we can keep the standard category system and do everything else with constraints, but that’s exactly the thing: constraints are computation, and categories are fossilized computation. You’re not really changing what’s going on, you’re only changing what you bake into the representation.

The odd-even system above is just a fossilized computation of odd-even counting. The insane island constraint above can be pushed into c-selection because there is a particular way of fossilizing its computation. First we translate the constraint into a finite-state tree automaton, and then we record how this automaton can transition from one state configuration into another. We then choose our categories and c-selection rules in such a way that they mimic these state transitions. If you want all the details, check out Graf (2017) (https://www.glossa-journal.org/articles/abstract/10.5334/gjgl.212/), which is a more accessible version of Graf (2011) and Kobele (2011).

The bottom line is this: there is a principled way to fossilize computations into the category system. That’s why we can trick categories and c-selection into doing stuff nobody wants them to do, giving us overgeneration galore. Short of stipulating a fixed set of categories, which is inflexible and unsatisfying, there isn’t a good way of preventing this because we have no good way of telling a natural category system from an unnatural one.

Wanna fix category features? Fix your representations!

For the longest time my solution to the issue above was to completely abandon the very notion of category features. Everything should be done with constraints because constraints give us well-defined notions of complexity that we can use to separate the natural from the unnatural. But this was both too radical and (which I have realized only now) not radical enough. It is too radical because categories are a fundamental component of all syntactic theories, ripping them out is gonna make things more complicated to work with. And it is not radical enough because it doesn’t plug the real loophole: representations.

At the end of the day, language is about computation. The computation is factored into two components, which are the representation and the constraints/operations that apply to this representation. The dichotomy of category features and constraints shows that this is not a hard separation, we can take constraints and push them into the representation. But representations are a black hole: there is no systematic way of measuring the complexity of a representation. At least that’s what I thought, until I started to think of representations as fossilized computations. Once you make that shift in perspective, the issue is trivial. In order to measure the complexity of a representation, we can measure the complexity of the fossilization computation that converts computation into representation.

In the limited case of category features, this is easy enough to figure out. Suppose we start out with a representation that lacks all features related to c-selection, so at the very least category features, and possibly selector features if that’s how you encode c-selection. In this scenario, how hard would it be to correctly annotate each lexical item with its category features (and selector features)? This is sketched in the figure below (dependency trees are used merely for convenience, the basic idea applies to any kind of representation).

Measuring the complexity of category systems via defossilization

The process isn’t trivial because one lexical item could be annotated in various ways, as in the case of show that I mentioned earlier. But natural languages seem to use category systems that minimize this indeterminacy. As far as I can tell, the category of a lexical item can be reliably inferred from its local context in the tree. I’m not quite sure how large the context is that needs to be considered; that’s an interesting empirical question, one that I hope to take a crack at soon. But whatever the exact bound is, there is some finite upper bound on the size of the context, and in technical terms this means that natural languages use category systems that are strictly $k$-local.¹ More precisely, their category systems are input strictly local because we only need to consider the context in the input representation that we are adding category features to. The details can be found in Graf (2020) (pdf here), the first half of which is actually meant to be accessible to a general audience and can be read in less than 20 minutes (just ignore the second half). But details don’t matter here. Just remember: category systems of natural languages are input strictly local.

Now compare that to the odd-even system. Here, there is no upper bound on the context. In order to determine whether a head should get O or E, I have to know whether its subtree contains an odd or an even number of nodes. There’s two ways to do that.

We can look at the entire subtree. Since there is no upper bound on how large a subtree may be, this is not strictly local. In fact, even first-order logic can’t assign the correct category features. You need monadic second-order logic to get going, marking this as a very complex category system.
Alternatively, we can process the tree bottom-up, adding category features as we go. In this case, determining the category of a head only requires us to look at the category features of its arguments. This is strictly local because we only need to consider a context of bounded size. But the context is different: now we have to look at the output we are producing, rather than the input tree without category features. This is output strictly local.

Whichever route we take, the odd/even system is now formally distinct from the category systems we find in natural languages. And this makes me happy; we no longer need to stipulate a fixed set of categories or rely on fuzzy criteria of why a given category system is or is not natural. Instead, we have a rigorous procedure for measuring the complexity of category systems: we “defossilize” computation by ripping it out of the representation, and then we check how difficult it would be to put it back in. While the procedure may seem odd at first, I find it very encouraging that what we find in natural language looks very reasonable from this perspective.

Another example: Morphology

The “defossilization” strategy can be easily applied to other domains. Here’s an example from morphology.

There’s many ways one may think about morphology. In the item-and-arrangement tradition, we may posit an underlying representation with stems and abstract morphemes in the correct positions. This underlying representation is then rewritten as a surface form. Here is an example from Krongo (Kadugli; Sudan) that I took from WALS.

Underlying representation: Instr-baton
Surface form: á-kÙUfi

But where does this underlying representation come from? How come that the case affix appears correctly before the stem? Looks like our representation already contains some fossilized computation. Let’s see, then, how hard it would be to construct this representation.

First we might ask how hard it is to get the underlying representation from the derivational history, where case is added to the stem later on. This is basically an item-and-process view of morphology.

Derivation: baton-Instr

Switching the order of these two guys is input strictly local. But this is a pretty simple case. Let’s consider a more complicated scenario.

Suppose our language allows for recursive prefixation and suffixation, and that the two can depend on each other. For example, a prefix foo- turns verbs into nouns, and a suffix -bar turns nouns into verbs. So you could have something like

Underlying representation: foo-[[foo-[[foo-go]-bar]]-bar]

Yet the derivation for this would be as follows:

Derivation: go-foo-bar-foo-bar-foo

How hard is it to compute the underlying representation from this derivation? Not too difficult, actually. I won’t go into details here, but this can be done with a transduction definable in first-order logic (go ahead, give it a try, it’s not too difficult). This entails that it can be computed by a 2-way finite-state transducer, and Hossep Dolatian and Jeffrey Heinz have argued recently that these transducers provide a good fit for morphology. As always, their argument is a bit more nuanced and defines several restricted subtypes of these transducers, but I’ll completely gloss over this because the post is already plenty long. The interesting point is that the difference between an item-and-arrangement view with underlying representations and an item-and-process view with derivations is just a single transduction of a complexity class that we might need anyways for morphology.
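To make the mapping concrete, here is a small Python sketch that rebuilds the bracketed underlying representation from a derivation string. To be clear, this is a plain procedural loop, not a first-order transduction or a 2-way transducer, so it says nothing about complexity. The function name, the `PREFIXES` set, and the bracketing convention (wrap everything except a bare stem) are my assumptions, chosen to match the examples above.

```python
# Assumption: which affixes are prefixes; everything else counts as a suffix.
PREFIXES = frozenset({"foo", "Instr"})

def underlying(derivation, prefixes=PREFIXES):
    """Turn 'stem-affix1-affix2-...' into a bracketed underlying form."""
    stem, *affixes = derivation.split("-")
    form = stem
    for i, affix in enumerate(affixes):
        # A bare stem needs no brackets; any complex form gets wrapped.
        base = form if i == 0 else f"[{form}]"
        form = f"{affix}-{base}" if affix in prefixes else f"{base}-{affix}"
    return form

print(underlying("baton-Instr"))
print(underlying("go-foo-bar-foo-bar-foo"))
```

Each affix is placed by looking only at itself and at the form built so far, which is the intuition behind the claim that the reordering is cheap.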

Ah, but now you wonder where the derivation is from. We don’t have to take that as a representational primitive either. We can just start with a stem as the input, and have that non-deterministically rewritten as a derivation. In our example above, go could be mapped to an infinite number of derivations:

go
go-foo
go-foo-bar
go-foo-bar-foo
and so on

This can be done by a non-deterministic finite-state transducer with epsilon-transitions (holy jargon, Batman).

Non-deterministic 1-way FST for morphological derivations

The non-determinism is only needed to choose between the many different derivations that are available for any given stem. Given a fixed choice, the transducer would be deterministic. I haven’t completely worked through this, but it seems to me that this is a very generous upper bound. The pattern above would actually be input strictly local if those transductions could do $\varepsilon$-transitions. So we might be able to make do with very little power, getting us into the range of transductions that Hossep and Jeff identify.
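Here is a rough Python sketch of that enumeration. It is not an implementation of a finite-state transducer, just a generator that walks through the outputs such a machine could produce for a stem. The two states `V` and `N` and the `TRANSITIONS` table are my own encoding of the assumption that foo turns verbs into nouns and bar turns nouns into verbs.

```python
from itertools import islice

# Hypothetical state table: in state V we may add foo and end up in N,
# in state N we may add bar and end up back in V.
TRANSITIONS = {"V": ("foo", "N"), "N": ("bar", "V")}

def derivations(stem, state="V"):
    """Yield every derivation for stem, shortest first, without end."""
    form = stem
    while True:
        # The non-deterministic choice is simply whether to stop here.
        yield form
        affix, state = TRANSITIONS[state]
        form = f"{form}-{affix}"

print(list(islice(derivations("go"), 4)))
```

Once you fix how many steps to take, there is exactly one output, which is the sense in which the non-determinism only picks among derivations.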

If all of that is on the right track, we can bootstrap morphological representations from nothing but the stem. And doing so doesn’t really change the complexity picture because we’re staying within classes we would need anyway. Morphological representations can be completely defossilized, turning them into pure, pristine computation.

So much work left to be done

At this point I’m sure you’re all very eager to do some defossilization of your own, so here’s a list of hot button issues that still need to be solved:

Movement features
That’s the big one. In Minimalist grammars, half of movement is actually figuring out what movement features to put where. Should this wh-phrase get to move, or this one? Movement features fossilize a lot of computation that syntacticians deeply care about, e.g. relativized minimality. We want to have a good idea of how these computations work. This might also provide us with new explanations as to why these conditions on movement should hold in the first place.
Compounds
I gave you such a nice picture of morphology, but unfortunately it only works if we ignore compounds. Extend the example above to a derivation with multiple stems that each carry their own affixes, and you’re in trouble. And with compounds, there seems to be no way around tree structures, which takes me to the next point.
Defossilizing trees
Morphological representations can be summoned out of thin air with nothing but the stem as input. But with trees, things are tricky. What should be the input? A numeration with all lexical items? There’s tons of ways to combine those, yielding vastly different trees. And do those lexical items get to carry features? Perhaps an LF-string would be a better input, and then we have to reverse engineer that into a tree. But that’s basically parsing, and the reason parsing is studied with parsers rather than transducers is that we have no model of string-to-tree transductions that fits the automata-theoretic view. Yeah, trees are tricky, and I’m not sure what to do with them.
Autosegmental structure
There has been lots of work recently on the subregular complexity of autosegmental phonology, driven mostly by Adam Jardine and his group at Rutgers. Here, too, we would like to know what kind of computational fossils these additional structures are. Even though graphs might seem more complex than trees, I think autosegmental structures are actually easier because they’re still pretty “stringy”. The simplest case of autosegmental structures is TSL, where it is obvious that the tier is the result of a strictly 1-local transduction. Making this transduction more complex yields more powerful extensions of TSL. The autosegmental structures I have seen so far also seem local in a specific way, all we need at this point is a formal model of strictly local string-to-graph transduction. Once we have that, I think it will fit the bill just fine.

As you can see, plenty of work for the interested researcher. I think this is a very useful perspective, and one that really gets us closer to issues linguists care about. There’s many linguistic debates that seem pointless from a formal perspective because the representations provide too many loopholes. Defossilization gives us a firm grip on representations, and that’s what we really need at this point. Subregular complexity has allowed us to tighten the constraint space, now we have to rein in representations.

References

Graf, Thomas. 2011. Closure properties of Minimalist derivation tree languages. Ed. by Sylvain Pogodalla and Jean-Philippe Prost. LACL 2011. Lecture notes in artificial intelligence. Heidelberg: Springer. doi:10.1007/978-3-642-22221-4_7. https://dx.doi.org/10.1007/978-3-642-22221-4_7.

Graf, Thomas. 2017. A computational guide to the dichotomy of features and constraints. Glossa 2.1–36. doi:10.5334/gjgl.212. https://dx.doi.org/10.5334/gjgl.212.

Graf, Thomas. 2020. Curbing feature coding: Strictly local feature assignment. Proceedings of the Society for Computation in Linguistics (SCiL) 2020.

Kobele, Gregory M. 2011. Minimalist tree languages are closed under intersection with recognizable tree languages. Ed. by Sylvain Pogodalla and Jean-Philippe Prost. LACL 2011. Lecture notes in artificial intelligence. doi:10.1007/978-3-642-22221-4_9. https://doi.org/10.1007/978-3-642-22221-4_9.

Things work slightly differently if you think in terms of Distributed Morphology. In that case, we start out with a reduced tree that only contains the roots, and we have to figure out what functional material to insert above those roots. I believe that this does not change the strictly local nature of category systems in natural languages, but it might affect how much context you need to take into account. For instance, if you’re an old school Minimalist and your syntactic structure includes fully inflected lexical items, then you don’t need any context to figure out that waters is a verb, not a noun. If all you have is an uninflected root, you need to dig a bit deeper.

Synchronous movement: What could go wrong?

2020-10-12T00:00:00-04:00

I know I promised you guys a follow-up post on logical transductions and the status of representations, but I just have to get this out first because it’s been gnawing at me for a few weeks now. There’s been some limitations of the subregular view of syntax in terms of movement tiers, and I think I’ve found a solution, one that somehow ends up looking a bit like the system in Beyond Explanatory Adequacy. The thing is, my solution is so simple that I fear I’m missing something very basic, some clear-cut empirical phenomenon that completely undermines my purported solution. So, syntacticians, this is your opportunity to sink my current love child in the comments section…

The problem with movement tiers

As you might remember, the basics of movement can be modeled as local constraints on movement tiers. The idea is that we look at a syntactic derivation, represented as an MG dependency tree. For each movement type, i.e. wh, topicalization, subject movement, and so on, we consider only those nodes that participate in this kind of movement as either the head of the moving phrase or the head that provides the landing site. This information is assumed to be encoded via features; for instance, MGs use licensee features $\mathrm{f^-}$ for the mover and licensor features $\mathrm{f^+}$ for the landing site. In such a system, it is very easy to find the relevant nodes for each tier. On each $\mathrm{f}$-tier, we then require that

every $\mathrm{f^-}$ has an $\mathrm{f^+}$ mother, and
every $\mathrm{f^+}$ has exactly one $\mathrm{f^-}$ among its daughters.

This ensures a 1-to-1 match between movers and landing sites.

Dependency tree and tiers for Mary wonders which car Sue bought

That’s the general idea, and imho it’s an intuitively pleasing one. But this actually does not work in the general case, at least for MGs. In MGs, a head can have multiple licensee features, and those features are linearly ordered. For instance, if which is the head of a subject wh-phrase, its string of features would be something like $\mathrm{N^+}\ \mathrm{D^-}\ \mathrm{nom^-}\ \mathrm{wh^-}$. This means that once which has merged with an NP and has been selected by some other head, checking its selector feature $\mathrm{N^+}$ and its category feature $\mathrm{D^-}$ in the process, it undergoes subject movement via $\mathrm{nom^-}$, and then wh-movement via $\mathrm{wh^-}$.

Crucially, the features of which are inactive until all the features before them have been checked. You can think of this like a dot moving through the feature string from left to right, and the only feature that counts is whatever is immediately to the right of the dot. The steps above would correspond to the following feature string configurations:

$\bullet \mathrm{N^+}\ \mathrm{D^-}\ \mathrm{nom^-}\ \mathrm{wh^-}$: the selector feature $\mathrm{N^+}$ is active and must be checked via Merge
$\mathrm{N^+} \bullet \mathrm{D^-}\ \mathrm{nom^-}\ \mathrm{wh^-}$: the category feature $\mathrm{D^-}$ is active and must be checked via Merge
$\mathrm{N^+}\ \mathrm{D^-} \bullet \mathrm{nom^-}\ \mathrm{wh^-}$: the licensee feature $\mathrm{nom^-}$ is active and must be checked via Move
$\mathrm{N^+}\ \mathrm{D^-}\ \mathrm{nom^-} \bullet \mathrm{wh^-}$: the licensee feature $\mathrm{wh^-}$ is active and must be checked via Move
$\mathrm{N^+}\ \mathrm{D^-}\ \mathrm{nom^-}\ \mathrm{wh^-} \bullet$: all features of which have been checked
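The dot metaphor is easy to simulate. The following Python sketch (the function name and the plain-text feature spelling are mine, not part of any MG implementation) steps the dot through a feature list and reports which single feature is active in each configuration.

```python
def configurations(features):
    """Yield (feature string with dot, active feature) for every dot position."""
    for dot in range(len(features) + 1):
        checked, pending = features[:dot], features[dot:]
        # Only the feature immediately to the right of the dot is active.
        active = pending[0] if pending else None
        yield " ".join(checked + ["•"] + pending), active

# Feature string of which as a subject wh-phrase, as given above.
WHICH = ["N+", "D-", "nom-", "wh-"]

for config, active in configurations(WHICH):
    print(config, "| active:", active)
```

Note that `wh-` only shows up as active in the fourth configuration, after `nom-` has been checked.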

Since features are inactive by default, there is nothing wrong with derivations like the one below, represented once again as an MG dependency tree. The $\mathrm{wh}^-$ on which and what are never active at the same time, so there’s no confusion about how these licensee features are matched up against $\mathrm{wh}^+$ on saw and the C-head.

Dependency tree and tiers for Which witness what saw

But if you look at the wh-tier for this derivation, it does not obey the constraints above. We have a $\mathrm{wh^+}$ without a $\mathrm{wh^-}$ daughter, whereas another $\mathrm{wh^+}$ has two. The tier-based perspective misses the fact that the $\mathrm{wh^-}$ on which only becomes active after $\mathrm{nom^-}$ has been checked. Basically, the position of which on the $\mathrm{wh^-}$ tier should be higher, corresponding to the point in the derivation where a $\mathrm{nom^+}$ checks the $\mathrm{nom^-}$ on which.
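Here is a Python sketch of the tier check, so you can see the failure for yourself. Everything about the encoding is an assumption on my part: the `Node` class, the plain-text feature names, and in particular the two trees, which are my simplified reconstructions of the derivations (with a T head carrying `nom+`, and with subject movement left out of the first tree) rather than copies of the figures.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    # Hypothetical dependency tree node: word, movement features, daughters.
    word: str
    features: tuple = ()
    children: list = field(default_factory=list)

def on_tier(node, f):
    return f + "+" in node.features or f + "-" in node.features

def tier_daughters(node, f):
    """Closest descendants of node that are on the f-tier."""
    found = []
    for child in node.children:
        if on_tier(child, f):
            found.append(child)
        else:
            found.extend(tier_daughters(child, f))
    return found

def violations(root, f):
    """List violations of the two tier constraints on the f-tier."""
    problems = []
    tops = [root] if on_tier(root, f) else tier_daughters(root, f)
    # A mover at the very top of the tier has no mother at all.
    problems += [f"{t.word}: {f}- without {f}+ mother" for t in tops
                 if f + "-" in t.features]
    stack = list(tops)
    while stack:
        node = stack.pop()
        daughters = tier_daughters(node, f)
        movers = [d for d in daughters if f + "-" in d.features]
        if f + "+" in node.features:
            if len(movers) != 1:
                problems.append(f"{node.word}: {f}+ with {len(movers)} {f}- daughters")
        else:
            problems += [f"{d.word}: {f}- without {f}+ mother" for d in movers]
        stack.extend(daughters)
    return problems

# Mary wonders which car Sue bought (wh-features only).
embedded = Node("wonders", (), [
    Node("Mary"),
    Node("C", ("wh+",), [Node("bought", (), [
        Node("Sue"), Node("which", ("wh-",), [Node("car")])])])])

# Which witness what saw, with which carrying nom- and wh-.
multiple = Node("C", ("wh+",), [Node("T", ("nom+",), [Node("saw", ("wh+",), [
    Node("which", ("nom-", "wh-"), [Node("witness")]),
    Node("what", ("wh-",))])])])

print(violations(embedded, "wh"))
print(violations(multiple, "wh"))
print(violations(multiple, "nom"))
```

Under these assumptions the first tree passes, the nom-tier of the second tree passes, and the wh-tier of the second tree fails twice: once for the C-head and once for saw.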

There’s two ways to deal with this. My position so far has been to assume that the grammar is in single movement normal form. This is a bulky term for the simple idea that no head can ever have more than one licensee feature. It’s simply impossible for which to carry both $\mathrm{nom^-}$ and $\mathrm{wh^-}$. That’s an innocent working assumption in the sense that it does not affect weak or strong generative capacity. But it also pushes us farther away from the standard view of syntax, and that’s the very opposite of what I’d like subregular syntax to accomplish.

The other option is to switch to a much more sophisticated tier projection mechanism. It’s not even that hard to define, but it’s not particularly natural from a subregular perspective, and that’s why it doesn’t strike me as a very insightful route to take. So recently I figured, the hell with it, what if the tier-based view of syntax is correct in its current form? What would syntax look like if we don’t have the single movement normal form, but the tier constraints still apply in the same fashion?

Synchronous movement

Remember that I said above that features in MGs get unlocked one after the other. Features tend to spend most of their derivational life inactive, patiently waiting in line until $\bullet$ shows up to tell them it’s their turn. The tier-based view of syntax cannot handle this orderly line, it thinks of each lexical item as a beehive where all features are active at the same time. The $\mathrm{wh^-}$ on which doesn’t give a damn that it isn’t its turn yet, it’s ready to rock right away and it won’t have other $\mathrm{wh^-}$ barge into its territory. In a system that works like this, the derivation above would no longer be allowed: even though which can’t even target the $\mathrm{wh^+}$ on saw, it still won’t let any other phrase move there.

This may sound a little strange to you, but it gets even more bonkers once you look at it in terms of phrase structure trees. Traditionally, we think of it as which moving to the subject position in Spec,TP, and then it undergoes wh-movement from Spec,TP to Spec,CP.

Standard view: Movement is a continuous sequence of steps

In tier-based syntax, which still undergoes subject movement as usual, but it simultaneously also undergoes wh-movement to Spec,CP. Kinda like in Chomsky’s Beyond Explanatory Adequacy. And also a bit like in a multi-dominant syntax, where a mover never truly leaves its base position. The fact that the feature string of which is $\mathrm{N^+}\ \mathrm{D^-}\ \mathrm{nom^-}\ \mathrm{wh^-}$ no longer means that subject movement precedes wh-movement, it only tells us that wh-movement must target a position higher than the one targeted by nom-movement (and that requirement would have to be enforced by some additional mechanism besides movement tiers).

Tier-based view: All movement steps start from the base position

Problems?

So the last few weeks I’ve been trying to come up with clear-cut empirical problems for this approach. I can’t find any.

The standard MG model predicts that it is bad for two wh-phrases to have overlapping wh-movement paths, unless the part where they overlap is actually part of a different movement step like subject movement. That’s not really how syntax seems to work. You usually don’t get overlapping wh-paths, but rather two wh-phrases competing as to which one gets to move at all, while the other has to stay behind (let’s not get into multiple wh-movement here, all I’ll say is that it is entirely unproblematic and perfectly natural from the tier-based perspective).

Then I thought the distinction between A-movement and A$'$-movement might produce a counterexample. Perhaps a phrase must not undergo wh-movement from position X, but once it has undergone some other type of movement to a higher position it can move from there. This also seems to happen a lot with scrambling. But those cases can be reanalyzed as the difference between carrying just $\mathrm{wh^-}$ or $\mathrm{nom^-} \mathrm{wh^-}$ — it’s not the position itself that matters, but rather what kind of mover you are, a simple wh-mover VS a synchronous nom-wh-mover.

I’m mostly talking about wh-movement here, but I don’t think things are much different for topicalization, raising, and so on. I just can’t find a good counterexample. Part of that could be because of a category mismatch between what the movement literature cares about and what I’m looking for. Subjacency and relativized minimality, for example, aren’t directly about movement once you operate under the assumption that everything is encoded via features. They are about how features must be distributed over heads: “no, you can’t have a $\mathrm{wh^-}$, that has to go on this fellow over here because he’s in a structurally more prominent position”. Tier-based syntax, on the other hand, is a model of how movement has to proceed once these features have been distributed over heads. We’re probing part of a larger system, which is difficult unless all parts of the system are precisely nailed down, which they aren’t.

But as I said, maybe this is all just me being really dense. Maybe there is an obvious problem that I’m missing. Maybe there is a clear-cut argument why movement must be thought of as a road trip with several stops on the way, rather than a number of identical packages being sent from the same hub to different destinations.

Logical transductions: Bats, butterflies, and the paradox of an almighty God

2020-09-21T00:00:00-04:00

Since we recently had a post about Engelfriet’s work on transductions and logic, I figured I’d add a short tutorial that combines the two and talks a bit about logical transductions. I won’t touch on concrete linguistic issues in this post, but I will briefly dive into some implications for how MGs push PF and LF directly into “syntax” (deliberate scare quotes). I also have an upcoming post on representations and features that is directly informed by the logical transduction framework. So if you don’t read anything here unless it engages directly with linguistics, you might still want to make an exception this time, even if today’s post is mostly logic and formulas.

Logic for structure

In linguistics, logic is usually kept to its semantic habitat, rarely venturing out into the domain of syntax or phonology. There are some exceptions, like this phonology textbook by Alan Bale and Charles Reiss, but they are, as I said, exceptions. That’s unfortunate because logic is actually a great tool for talking about structures. It’s basically a more rigorous version of the constraint-based formalisms in linguistics (old-school constraint-based, not ranked constraints as in OT).

I’ll explain in a minute how exactly this works, but here is the key insight I want to relate to you right away: from the perspective of mathematical logic, talking about a structure isn’t all that different from talking about a transduction that changes this structure into something else. The common dichotomy between structures on the one hand and processes that manipulate those structures on the other hand just falls apart, like chlorine under sunlight.

But let’s not get too far ahead of ourselves. For now, our focus is still on the more standard case of using logic to talk about structure. Consider the following formula of first-order logic:

\[ \forall x [a(x) \vee b(x) \vee c(x)] \]

Without further context, this states that every $x$ has to satisfy predicate $a$, or predicate $b$, or predicate $c$. But we can interpret it as a claim about graph structures. Our domain of quantification is the nodes of the graph. And we say that $a(x)$ is true iff node $x$ has the label $a$. Given the same interpretation for $b(x)$ and $c(x)$, the formula above is a constraint that requires every node to carry label $a$, or $b$, or $c$, or some combination of those three.

We can tighten this with another constraint to ensure that every node has exactly one label.

\[ \begin{align*} \forall x \big [ & (a(x) \rightarrow \neg b(x) \wedge \neg c(x)) \wedge\\ & (b(x) \rightarrow \neg a(x) \wedge \neg c(x)) \wedge\\ & (c(x) \rightarrow \neg a(x) \wedge \neg b(x)) \big ]\\ \end{align*} \]

Any graph in which both formulas are true is a model of this set of formulas. Right now our logic is still too limited to do anything interesting, so let’s add a bit of machinery to talk about how the nodes in a graph may be arranged.

Every graph defines a reachability relation $R$: $y$ is reachable from $x$ iff there is a sequence of edges that takes us from $x$ to $y$. In syntactic trees, reachability would correspond to (proper) dominance. We enrich our first-order logic with predicate $\triangleleft^+$ such that $x \triangleleft^+ y$ iff $y$ is reachable from $x$. Based on this, we can also define a predicate $\triangleleft$:

\[ x \triangleleft y \Leftrightarrow x \triangleleft^+ y \wedge \neg \exists z [x \triangleleft^+ z \wedge z \triangleleft^+ y] \]

This new predicate $\triangleleft$ only holds between nodes that are reachable via a single edge, rather than a sequence of two or more edges. In syntactic trees, $\triangleleft$ would be the mother-of relation.
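
As a quick sanity check, the definition can be run on a finite graph. The sketch below assumes that the reachability relation is given as a Python set of pairs; the function name `immediate` is made up for this example.

```python
def immediate(reach):
    """Derive the single-edge relation from reachability:
    x is immediately before y iff y is reachable from x
    and no z lies strictly between them."""
    nodes = {n for pair in reach for n in pair}
    return {
        (x, y)
        for (x, y) in reach
        if not any((x, z) in reach and (z, y) in reach for z in nodes)
    }

# Reachability in the chain 0 -> 1 -> 2 -> 3 (assumed example graph).
reach = {(x, y) for x in range(4) for y in range(4) if x < y}

print(sorted(immediate(reach)))  # [(0, 1), (1, 2), (2, 3)]
```

The pair `(0, 2)` is dropped because node `1` sits between the two, which is exactly what the negated existential in the formula rules out.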

Note that $\triangleleft$ and $\triangleleft^+$ do not have equal status in our logic. Whereas $\triangleleft^+$ must be stipulated as a primitive of our first-order language, $\triangleleft$ comes for free as it can be defined with the primitives that are already available in first-order logic:

the universal quantifier $\forall$ and the existential quantifier $\exists$,
the propositional connectives ($\neg$, $\wedge$, $\vee$, $\rightarrow$, $\leftrightarrow$),
the binary predicate $\triangleleft^+$ for the reachability relation,
three unary predicates $a$, $b$, and $c$ that we interpret as node labels.

So $\triangleleft$ is more like a LaTeX macro in that we can treat it as syntactic sugar to save us some typing. Wherever you see a statement of the form $x \triangleleft y$ in this post, you can substitute the right-hand side of the formula above:

\[x \triangleleft^+ y \wedge \neg \exists z [x \triangleleft^+ z \wedge z \triangleleft^+ y]\]

Doing so yields a first-order formula that only uses the primitives listed above. The difference between a primitive and a “macro” may seem overly pedantic, but it will be really important once we start talking about transductions.

Okay, so now that we have $\triangleleft^+$, and by extension $\triangleleft$, what are we gonna use it for? Well, for instance, we can require every well-formed graph to be a string.

\[ \begin{align*} \forall x,y,z \big [ & (x \triangleleft y \wedge x \triangleleft z \rightarrow y = z) \wedge\\ & (x \triangleleft z \wedge y \triangleleft z \rightarrow x = y) \wedge\\ & (x \triangleleft y \rightarrow \neg (x = y)) \big ]\\ \end{align*} \]

This says that

no node can have more than one outgoing edge (“if $x$ is related to $y$ and $z$, then $y$ and $z$ are the same node”), and
no node can have more than one incoming edge (“if $x$ and $y$ are both related to $z$, then $x$ and $y$ are the same node”), and
no node can be related to itself (“if $x$ is related to $y$, then $x$ and $y$ must be distinct nodes”).
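
Here is a brute-force rendering of these three conditions in Python. It is only a sketch under simplifying assumptions: the single-edge relation is passed in directly as a set of pairs called `child` instead of being derived from reachability, and the name `satisfies_string_axioms` is hypothetical.

```python
from itertools import product

def satisfies_string_axioms(nodes, child):
    """Check the three conjuncts for every choice of x, y, z."""
    for x, y, z in product(nodes, repeat=3):
        # no node has more than one outgoing edge
        if (x, y) in child and (x, z) in child and y != z:
            return False
        # no node has more than one incoming edge
        if (x, z) in child and (y, z) in child and x != y:
            return False
        # no node is related to itself
        if (x, y) in child and x == y:
            return False
    return True

print(satisfies_string_axioms({0, 1, 2}, {(0, 1), (1, 2)}))  # True
print(satisfies_string_axioms({0, 1, 2}, {(0, 1), (0, 2)}))  # False
```

The second call fails because node `0` has two outgoing edges, so the graph branches like a tree rather than forming a string.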

And now we could add yet another constraint so that the strings must consist of 0 or more iterations of abc, e.g. abcabcabc or the empty string $\varepsilon$. To do this, we once again throw in some syntactic sugar:

\[ \mathrm{first}(x) \Leftrightarrow \neg \exists y [y \triangleleft x] \]

\[ \mathrm{last}(x) \Leftrightarrow \neg \exists y [x \triangleleft y] \]

These predicates check whether a node is the left edge or right edge of the string, which is the same as saying that the node has no predecessor or successor, respectively. With these two additional predicates at our disposal, we now formulate the constraint that limits the strings to iterations of abc.

\[ \begin{align*} \forall x \big [ & (\mathrm{first}(x) \rightarrow a(x)) \wedge\\ & (\mathrm{last}(x) \rightarrow c(x)) \wedge\\ & (a(x) \rightarrow \exists y [x \triangleleft y \wedge b(y)]) \wedge\\ & (b(x) \rightarrow \exists y [x \triangleleft y \wedge c(y)]) \wedge\\ & (c(x) \wedge \neg \mathrm{last}(x) \rightarrow \exists y [x \triangleleft y \wedge a(y)]) \big ]\\ \end{align*} \]

In plain English: the left edge must be a (if it exists), the right edge must be c (if it exists), a must be followed by b, b must be followed by c, and non-final c must be followed by a.
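
The five conjuncts translate almost line by line into code. The following sketch assumes that a string is a Python `str` whose positions are the nodes, so that the successor of position `x` is simply `x + 1`; the function name `is_abc_iteration` is invented for illustration.

```python
def is_abc_iteration(word):
    """Check the constraint at every position x of the string."""
    n = len(word)
    for x in range(n):
        is_first = x == 0
        is_last = x == n - 1
        successor = None if is_last else word[x + 1]
        if is_first and word[x] != "a":
            return False  # the left edge must be a
        if is_last and word[x] != "c":
            return False  # the right edge must be c
        if word[x] == "a" and successor != "b":
            return False  # a must be followed by b
        if word[x] == "b" and successor != "c":
            return False  # b must be followed by c
        if word[x] == "c" and not is_last and successor != "a":
            return False  # non-final c must be followed by a
    return True

print(is_abc_iteration("abcabcabc"))  # True
print(is_abc_iteration(""))           # True
print(is_abc_iteration("abca"))       # False
```

The empty string passes because a universally quantified formula is vacuously true when there are no nodes to check.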

Hopefully you’ll agree with me that this is all fairly intuitive, even if first-order logic isn’t the most succinct description language around. Putting aside cumbersome notation, it’s really just the familiar system of defining well-formed structures in terms of a finite collection of inviolable constraints. The only tweak is that the constraints are expressed in logic instead of plain English or some other metalanguage. This system is very flexible in that it allows us to talk about arbitrary graph structures, be they strings, trees, multi-dominance trees, DAGs, or more general types of graphs. In fact, my focus on strings in this post is really just a matter of keeping the exposition simple; the first linguistic applications of this approach were all about trees (Blackburn, Gardent, and Meyer-Viol 1993; Backofen, Rogers, and Vijay-Shanker 1995; Cornell and Rogers 1998; Rogers 1998), and strings have only recently become more of a focus as part of the computational conquest of the phonological hinterland.

Transductions as parasitic structure

Alright, now we have a basic understanding of how first-order logic (or rather, any logic) can be used to talk about structure. As our concrete example, we have a bunch of first-order constraints whose models are graphs that form strings of the form $\mathit{abc}^*$ (0 or more iterations of $\mathit{abc}$). Now let’s see how we go from this to using logic to talk about transductions, i.e. transformations of an input structure into some output structure.

Let’s consider a particular graph, say, the one for abcabcabcabc. Here’s a depiction of this example that emphasizes the graph-based view of this string by

arranging the nodes in a plane instead of the usual left-to-right order for strings, and
listing each node with its numerical index; the label is attached in the top left corner to highlight that it is a specific predicate that holds of the node with that index.

The string abcabcabcabc viewed through the lens of logic

If you follow along the edges, you’ll still get the string abcabcabcabc, so this is really just a matter of presentation, not content.

Now remember that we can also define all kinds of syntactic sugar or macros, like $\mathrm{first}$, $\mathrm{last}$, or $\triangleleft$. Let me show you a very different kind of predicate we could have defined. Spoilers: although this predicate only piggybacks on our existing structure, it essentially defines a new graph — the output of a specific transduction.

First, we will need a predicate that connects the closest nodes that share a specific label. Again we do not need to add this predicate to our language, it’s just a convenient shorthand for a more complex formula.

\[ x \prec_a y \Leftrightarrow x \triangleleft^+ y \wedge a(x) \wedge a(y) \wedge \neg \exists z [x \triangleleft^+ z \wedge z \triangleleft^+ y \wedge a(z)] \]

In our example string abcabcabcabc, it holds that $0 \prec_a 3$ as both $0$ and $3$ are labeled $a$, and $3$ is the closest node labeled $a$ that is reachable from $0$. On the other hand, $0 \prec_a 5$ would be false because $5$ is not labeled $a$, $1 \prec_a 3$ would be false because $1$ is not labeled $a$, and $0 \prec_a 6$ is false because $3$ occurs between the two and is also labeled $a$. We define analogous predicates $\prec_b$ and $\prec_c$ for labels $b$ and $c$, respectively.
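
These four judgments are easy to verify mechanically. The sketch below assumes that nodes are string positions numbered from 0 and that reachability is just the order `x < y` on positions; `prec` is a hypothetical name that takes the label as an extra argument instead of a subscript.

```python
def prec(word, label, x, y):
    """y is the closest node after x that shares the given label with x."""
    return (
        x < y
        and word[x] == label
        and word[y] == label
        and not any(word[z] == label for z in range(x + 1, y))
    )

word = "abcabcabcabc"

print(prec(word, "a", 0, 3))  # True
print(prec(word, "a", 0, 5))  # False: node 5 is labeled c
print(prec(word, "a", 1, 3))  # False: node 1 is labeled b
print(prec(word, "a", 0, 6))  # False: node 3 intervenes
```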

Next, we will also relativize $\mathrm{first}$ and $\mathrm{last}$ to labels so that they pick out the first and last node with a specific label, respectively. Here is what this looks like for $a$:

\[ \mathrm{first}_a(x) \Leftrightarrow a(x) \wedge \neg \exists y [y \prec_a x] \]

\[ \mathrm{last}_a(x) \Leftrightarrow a(x) \wedge \neg \exists y [x \prec_a y] \]

As you can see, that’s almost the same formula as for $\mathrm{first}$ and $\mathrm{last}$, except that we have replaced $\triangleleft$ with $\prec_a$ and added the conjunct $a(x)$. The conjunct is needed because $\prec_a$ never holds of a node that isn’t labeled $a$, so without it every such node would vacuously count as both first and last. Since $\prec_a$ can be defined in terms of our logical primitives, this is a-okay, we haven’t expanded the language in any way. Okay, all the preparatory work has been done, time to move on to the main course:

\[ \begin{align*} x \blacktriangleleft y \Leftrightarrow & x \prec_a y \vee\\ & x \prec_b y \vee\\ & x \prec_c y \vee\\ & (\mathrm{last}_a(x) \wedge \mathrm{first}_b(y)) \vee\\ & (\mathrm{last}_b(x) \wedge \mathrm{first}_c(y))\\ \end{align*} \]

What does this do? Well, this is one of those cases where a picture says more than a thousand words.

The relation $\blacktriangleleft$ in the string abcabcabcabc

The predicate $\blacktriangleleft$ has added a new order between nodes, depicted here as dashed blue arrows. Instead of a sequence of triples abc, this order first connects all instances of a, followed by all bs, and finally all cs. If we move through the graph using $\blacktriangleleft$ instead of $\triangleleft$, we get aaaabbbbcccc instead of abcabcabcabc.
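
If you would rather run this than trace arrows in a figure, here is a minimal Python sketch. It rests on several assumptions: nodes are string positions, reachability is `x < y`, and the input is of the form abc repeated some number of times. The relativized first and last predicates explicitly check that the node itself carries the label, in line with the description of them as picking out the first and last node with a specific label. All function names (`prec`, `first`, `last`, `black`, `output_yield`) are hypothetical.

```python
def prec(word, label, x, y):
    """y is the closest node after x that shares the given label with x."""
    return (x < y and word[x] == label and word[y] == label
            and not any(word[z] == label for z in range(x + 1, y)))

def first(word, label, x):
    """x carries the label and no earlier node does."""
    return word[x] == label and not any(
        prec(word, label, y, x) for y in range(len(word)))

def last(word, label, x):
    """x carries the label and no later node does."""
    return word[x] == label and not any(
        prec(word, label, x, y) for y in range(len(word)))

def black(word, x, y):
    """The new order: same-label neighbors, plus the two bridges
    from the last a to the first b and from the last b to the first c."""
    return (any(prec(word, label, x, y) for label in "abc")
            or (last(word, "a", x) and first(word, "b", y))
            or (last(word, "b", x) and first(word, "c", y)))

def output_yield(word):
    """Read off the labels by following the new order from its start node."""
    n = len(word)
    edges = {x: y for x in range(n) for y in range(n) if black(word, x, y)}
    targets = set(edges.values())
    starts = [x for x in range(n) if x not in targets]
    node = starts[0] if starts else None
    out = []
    while node is not None:
        out.append(word[node])
        node = edges.get(node)
    return "".join(out)

print(output_yield("abcabcabcabc"))  # aaaabbbbcccc
```

Nothing in `black` creates or relabels nodes. It only inspects positions and labels that are already there, which is the sense in which the new order piggybacks on the existing structure.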

Since the figure is a little cluttered, we’ll clear things up a bit by drawing two separate instances of the graph, one using $\triangleleft$ and the other $\blacktriangleleft$.

Separating the graph into two distinct structures

But wait a second! What if this isn’t just a more convenient presentation, what if we attach a meaning to this? What if the top figure is the input structure, and the bottom figure is the output structure? Then we have just carried out a transduction. From this perspective, our definition of $\blacktriangleleft$ isn’t just syntactic sugar, it is a transformation that rearranges the nodes in the structure — in combination with the constraints we put in place earlier, the formula for $\blacktriangleleft$ defines a mapping from strings of the form $\mathit{abc}^n$ to strings of the form $a^n b^n c^n$!

The first time I came across this idea — in Morawietz (2003), which builds on Mönnich (1999) and subsequent work — it blew my mind. Strictly speaking, all we have is a single structure with multiple relations defined over it. But if we attach a specific interpretation to this, treating some as input relations and others as output relations, we get a transduction. Ignore $\blacktriangleleft$ and you get an input structure, ignore $\triangleleft$ and you get an output structure. We are translating a static structure into a dynamic process. This is the key insight behind logical transductions: inputs and outputs are just specific ways of interpreting a single structure, and the difference between the two is what we analyze as a transformation.

The paradoxical power of interpretation

This static picture of transductions might strike you as very unintuitive, but oh boy, you haven’t seen yet just how unintuitive it can get. Here’s three mathematical facts for you to chew on:

A stringset can be defined in first-order logic with $\triangleleft^+$ iff it is star-free.
The stringset $a^n b^n c^n$ is mildly context-sensitive.
The star-free stringsets are a proper subclass of the mildly context-sensitive stringsets.

These three things do not seem to fit together. We just used first-order logic to define all graphs that produce strings of the form $a^n b^n c^n$. But this should be impossible because first-order logic can only produce star-free stringsets, and $a^n b^n c^n$ is not star-free. This is a paradoxical situation. It reminds me of the paradox of an almighty God: God can be almighty only if (s)he can create a rock even (s)he cannot lift, but then (s)he cannot be almighty. The only way to resolve it is to say that God is so almighty (s)he can do things even (s)he cannot do. Just like first-order logic apparently can do things first-order logic cannot do.

The paradox of first-order logic can be resolved, though, and the resolution is to be more careful in what we do with our representations. The important thing to realize is that our collection of first-order formulas didn’t actually define the stringset $a^n b^n c^n$, it defined graph structures with multiple relations over them, including $\triangleleft$ and $\blacktriangleleft$. In order to get $a^n b^n c^n$ from this collection of graph structures, we have to use a yield function that maps the graph to a specific string. In the case of $a^n b^n c^n$, the yield function uses $\blacktriangleleft$, which is actually just syntactic sugar for an intricate collection of formulas using $\triangleleft^+$. The yield function converts $\blacktriangleleft$ from a parasitic relation defined over $\triangleleft^+$ into a structural primitive, and that conversion doesn’t come for free. The additional computation that goes into this is what allows us to go beyond the star-free boundary into the realm of mild context-sensitivity.

If you still find this confusing, here’s an analogy. Clearly a lot of computation has to go into assigning a tree structure to a string, that’s why parsing is hard. But the opposite is true, too: flattening a tree into a string doesn’t come for free. It often seems trivial to us because phrase structure trees make it very easy to compute the string, but yield mappings can be a lot more complex (the mapping from MG derivation trees to strings, for instance). If you want to flatten a graph into a string, there’s a lot of computation to be done. Our first-order logic doesn’t address the computations that go into the yield function. The computation is hidden away in the interpretative step of considering only some relations for the output structure. This process takes effort, just like I had to do some work to split the figure with both $\triangleleft$ and $\blacktriangleleft$ into two figures with distinct relations.

Interfaces in static syntax: Bats and butterflies

As you might’ve guessed, I’m a big fan of the logical view of transductions, and I have used it a lot in my work on MGs. One thing that sometimes comes up in connection with MGs is how they handle stuff that happens after syntax, at PF and LF. My preferred answer is that they don’t, because there’s nothing special to handle there: It’s syntax all the way down. When linguists talk about interfaces, in my mind they are actually talking about specific ways of interpreting a single structure that already encodes all necessary information. It’s like syntax produces a Rorschach test image and depending on whether we look at it with PF goggles or LF goggles we see a bat or a butterfly. It’s all just a matter of interpretation.

Think about it. In standard Minimalist syntax, we assume that there are pipelines that take some syntactic structure as input and then rewrite it into something the interfaces can work with. PF is one pipeline (or the end-point of that pipeline), LF another. So essentially we are dealing with two transductions that take a syntactic structure as input and produce some output structure for interface consumption. But what does that mean from the logical perspective? It means that we are defining ancillary predicates over our syntactic structure, some of which encode LF information, others PF information. All we have is a single static representation with all kinds of relations defined over it. Some relations are shared between the interfaces, others only one interface pays attention to. Now we can make an ontological split and say “hey, relations X, Y, and Z are ‘syntax’, and the rest is post-syntactic”, but from the perspective of logical transductions there’s no bite to it. You have primitive relations, you have parasitic relations that piggyback on them, and you have tons of different yield functions to choose from, some of which are empirically viable and others are not. The formalism defines the static structure with all relevant relations, including the parasitic ones. There is no formal split between syntax on the one hand and interfaces on the other; the action all happens in the syntactic formalism and the interfaces are just yield functions that filter out some parts of the structure (did I mention that this dovetails nicely with the bimorphism perspective of the T-model?). As I said, it’s syntax all the way down.

Now, none of that means that the more traditional view in terms of input-output mappings is wrong. Quite to the contrary: they are both right! Dynamic cascades of transformations VS a single static representation, they are different views of the same system. Asking which one is really true is like asking whether light is a wave or a particle.

But don’t we have an encapsulation problem, then? If it’s all in a single structure, can’t everything be sensitive to everything, LF to PF, syntax to PF, and so on? Yes, one thing the logical perspective shows us is that representational encapsulation doesn’t get the work done. And I think that’s one of several ways how representations lead us astray in our thinking. My hunch is that the answer lies in computation, not representation… more on that next time.

References

Backofen, Rolf, James Rogers, and K. Vijay-Shanker. 1995. A first-order axiomatization of the theory of finite trees. Journal of Logic, Language and Information 4.5–39. doi:10.1007/BF01048403.

Blackburn, Patrick, Claire Gardent, and Wilfried Meyer-Viol. 1993. Talking about trees. Proceedings of the sixth conference of the European chapter of the Association for Computational Linguistics. doi:10.3115/976744.976748.

Cornell, Thomas, and James Rogers. 1998. Model theoretic syntax. The Glot international state of the article book. Studies in generative grammar 48. Mouton de Gruyter. doi:10.1515/9783110822861.171.

Morawietz, Frank. 2003. Two-step approaches to natural language formalisms. Berlin: Walter de Gruyter. doi:10.1515/9783110197259.

Mönnich, Uwe. 1999. On cloning context-freeness. Mathematics of syntactic structure, ed. by Hans-Peter Kolb and Uwe Mönnich, 195–231. Berlin: Walter de Gruyter.

Rogers, James. 1998. A descriptive approach to language-theoretic complexity. Stanford: CSLI.

A tribute to Joost Engelfriet

2020-09-02T00:00:00-04:00

Ed Stabler sent me a link to the most recent paper by Joost Engelfriet, which concludes with the following message:

That’s all folks! This was my last paper. Thank you, dear reader, and farewell.

That’s bitter-sweet. On the one hand, I admire that he can draw a line in the sand like this. On the other hand, I wish he’d erase that line and keep going for a few more years. Even though Engelfriet isn’t a mathematical linguist — and might not even be aware of the more linguistic side of that field, the one that we serve here at the Outdex Café — he has had a profound influence on the field, including a lot of my own work.

I’ve never met Joost Engelfriet, I only know him through his papers. But those are some impressive papers. They have been my go-to source for anything related to tree transductions. And if you’re a regular reader of this blog, you know that this isn’t just some mathematical curiosity, tree transductions are the lens that allows us to make sense of syntax. Movement is a tree transduction. The bimorphism perspective of the inverted T-model involves tree transductions. The derivation tree perspective of MGs wouldn’t be possible without tree transductions, and that, in turn, means that we would have missed the subregular nature of syntax. Engelfriet wasn’t the first to work on tree transductions — that honor goes to James Thatcher and William Rounds (who cites Transformational Grammar as a direct inspiration for tree transductions). Engelfriet was about 5 years late to the party, yet he’s shaped the field more than anybody else. I think it’s actually impossible to find a paper on tree transductions that doesn’t cite Engelfriet. He’s that integral to the field; Engelfriet is the Chomsky of tree transductions.

Engelfriet has also worked hard on connecting the automata-theoretic view of tree transductions to mathematical logic, in particular monadic second-order logic (MSO). And because tree transductions apparently aren’t complicated enough for him, he also teamed up with Bruno Courcelle to produce the definitive reference on MSO over graphs. And again this isn’t just some pointless mathematical exercise, MSO has been an integral part of mathematical linguistics since the mid 90s, pioneered by Jim Rogers, Thomas Cornell, and Uwe Mönnich, among others. It allowed us to study derivational formalisms as if they were representational formalisms, and that, too, was an important stepping stone for me towards subregular syntax.

Almost every paper I have written incorporates work by Engelfriet in one way or another. And it’s not just that his research is essential to my work and the field at large, his papers are also fun. I won’t pretend that they’re easy reads, at least they’re not for me. Tree transductions are much more complex than string transductions, and topics like tree transducer decompositions or the equivalence of MSO-definable tree transductions and macro tree transductions of linear size increase are not for the faint of heart. But an Engelfriet paper never feels like it is any more complex than it has to be. It’s the same feeling of awe that I experience when watching somebody speedrun Super Metroid: I don’t quite get everything they’re doing, and I certainly can’t do it myself, but I can recognize the beauty in it.

Engelfriet’s work is also like Super Metroid speedruns in that many Outdex readers were probably unaware of its existence. Well, now you’re in the know. And if you now want to become a true Engelfriet aficionado, here’s a list of papers (and one book) that were essential reading for me:

Engelfriet (1975): Back in the days when tree transductions were the wild west, this paper established the key differences between bottom-up and top-down tree transductions that all modern work builds on.
Engelfriet (1977): Another classic, this one introduced the notion of regular look-ahead for top-down tree transductions; look-ahead is now a fundamental parameter of tree transducers. I’ve been thinking about look-ahead a lot recently in an attempt to expand sensing tree automata into transducers for movement. Don’t get your hopes up, though, it’s not ready for a post yet.
Engelfriet and Vogler (1985): The arrival of macro tree transducers. Every kind of tree transduction that you’ll come across in computational syntax is equivalent to some constrained macro tree transducer. Once you understand macro tree transducers, everything else is just a special case of a very general top-down device.
Engelfriet and Hoogeboom (2001): Engelfriet isn’t just all about trees and graphs, he also did some work on string transductions. This paper shows how two-way finite state transducers over strings are related to MSO string transductions, and this result is now spreading in the subregular phonology community.
Engelfriet and Maneth (2003): The paper that establishes the lovely result I mentioned above: MSO tree-to-tree transductions are equivalent to macro tree transductions of linear size increase. This is an important bridge that allows us to connect the logical view to the automata-theoretic one. Both are used in mathematical linguistics, e.g. in work on MGs, so it’s good to have some way of translating between them, even if we still need to learn a lot more about how the weaker subclasses relate to each other.
Engelfriet, Lilin, and Maletti (2009): While this paper is focused on a particular type of tree transductions (extended multi bottom-up tree transducers), it gives a good idea of what the modern tree transducer landscape looks like and what the advantages and shortcomings of various transduction types are. I don’t know how many times I’ve looked up Table 1 and Figure 5 in this paper.
Maletti and Engelfriet (2012): Another paper that’s not directly about tree transductions, but is more closely aligned with the interests of computational linguists. This is a follow-up to a paper by Marco Kuhlmann and Giorgio Satta where they show that TAGs are not closed under (strong) lexicalization. Kuhlmann & Satta left open what kind of grammar formalism would be needed to strongly lexicalize TAG, and this paper provides the answer: simple context-free tree grammars. This one stings a bit every time I see it because I was working on a reply to Kuhlmann & Satta, but Maletti & Engelfriet were faster and had a much better result.
Courcelle and Engelfriet (2012): The massive tome on MSO graph transductions that I mentioned earlier. No, I haven’t read the whole thing. Yes, I do keep coming back to it.
Engelfriet, Fülöp, and Maletti (2017): Tree transducers break down into two macro-classes, the top-down transducers and the bottom-up transducers. Standard bottom-up transducers don’t cut it for many tasks, including handling movement, so nowadays a lot of the attention goes to multi bottom-up transducers. Top-down transducers are also very limited, prompting the introduction of linear extended top-down tree transducers (also known as synchronous tree substitution grammars — because transductions can be regarded as grammars, too). But extended top-down tree transducers aren’t closed under composition, which means that a cascade of these transductions can do things a single extended top-down tree transducer cannot do. This raises the question how powerful these cascades are, and this paper provides the answer.
Engelfriet, Maletti, and Maneth (2018): This paper studies multiple context-free tree grammars, which are closely related to (set-local) multi-component TAG and multiple context-free string grammars. As you might know, the latter two are weakly equivalent to MGs. I haven’t fully absorbed this paper yet, but I have long wondered if my translation from standard TAG to MGs with lowering could be extended to multi-component TAG. With a bit more time, I might find the answer in this paper.

There’s many other papers I could’ve listed here. Feel free to link to your favorite in the comments. Happy reading everyone!

References (The Engelfriet Paperpalooza)

Courcelle, Bruno, and Joost Engelfriet. 2012. Graph structure and monadic second-order logic: A language-theoretic approach. Cambridge, UK: Cambridge University Press.

Engelfriet, Joost. 1975. Bottom-up and top-down tree transformations — a comparison. Mathematical Systems Theory 9.198–231. doi:10.1007/BF01704020.

Engelfriet, Joost. 1977. Top-down tree transducers with regular look-ahead. Theory of Computing Systems 10.289–303. doi:10.1007/BF01683280.

Engelfriet, Joost, Zoltán Fülöp, and Andreas Maletti. 2017. Composition closure of linear extended top-down tree transducers. Theory of Computing Systems 60.129–171. doi:10.1007/s00224-015-9660-2.

Engelfriet, Joost, and Hendrik Jan Hoogeboom. 2001. MSO definable string transductions and two-way finite-state transducers. ACM Transactions on Computational Logic 2.216–254. doi:10.1145/371316.371512.

Engelfriet, Joost, Eric Lilin, and Andreas Maletti. 2009. Extended multi bottom-up tree transducers. Acta Informatica 46. doi:10.1007/s00236-009-0105-8.

Engelfriet, Joost, Andreas Maletti, and Sebastian Maneth. 2018. Multiple context-free tree grammars: Lexicalization and characterization. Theoretical Computer Science 728.29–99. doi:10.1016/j.tcs.2018.03.014.

Engelfriet, Joost, and Sebastian Maneth. 2003. Macro tree translations of linear size increase are MSO definable. SIAM Journal on Computing 32.950–1006. doi:10.1137/S0097539701394511.

Engelfriet, Joost, and Heiko Vogler. 1985. Macro tree transducers. Journal of Computer and System Sciences 31.71–146.

Maletti, Andreas, and Joost Engelfriet. 2012. Strong lexicalization of tree adjoining grammars. Proceedings of the 50th annual meeting of the Association for Computational Linguistics (ACL 2012). https://www.aclweb.org/anthology/P12-1053.