[Esta página en español]
Sobresaliente Cum Laude
Tree adjoining grammars are an extension of context-free grammars that use trees instead of productions as the primary representing structure and that are considered to be adequate to describe most of syntactic phenomena occurring in natural languages. These grammars generate the class of tree adjoining languages, which is equivalent to the class of languages generated by linear indexed grammars and other mildly context-sensitive formalisms.
In the first part of this dissertation, we introduce the problem of parsing tree adjoining grammars and linear indexed grammars, creating, for both formalisms, a continuum from simple pure bottom-up algorithms to complex predictive algorithms and showing what transformations must be applied to each one in order to obtain the next one in the continuum.
In the second part, we define several models of automata that accept the class of tree adjoining languages, proposing techniques for their efficient execution. The use of automata for parsing is interesting because they allow us to separate the problem of the definition of parsing algorithms from the problem of their execution. We have considered the following types of automata:
Compilation schemata for these models of automata have been defined. A compilation schema allow us to obtain the set of transitions corresponding to the implementation of a parsing strategy for a given grammar.
All the presented automata can be executed in polynomial time with respect to the length of the input string by applying tabulation techniques. A tabular technique makes possible to interpret an automaton by means of the manipulation of collapsed representation of configurations (called items) instead of actual configurations. Items are stored into a table in order to be reused, avoiding redundant computations.
Finally, we have studied the relations among the different classes of automata, the main difference being the storage structure used: embedded stacks, indices lists or coupled stacks. According to the strategies that can be implemented, we can distinguish three kinds of automata: bottom-up automata, including bottom-up embedded push-down automata, bottom-up restricted logic push-down automata, right-oriented linear indexed automata and bottom-up 2-stack automata; top-down automata, including (top-down) embedded push-down automata, top-down restricted logic push-down automata and left-oriented linear indexed automata; and general automata, including strongly-driven linear indexed automata and strongly-driven 2-stack automata.
The dissertation is structured in three parts. The first part is devoted to introduce Tree Adjoining Languages (TAL) and the parsing techniques available for them, focusing on Tree Adjoining Grammars (TAG) and Linear Indexed Grammars (LIG). The second part defines several classes of automata for tree adjoining languages and the tabular techniques that can applied in order to get polynomial execution time. The third part is formed by a set of appendices.
We present this class of languages, focusing on TAG and LIG. Other grammar formalisms also generating TAL are presented briefly.
We describe several tabular algorithms for Tree Adjoining Grammar parsing, creating a continuum from simple pure bottom-up algorithms to complex predictive algorithms and showing what transformations must be applied to each one in order to obtain the next one in the continuum.
We develop a set of new tabular parsing algorithms for Linear Indexed Grammars, including bottom-up algorithms and Earley-like algorithms with and without the valid prefix property, creating a continuum in which one algorithm can in turn be derived from another. The output of these algorithms is a shared forest in the form of a context-free grammar that encodes all possible derivations for a given input string.
We present push-down automata (PDA) as an operational mechanism for context-free grammars (CFG), showing compilation schemata from CFG into PDA and the tabulation techniques proposed by Lang (Based on PUSH transitions) and Nederhof (based on SWAP transitions) for this class of automata.
We present the classical definition of Embedded Push-Down Automata (EPDA) to continue with a redefinition of EPDA in which the finite-state control has been eliminated and several kinds of normalized transition have been defined. We also show that the new definition preserves the equivalence with tree adjoining languages and that tabulation techniques are possible to execute automata in polynomial time.
The class of Bottom-up Embedded Push-Down Automata (BEPDA) is the dual of EPDA. They have been informally present in several papers but a formal and consistent definition has not been published. We provide a formal definition of BEPDA and we show that finite-state control can be eliminated, obtaining a new definition in which transitions are in a form useful to describe compilation schemata for TAG and suitable for tabulation.
Based on the close relationship between LIG and DCG, we try to restrict Logic Push-down Automata (LPDA) to make it adequate to describe parsing strategies TAL: we have described parsing strategies for LIG and TAG into LPDA, we have identified the types of transition involved and we have designed tabulation techniques based on those types of transitions.
This class of automata defined by Nederhof includes two classes of automata: Right-oriented Linear Indexed Automata (R-LIA) to describe parsing strategies for LIG in which the indices lists are recognized bottom-up and Left-oriented Linear Indexed Automata (L-LIA) to describe parsing strategies for LIG in which the indices lists are recognized bottom-up. We show the equivalence of R-LIA with a subclass of the Restricted Logical Push-down Automata presented in chapter 8 and we restrict L-LIA transitions in order to be able to design a tabulation technique less complicated than the technique proposed by Nederhof. We also propose a third class of automata, Strongly-Driven Linear Indexed Automata (SD-LIA), which can be used to describe bottom-up, top-down or mixed parsing strategies for LIG with respect to the recognition of the indices lists. SD-LIA seems to be equivalent to the recently presented Universal-Linear Indexed Automata (U-LIA).
Push-down automata working on two stacks has been identified as an operational device for parsing TAL. However, it is known that unrestricted use of two stacks give us the power of a Turing Machine. The remedy is to consider asymmetric stacks, one being the Master Stack where most of the work is done and the other being the Auxiliary Stack mainly used for restricted bookkeeping. The resulting Strongly-driven 2-Stack Automata (SD-2SA) can be used to describe bottom-up, top-down or mixed parsing strategies for TAG and LIG with respect to the recognition of the adjunctions or indices lists. A tabular technique is designed to allow SD-2SA be executed in polynomial time. We also define a version of SD-2SA specifically designed for bottom-up parsing strategies: Bottom-up 2-Stack Automata (BU-2SA).
There exists a close relation among the different classes of automata for TAL, the main difference being the storage structure used: embedded stacks, indices lists or coupled stacks. According to the strategies that cen be implemented, we can distinguish three kinds of automata:
A brief introduction to Parsing Schemata, the framework used to describe the parsing algorithms in chapters 3 and 4.
A brief introduction to CYK and Earley's algorithm for context-free grammars. These algorithms are the base of the algorithms for TAG and LIG shown in chapters 3 and 4.
We show how LR parsers for the analysis of arbitrary context-free grammars can be derived from classical Earley's parsing algorithm. The result is a Generalized LR parsing algorithm working at complexity O(n3) in the worst case, which is achieved by the use of tabulation to represent the non-deterministic evolution of the stack instead of graph-structured stack representations, as has often been the case in previous approaches. Then, this algorithms is extended to deal with Definite Clause Grammars and Linear Indexed Grammars.
Note: the dissertation is written in Spanish. If you are not fluent in Spanish, you can read some of the publications originated from the thesis.
Comments and suggestions about this dissertation are welcome. You can contact the author in the address
Miguel A. Alonso Pardo Departamento de Computación Facultad de Informática Campus de Elviña s/n 15071 La Coruña (SPAIN)
or by an e-mail addressed to firstname.lastname@example.org.
The web pages of the author contain additional information about the dissertation and related work. The URL is http://www.dc.fi.udc.es/~alonso/