Syntax analysis is a primary component of most natural language processing systems. Successful syntax analysis requires an effective, broad-coverage grammar. Because the manual acquisition of such a grammar is known to be very difficult and time-consuming, there have been many efforts to learn a grammar automatically from a large text corpus. Most efforts so far have aimed at learning phrase structure grammars and have been only partially successful. However, languages with a (partially) free word order, such as Korean, are known to be better described by dependency grammar than by phrase structure grammar, yet efforts toward the automatic acquisition of dependency grammars are rare.
This thesis presents a mechanism for the corpus-based automatic induction of a probabilistic dependency grammar (PDG). The key part of the work is a probabilistic parameter reestimation algorithm for dependency grammar. The inside-outside algorithm is widely used for reestimating the probabilistic parameters of context-free phrase structure grammars, but it is not directly applicable to PDG, which is based not on constituents but on dependency relations between arbitrary pairs of words. This thesis presents a reestimation algorithm that is a variation of the inside-outside algorithm adapted to PDG. Two non-constituent objects, the complete-link and the complete-sequence, are defined as the basic units of dependency structure, and inside and outside probabilities are derived for them. The algorithm reestimates the probabilities of inter-word dependencies by computing the inside and outside probabilities of these non-constituent objects. It utilizes a CYK-style chart and has O($n^3$) time complexity, where $n$ is the number of words in the sentence to be processed.
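To make the chart computation concrete, the following is a minimal sketch of the inside pass over complete-links and complete-sequences, assuming a simplified model in which a structure's probability factors into individual word-to-word dependency probabilities; the callable `dep_prob(head, dependent)` is a hypothetical stand-in for the grammar's dependency probability table.

```python
from collections import defaultdict

def inside_probabilities(words, dep_prob):
    """Inside probabilities of complete-links and complete-sequences over a
    CYK-style chart in O(n^3) time. `dep_prob(h, d)` is a hypothetical
    callable returning P(word d directly depends on word h)."""
    n = len(words)
    # cl[i, j, 'R']: complete-link over words i..j with a link i -> j
    # cl[i, j, 'L']: complete-link over words i..j with a link j -> i
    # cs[i, j, 'R']: complete-sequence over i..j headed at the left end i
    # cs[i, j, 'L']: complete-sequence over i..j headed at the right end j
    cl, cs = defaultdict(float), defaultdict(float)
    for i in range(n):                      # single words are trivial sequences
        cs[i, i, 'R'] = cs[i, i, 'L'] = 1.0
    for width in range(1, n):               # bottom-up over span widths
        for i in range(n - width):
            j = i + width
            for k in range(i, j):           # a link i<->j covers two inner sequences
                inner = cs[i, k, 'R'] * cs[k + 1, j, 'L']
                cl[i, j, 'R'] += dep_prob(words[i], words[j]) * inner
                cl[i, j, 'L'] += dep_prob(words[j], words[i]) * inner
            for k in range(i + 1, j + 1):   # grow a left-headed sequence rightward
                cs[i, j, 'R'] += cl[i, k, 'R'] * cs[k, j, 'R']
            for k in range(i, j):           # grow a right-headed sequence leftward
                cs[i, j, 'L'] += cs[i, k, 'L'] * cl[k, j, 'L']
    return cl, cs
```

Each of the O($n^2$) spans is split at O($n$) positions, giving the O($n^3$) bound. Under this factorization, the total inside probability of all projective trees rooted at word $r$ is `cs[0, r, 'L'] * cs[r, n-1, 'R']`; the outside pass (not shown) runs analogous recursions top-down, and the inside-outside products, normalized by the sentence probability, yield the expected dependency counts used for reestimation.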
We evaluated the PDG reestimation algorithm on the task of learning a Korean probabilistic dependency grammar. The algorithm learned a stable grammar from an initial grammar, reducing the grammar size by 65% to 78% depending on the size of the training corpus. The correctness of the learned grammar was tested on unseen sentences: on the best-1 parses of the test sentences, the grammar achieved an average parsing accuracy of 62.82%. We also present a robust best-n parsing algorithm for PDG. PDG best-n parsing is based on dynamic programming over a CYK-style chart; it constructs the best-n complete-links and complete-sequences bottom-up and left-to-right. Parsing robustness is guaranteed by smoothing the observed dependency probabilities so that no dependency relation is assigned zero probability.
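The two ingredients of the robust best-n parser can be sketched independently. Below, `combine_best_n` merges the best-n candidate lists of two adjacent chart cells into the best-n list for the covering span, and `smoothed_dep_prob` illustrates one simple way to keep every dependency probability nonzero; the interpolation weight and the backoff estimate are illustrative assumptions, not the thesis's exact scheme.

```python
import heapq

def combine_best_n(left, right, link_prob, n):
    """Merge two cells' best-n lists of (probability, backpointer) candidates
    into the best-n candidates for the covering span (a hedged sketch of the
    dynamic-programming step, not the thesis's exact data structures)."""
    merged = [(lp * rp * link_prob, (lb, rb))
              for lp, lb in left for rp, rb in right]
    return heapq.nlargest(n, merged, key=lambda cand: cand[0])

def smoothed_dep_prob(head, dep, observed, backoff, lam=0.9):
    """Prevent zero probabilities by interpolating the observed dependency
    probability with a nonzero backoff estimate (e.g., a part-of-speech-level
    probability); lam = 0.9 is an assumed, illustrative weight."""
    return lam * observed.get((head, dep), 0.0) + (1 - lam) * backoff(head, dep)
```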
The automatically learned dependency grammar can be applied to language modeling. Language modeling based on dependency grammar can be more effective than the conventional n-gram approach and the phrase structure grammar-based approach. Dependency grammar can represent long-distance dependencies through the hierarchical structure induced by head-dependent relations between words, which n-gram models cannot capture, and the acquisition of an effective dependency grammar can be automated by the proposed PDG reestimation algorithm. Dependency grammar-based language modeling can avoid the training-corpus overfitting problem of n-gram models and is less affected by the structural-sparseness problem of the phrase structure grammar-based approach. Experiments on a raw Korean corpus showed that the proposed dependency grammar-based model outperforms n-gram models, with 11% to 11.5% reductions in test corpus entropy.
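The entropy comparison uses the standard per-word cross-entropy. A generic evaluation sketch follows, where `sentence_prob` stands in for whichever model, n-gram or dependency-based, is being measured.

```python
import math

def per_word_entropy(sentences, sentence_prob):
    """Test-corpus entropy in bits per word: H = -(1/N) * sum(log2 P(s)),
    where N is the total word count; lower entropy means a better model."""
    log2_total, num_words = 0.0, 0
    for words in sentences:
        log2_total += math.log2(sentence_prob(words))
        num_words += len(words)
    return -log2_total / num_words
```

Since perplexity is $2^H$, an 11% reduction in per-word entropy corresponds to a substantially lower perplexity.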