Enlisting the Ghost: Modeling Empty Categories for Machine Translation
Abstract
Empty categories (ECs) are artificial elements in the Penn Treebanks, motivated by
government-binding (GB) theory, that explain certain linguistic phenomena such as
pro-drop. ECs are ubiquitous in languages
like Chinese, but they are tacitly ignored
in most machine translation (MT) work
because of their elusive nature. In this
paper, we present a comprehensive treatment of ECs by first recovering them with
a structured MaxEnt model with a rich
set of syntactic and lexical features, and
then incorporating the predicted ECs into
a Chinese-to-English machine translation
task through multiple approaches, including the extraction of EC-specific sparse
features. We show that the recovered
empty categories not only improve
word alignment quality, but also lead to
significant improvements in a large-scale
state-of-the-art syntactic MT system.