A Database for Measuring Linguistic Information Content.
Venue
Language Resources and Evaluation Conference, ELDA, 330 W 58th St (2014)
Publication Year
2014
Authors
Richard Sproat, Bruno Cartoni, HyunJeong Choe, David Huynh, Linne Ha, Ravindran Rajakumar, Evelyn Wenzel-Grondie
Abstract
Which languages convey the most information in a given amount of space? This is a
question often asked of linguists, especially by engineers who usually have some
information-theoretic measure of "information" in mind, but rarely define exactly
how they would measure that information. The question is, in fact, remarkably hard
to answer, and many linguists consider it unanswerable. But it is a question that
seems as if it ought to have an answer. If one had a database of close translations
between a set of typologically diverse languages, with detailed marking of
morphosyntactic and morphosemantic features, one could hope to quantify the
differences in how these languages convey information. Since no
appropriate database exists, we decided to construct one. The purpose of this paper
is to present our work on the database, along with some preliminary results. We
plan to release the dataset once complete.
