A Comparison of Features for Automatic Readability Assessment
Venue
23rd International Conference on Computational Linguistics (COLING 2010), Poster Volume, pp. 276-284
Publication Year
2010
Authors
Lijun Feng, Martin Jansche, Matt Huenerfauth, Noémie Elhadad
Abstract
Several sets of explanatory variables – including shallow, language modeling, POS,
syntactic, and discourse features – are compared and evaluated in terms of their
impact on predicting the grade level of reading material for primary school
students. We find that features based on in-domain language models have the highest
predictive power. Entity density (a discourse feature) and POS features, in
particular those for nouns, are individually very useful but highly correlated. Average
sentence length (a shallow feature) is more useful – and less expensive to compute
– than individual syntactic features. A judicious combination of features examined
here results in a significant improvement over the state of the art.
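
As an illustration of why average sentence length is inexpensive to compute, the following is a minimal sketch that needs no parser or tagger. It is not the authors' implementation: the function name `average_sentence_length`, the regex-based sentence splitting, and the token definition are all assumptions made for this example, and a real system would likely use a proper sentence segmenter and tokenizer.

```python
import re


def average_sentence_length(text: str) -> float:
    """Return the mean number of word tokens per sentence.

    Hypothetical helper for illustration only; the splitting rules
    below are simplifying assumptions, not the method used in the paper.
    """
    # Assumption: a sentence ends at '.', '!' or '?' followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    if not sentences:
        return 0.0
    # Assumption: a token is a run of letters, digits, or apostrophes.
    token_counts = [len(re.findall(r"[A-Za-z0-9']+", s)) for s in sentences]
    return sum(token_counts) / len(sentences)


if __name__ == "__main__":
    sample = "The cat sat. The dog ran fast!"
    print(average_sentence_length(sample))
```

The feature requires only a single pass over the text, in contrast to syntactic features, which need a full parse of every sentence.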
