Publication Data
A Comparison of Features for Automatic Readability Assessment
Abstract: Several sets of explanatory variables – including shallow,
language modeling, POS, syntactic, and discourse features – are compared and evaluated
in terms of their impact on predicting the grade level of reading material for primary
school students. We find that features based on in-domain language models have the
highest predictive power. Entity density (a discourse feature) and POS features, in
particular those based on nouns, are individually very useful but highly correlated. Average sentence
length (a shallow feature) is more useful – and less expensive to compute – than
individual syntactic features. A judicious combination of the features examined here
results in a significant improvement over the state of the art.
