Publication Data
N-Gram Statistical Similarities and Differences between Chinese and English
In this paper, we present the results of analyzing the quantity and frequency of N-grams in 100 million randomly sampled English web pages and 100 million randomly sampled Chinese web pages. We found that the 1-gram and 2-gram frequency distributions differ sharply between Chinese and English; this is understandable, since one Chinese character does not constitute the equivalent of one English word. However, we found that the 3-gram and 4-gram frequency distributions are surprisingly similar between Chinese and English, leading us to conjecture that in both languages, frequent 3-grams and 4-grams represent a similar set of concepts. The distribution of the number of unique n-grams is quite different between English and Chinese. However, that distribution appears to indicate that, on average, 1.5 Chinese characters correspond to 1 English word.
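As a rough illustration of the two quantities discussed above (n-gram frequency and the number of unique n-grams), the following minimal sketch counts n-grams over a token sequence. It is not the authors' pipeline: the function names `ngrams` and `ngram_stats`, the toy sentences, and the tokenization are all assumptions made for illustration. In particular, it assumes that English is split into whitespace-delimited words and Chinese into individual characters, which the abstract suggests but does not state in detail.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return every contiguous run of n tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_stats(tokens, n):
    """Return (frequency table, number of unique n-grams) for a token list."""
    counts = Counter(ngrams(tokens, n))
    return counts, len(counts)


# Assumption: English tokens are whitespace-delimited words.
english_tokens = "the cat sat on the mat the cat slept".split()

# Assumption: Chinese tokens are single characters.
chinese_tokens = list("我爱北京我爱上海")

en_counts, en_unique = ngram_stats(english_tokens, 2)
zh_counts, zh_unique = ngram_stats(chinese_tokens, 2)

print("English 2-grams:", en_counts.most_common(3), "unique:", en_unique)
print("Chinese 2-grams:", zh_counts.most_common(3), "unique:", zh_unique)
```

Running the same functions for n = 1 through 4 on each corpus and comparing the resulting frequency tables and unique counts is one way to set up the kind of comparison the abstract describes, although a corpus of 100 million pages would need a distributed or streaming implementation rather than an in-memory `Counter`.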
