We present methods for partitioning a weighted finite-state transducer (WFST)
representation of an n-gram language model into multiple shards, each of which is a
stand-alone WFST n-gram model in its own right, allowing processing with existing
algorithms. After independent estimation, including normalization, smoothing and
pruning on each shard, the shards can be merged into a single WFST that is
identical to the model that would have resulted from estimation without sharding.
We then present an approach that uses data partitions in conjunction with WFST
sharding to estimate models on orders of magnitude more data than would have
otherwise been feasible with a single process. We report statistics on shard
characteristics for large models trained on a very large data set.
Functionality to support distributed n-gram modeling has been added to the OpenGrm