Multilingual Language Processing From Bytes
Venue
NAACL 2016 (to appear)
Publication Year
2016
Authors
Dan Gillick, Cliff Brunk, Oriol Vinyals, Amarnag Subramanya
Abstract
We describe an LSTM-based model which we call Byte-to-Span (BTS) that reads text as
bytes and outputs span annotations of the form [start, length, label] where start
positions, lengths, and labels are separate entries in our vocabulary. Because we
operate directly on Unicode bytes rather than language-specific words or
characters, we can analyze text in many languages with a single model. Because
the vocabulary is small, these multilingual models are very compact, yet they
produce results comparable to or better than state-of-the-art Part-of-Speech
tagging and Named Entity Recognition systems that use only the provided
training data (no external data sources). Our models learn “from scratch” in
that they rely on no elements of the standard Natural Language Processing
pipeline (including tokenization), and can therefore run standalone on raw text.
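To make the [start, length, label] output format concrete, here is a minimal sketch of how such a byte-level span annotation could be interpreted. The text, entity, and variable names are illustrative assumptions, not taken from the paper; the point is only that start and length index directly into the UTF-8 byte sequence rather than into tokens.

```python
# Hypothetical example of a byte-level span annotation in the
# [start, length, label] style described in the abstract.
text = "Barack Obama visited Paris."
data = text.encode("utf-8")  # model input: raw UTF-8 bytes, no tokenization

# A span annotation: start and length are byte offsets, label is an
# entity tag (here an assumed NER-style label "PER").
span = (0, 12, "PER")

start, length, label = span
entity = data[start:start + length].decode("utf-8")
print(entity, label)  # → Barack Obama PER
```

Indexing bytes rather than tokens is what lets a single model handle many languages: the annotation scheme needs no language-specific segmentation, only offsets into the raw byte stream.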
