Classifying with Confidence from Incomplete Test Data
Venue
Journal of Machine Learning Research (JMLR), vol. 14 (2013)
Publication Year
2013
Authors
Nathan Parrish, Hyrum S. Anderson, Maya R. Gupta, Dun Yu Hsiao
BibTeX
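A minimal entry assembled from the metadata listed on this page; the citation key is a placeholder and page numbers are omitted:

@article{parrish2013classifying,
  author  = {Nathan Parrish and Hyrum S. Anderson and Maya R. Gupta and Dun Yu Hsiao},
  title   = {Classifying with Confidence from Incomplete Test Data},
  journal = {Journal of Machine Learning Research},
  volume  = {14},
  year    = {2013}
}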
Abstract
We consider the classification problem given incomplete information about a test
sample. This problem arises naturally when data about the test sample is collected
over time, or when costs must be incurred to collect the data. For example, in a
distributed sensor network, only a fraction of the sensors may have reported
measurements at a certain time, and additional time, power, bandwidth, or some
other cost must be incurred to collect the complete data for classification. A
practical goal is to assign a class label as soon as enough data is available to
make a good decision. We formalize this goal through the notion of reliability: the
probability that the label assigned to the incomplete data matches the label that
would be assigned to the complete data. We propose a method that classifies the
incomplete data only once a given reliability threshold is met. Our approach models
the complete data as a random variable whose distribution depends on the current
incomplete data and the (complete) training data. The method differs from standard
imputation strategies in that our focus is on determining the reliability of the
classification decision, rather than just the class label. We show that the method
provides useful estimates of the correctness of the imputed class labels in a set of
experiments on time-series datasets, where the goal is to classify each time series
as early as possible while still guaranteeing that the reliability threshold is met.
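
To make the reliability idea concrete, the following is a small Python sketch of the general approach described in the abstract, not the paper's exact estimator: the unobserved suffix of a test time series is modeled as a Gaussian conditioned on the observed prefix (with the Gaussian fit to the complete training data), sampled completions are labeled by a classifier, reliability is estimated as the fraction of completions that agree on the most common label, and a label is emitted only once that fraction clears the threshold. The Gaussian model, the logistic-regression classifier, and the Monte Carlo estimate are illustration choices, not the method from the paper.

import numpy as np
from collections import Counter
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Toy complete training data: two classes of d-dimensional "time series".
n, d = 200, 8
trend = np.linspace(0.0, 1.0, d)
X = np.vstack([rng.normal(size=(n, d)) + trend,
               rng.normal(size=(n, d)) - trend])
y = np.array([0] * n + [1] * n)

clf = LogisticRegression(max_iter=1000).fit(X, y)

# Gaussian model of the complete data, used to describe the unseen suffix.
mu = X.mean(axis=0)
Sigma = np.cov(X, rowvar=False) + 1e-6 * np.eye(d)

def classify_if_reliable(x_obs, threshold=0.9, n_samples=500):
    """Label the partially observed series x_obs (its first len(x_obs) entries)
    only if the estimated reliability reaches the threshold."""
    t = len(x_obs)
    if t == d:  # complete data: classify directly
        return int(clf.predict(x_obs.reshape(1, -1))[0]), 1.0
    obs, mis = np.arange(t), np.arange(t, d)
    # Conditional Gaussian of the missing suffix given the observed prefix.
    K = Sigma[np.ix_(mis, obs)] @ np.linalg.inv(Sigma[np.ix_(obs, obs)])
    mean = mu[mis] + K @ (x_obs - mu[obs])
    cov = Sigma[np.ix_(mis, mis)] - K @ Sigma[np.ix_(mis, obs)].T
    # Monte Carlo estimate of reliability: the fraction of sampled
    # completions whose predicted label matches the most common label.
    completions = rng.multivariate_normal(mean, cov, size=n_samples)
    full = np.hstack([np.tile(x_obs, (n_samples, 1)), completions])
    label, count = Counter(clf.predict(full)).most_common(1)[0]
    reliability = count / n_samples
    if reliability >= threshold:
        return int(label), reliability
    return None, reliability

# Reveal a test series one time step at a time and stop as soon as the
# reliability threshold is met.
x_complete = rng.normal(size=d) + trend
for t in range(1, d + 1):
    label, rel = classify_if_reliable(x_complete[:t])
    print(f"t={t}  reliability={rel:.2f}  label={label}")
    if label is not None:
        break

In this sketch the classifier typically labels the series before all d samples have been observed, with the stopping time controlled by the threshold: raising it trades later decisions for a higher probability that the early label matches the label the complete series would receive.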
