On the necessity of irrelevant variables
Abstract
This work explores the effects of relevant and
irrelevant boolean variables on the accuracy of classifiers.
The analysis uses the assumption that the variables are
conditionally independent given the class, and focuses on a
natural family of learning algorithms for such sources when the relevant variables have a small advantage over random guessing.
The main result is that algorithms relying predominately on irrelevant variables have error probabilities that quickly go to 0 in situations where algorithms that limit the use of irrelevant variables
have errors bounded below by a positive constant. We also show that
accurate learning is possible even when there are so few examples that
one cannot determine with high confidence whether or not any
individual variable is relevant.
irrelevant boolean variables on the accuracy of classifiers.
The analysis uses the assumption that the variables are
conditionally independent given the class, and focuses on a
natural family of learning algorithms for such sources when the relevant variables have a small advantage over random guessing.
The main result is that algorithms relying predominately on irrelevant variables have error probabilities that quickly go to 0 in situations where algorithms that limit the use of irrelevant variables
have errors bounded below by a positive constant. We also show that
accurate learning is possible even when there are so few examples that
one cannot determine with high confidence whether or not any
individual variable is relevant.