Large-Scale Automatic Classification of Phishing Pages
Venue
NDSS '10 (2010)
Publication Year
2010
Authors
Colin Whittaker, Brian Ryner, Marria Nazif
Abstract
Phishing websites, fraudulent sites that trick viewers into interacting with them,
continue to cost Internet users over a billion dollars each year. In this paper, we
describe the design and performance characteristics of a scalable machine learning
classifier we developed to detect phishing websites. We use this classifier to
maintain Google's phishing blacklist automatically. Our classifier analyzes
millions of pages a day, examining the URL and the contents of a page to determine
whether or not a page is phishing. Unlike previous work in this field, we train the
classifier on a noisy dataset consisting of millions of samples from previously
collected live classification data. Despite the noise in the training data, our
classifier learns a robust model that correctly classifies more than 90% of
phishing pages several weeks after training concludes.
