Publication Data
Large-Scale Automatic Classification of Phishing Pages
Abstract: Phishing websites, fraudulent sites that trick viewers into
interacting with them, continue to cost Internet users over a billion dollars each
year. In this paper, we describe the design and performance characteristics of a
scalable machine learning classifier we developed to detect phishing websites. We use
this classifier to maintain Google's phishing blacklist automatically. Our classifier
analyzes millions of pages a day, examining the URL and the contents of each page to
determine whether the page is phishing. Unlike previous work in this field, we
train the classifier on a noisy dataset consisting of millions of samples from
previously collected live classification data. Despite the noise in the training data,
our classifier learns a robust model for identifying phishing pages, one that correctly
classifies more than 90% of phishing pages several weeks after training concludes.
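
As a rough illustration of what "examining the URL and the contents of a page" could look like in practice, the sketch below derives a few features from a URL and its HTML. The function name extract_features and all four features are hypothetical choices made for this example; the abstract does not say which features the classifier actually uses.

```python
import re
from urllib.parse import urlparse


def extract_features(url: str, html: str) -> dict:
    """Derive a few illustrative (hypothetical) features from a URL and page HTML."""
    host = urlparse(url).hostname or ""
    return {
        # Assumption: a bare IPv4 address as the host is treated as a signal.
        "host_is_ip": bool(re.fullmatch(r"\d{1,3}(\.\d{1,3}){3}", host)),
        # Number of dot-separated labels in the host name.
        "num_host_labels": len(host.split(".")) if host else 0,
        # Total length of the URL string.
        "url_length": len(url),
        # Assumption: a password input field is treated as a content signal.
        "has_password_field": bool(
            re.search(r'<input[^>]*type=["\']?password', html, re.IGNORECASE)
        ),
    }


suspicious = extract_features("http://192.168.0.1/login", '<input type="password">')
benign = extract_features("https://www.example.com/", "<p>hello</p>")
```

A feature dictionary like this would then be fed to a trained model; the training step on noisy labels is not shown here.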
