<?xml version="1.0" encoding="UTF-8" ?>
<feed xmlns="http://www.w3.org/2005/Atom">
<id>http://research.google.com/pubs/atom.xml</id>
<title type="text">Recent Google Publications (Atom)</title>
<updated>2009-11-12T17:20:24-08:00</updated>
<link href="http://research.google.com/pubs/atom.xml" rel="self"/>
<link type="text/html" rel="alternate" href="http://research.google.com/pubs/papers.html"/>
<entry>
<title><![CDATA[Large-Scale Automatic Classification of Phishing Pages]]></title>
<updated>2009-11-12T11:30:19-08:00</updated>
<id>urn:googlelabs:35580</id>
<summary>Phishing websites, fraudulent sites that trick viewers into interacting with them, continue to cost Internet users over a billion dollars each year. In this paper, we describe the design and performance characteristics of a scalable machine learning classifier we developed to detect phishing web sites. We use this classifier to maintain Google&#39;s phishing blacklist automatically. Our classifier analyzes millions of pages a day, examining the URL and the contents of a page to determine whether or not a page is phishing. Unlike previous work in this field, we train the classifier on a noisy dataset consisting of millions of samples from previously collected live classification data. Despite the noise in the training data, our classifier learns a robust model for identifying phishing pages which correctly classifies more than 90% of phishing pages several weeks after training concludes.</summary>
<link rel="alternate" href="http://www.google.com/search?lr=&amp;ie=UTF-8&amp;oe=UTF-8&amp;q=Large-Scale+Automatic+Classification+of+Phishing+Pages+Whittaker+Ryner+Nazif" type="text/html" title="Search for publication"/>
<category term="Security, Cryptography, and Privacy" label="Security, Cryptography, and Privacy"/>
<author><name>Colin Whittaker, Brian Ryner, Marria Nazif</name></author>
</entry>
<entry>
<title><![CDATA[Efficient Large-Scale Distributed Training of Conditional Maximum Entropy Models]]></title>
<updated>2009-11-11T10:17:44-08:00</updated>
<id>urn:googlelabs:35648</id>
<summary>Efficient Large-Scale Distributed Training of Conditional Maximum Entropy Models, Gideon Mann, Ryan McDonald, Mehryar Mohri, Nathan Silberman, Daniel Walker IV</summary>
<link rel="alternate" href="http://www.google.com/search?lr=&amp;ie=UTF-8&amp;oe=UTF-8&amp;q=Efficient+Large-Scale+Distributed+Training+of+Conditional+Maximum+Entropy+Models+Mann+McDonald+Mohri+Silberman+IV" type="text/html" title="Search for publication"/>
<category term="Machine Learning" label="Machine Learning"/>
<author><name>Gideon Mann, Ryan McDonald, Mehryar Mohri, Nathan Silberman, Daniel Walker IV</name></author>
</entry>
<entry>
<title><![CDATA[Video2Text: Learning to Annotate Video Content]]></title>
<updated>2009-11-07T23:55:04-08:00</updated>
<id>urn:googlelabs:35638</id>
<summary>This paper discusses a new method for automatic
discovery and organization of descriptive concepts (labels)
within large real-world corpora of user-uploaded multimedia,
such as YouTube.com. Conversely, it also provides validation
of existing labels, if any. While training, our method does not
assume any explicit manual annotation other than the weak
labels already available in the form of video title, description,
and tags. Prior work related to such auto-annotation
assumed that a vocabulary of labels of interest (e.g., indoor,
outdoor, city, landscape) is specified a priori. In contrast,
the proposed method begins with an empty vocabulary. It
analyzes audiovisual features of 25 million YouTube.com videos
– nearly 150 years of video data – effectively searching for
consistent correlation between these features and text metadata.
It autonomously extends the label vocabulary as and when it
discovers concepts it can reliably identify, eventually leading
to a vocabulary with thousands of labels and growing. We
believe that this work significantly extends the state of the art
in multimedia data mining, discovery, and organization based
on the technical merit of the proposed ideas as well as the
enormous scale of the mining exercise in a very challenging,
unconstrained, noisy domain.</summary>
<link rel="alternate" href="http://www.google.com/search?lr=&amp;ie=UTF-8&amp;oe=UTF-8&amp;q=Video2Text%3A+Learning+to+Annotate+Video+Content+Aradhye+Toderici+Yagnik" type="text/html" title="Search for publication"/>
<category term="Data Mining" label="Data Mining"/>
<author><name>Hrishikesh Aradhye, George Toderici, Jay Yagnik</name></author>
</entry>
<entry>
<title><![CDATA[Automatic, Efficient, Temporally-Coherent Video Enhancement for Large Scale Applications]]></title>
<updated>2009-11-03T09:08:53-08:00</updated>
<id>urn:googlelabs:35264</id>
<summary>A fast and robust method for video contrast enhancement is presented.
The method uses the histogram of each frame, along with upper and lower
bounds computed per shot in order to enhance the current frame. This
ensures that the artifacts introduced during the enhancement are reduced
to a minimum. Traditional methods that do not compute per-shot estimates
tend to over-enhance parts of the video such as fades and transitions.
Our method does not suffer from this problem, which is essential for a
fully automatic algorithm. We present the parameters for our method
that yielded the best human feedback: out of 208 videos,
203 were enhanced, while the remaining 5 were of too poor quality to
be enhanced. Additionally, we present a visual comparison of our work with
the recently-proposed Weighted Thresholded Histogram Equalization (WTHE) algorithm.</summary>
<link rel="alternate" href="http://www.google.com/search?lr=&amp;ie=UTF-8&amp;oe=UTF-8&amp;q=Automatic%2C+Efficient%2C+Temporally-Coherent+Video+Enhancement+for+Large+Scale+Applications+Toderici+Yagnik" type="text/html" title="Search for publication"/>
<category term="Computer Vision" label="Computer Vision"/>
<author><name>George Toderici, Jay Yagnik</name></author>
</entry>
<entry>
<title><![CDATA[An Online Algorithm for Large Scale Image Similarity Learning]]></title>
<updated>2009-10-28T12:08:37-08:00</updated>
<id>urn:googlelabs:35311</id>
<summary>Learning a measure of similarity between pairs of objects is a
  fundamental problem in machine learning. It lies at the core of
  classification methods such as kernel machines, and is particularly
  useful for applications like searching for images that are similar
  to a given image or finding videos that are relevant to a given
  video. In these tasks, users look for objects that are not only
  visually similar but also semantically related to a given
  object. Unfortunately, current approaches for learning similarity do
  not scale to large datasets, especially when imposing metric
  constraints on the learned similarity.
  We describe OASIS, a method for learning pairwise similarity that is
  fast and scales linearly with the number of objects and the number of
  non-zero features. Scalability is achieved through online learning of a
  bilinear model over sparse representations using a large margin
  criterion and an efficient hinge loss cost. OASIS is accurate at a
  wide range of scales: on a standard benchmark with thousands of
  images, it is more precise than state-of-the-art methods, and faster
  by orders of magnitude. On 2 million images collected from the web,
  OASIS can be trained within 3 days on a single CPU. The non-metric
  similarities learned by OASIS can be transformed into metric
  similarities, achieving higher precisions than similarities that are
  learned as metrics in the first place. This suggests an approach for
  learning a metric from data that is larger by two orders of magnitude
  than was handled before.</summary>
<link rel="alternate" href="http://www.google.com/search?lr=&amp;ie=UTF-8&amp;oe=UTF-8&amp;q=An+Online+Algorithm+for+Large+Scale+Image+Similarity+Learning+Chechik+Sharma+Shalit+Bengio" type="text/html" title="Search for publication"/>
<category term="Machine Learning" label="Machine Learning"/>
<author><name>Gal Chechik, Varun Sharma, Uri Shalit, Samy Bengio</name></author>
</entry>
<entry>
<title><![CDATA[Group Sparse Coding]]></title>
<updated>2009-10-28T12:08:25-08:00</updated>
<id>urn:googlelabs:35313</id>
<summary>Group Sparse Coding, Samy Bengio, Fernando Pereira, Yoram Singer, Dennis Strelow</summary>
<link rel="alternate" href="http://www.google.com/search?lr=&amp;ie=UTF-8&amp;oe=UTF-8&amp;q=Group+Sparse+Coding+Bengio+Pereira+Singer+Strelow" type="text/html" title="Search for publication"/>
<category term="Machine Learning" label="Machine Learning"/>
<author><name>Samy Bengio, Fernando Pereira, Yoram Singer, Dennis Strelow</name></author>
</entry>
<entry>
<title><![CDATA[Competitive buffer management with packet dependencies]]></title>
<updated>2009-10-24T16:29:37-08:00</updated>
<id>urn:googlelabs:35632</id>
<summary>Competitive buffer management with packet dependencies, Alex Kesselman, Boaz Patt-Shamir, Gabriel Scalosub</summary>
<link rel="alternate" href="http://www.google.com/search?lr=&amp;ie=UTF-8&amp;oe=UTF-8&amp;q=Competitive+buffer+management+with+packet+dependencies+Kesselman+Patt-Shamir+Scalosub" type="text/html" title="Search for publication"/>
<category term="Algorithms" label="Algorithms"/>
<author><name>Alex Kesselman, Boaz Patt-Shamir, Gabriel Scalosub</name></author>
</entry>
<entry>
<title><![CDATA[Creating a High-Quality Machine Translation System for a Low-Resource Language: Yiddish]]></title>
<updated>2009-10-23T18:07:03-08:00</updated>
<id>urn:googlelabs:35627</id>
<summary>Creating a High-Quality Machine Translation System for a Low-Resource Language: Yiddish, Dmitriy Genzel, Klaus Macherey, Jakob Uszkoreit</summary>
<link rel="alternate" href="http://www.mt-archive.info/MTS-2009-Genzel.pdf"/>
<category term="Natural Language Processing" label="Natural Language Processing"/>
<author><name>Dmitriy Genzel, Klaus Macherey, Jakob Uszkoreit</name></author>
</entry>
<entry>
<title><![CDATA[Web-scale extraction of structured data.]]></title>
<updated>2009-10-20T23:05:44-08:00</updated>
<id>urn:googlelabs:35625</id>
<summary>Web-scale extraction of structured data., Michael Cafarella, Jayant Madhavan, Alon Halevy</summary>
<link rel="alternate" href="http://www.google.com/search?lr=&amp;ie=UTF-8&amp;oe=UTF-8&amp;q=Web-scale+extraction+of+structured+data.+Cafarella+Madhavan+Halevy" type="text/html" title="Search for publication"/>
<category term="Hypertext and the Web" label="Hypertext and the Web"/>
<author><name>Michael Cafarella, Jayant Madhavan, Alon Halevy</name></author>
</entry>
<entry>
<title><![CDATA[Exploring Schema Repositories with Schemr]]></title>
<updated>2009-10-20T23:02:49-08:00</updated>
<id>urn:googlelabs:35624</id>
<summary>Exploring Schema Repositories with Schemr, Kuang Chen, Jayant Madhavan, Alon Halevy</summary>
<link rel="alternate" href="http://www.google.com/search?lr=&amp;ie=UTF-8&amp;oe=UTF-8&amp;q=Exploring+Schema+Repositories+with+Schemr+Chen+Madhavan+Halevy" type="text/html" title="Search for publication"/>
<category term="Data and System Management" label="Data and System Management"/>
<author><name>Kuang Chen, Jayant Madhavan, Alon Halevy</name></author>
</entry>
</feed>
