Marc Najork

Marc Najork is a Distinguished Research Scientist at Google DeepMind. Previously, he was a Senior Director of Research Engineering at Google Research. Before joining Google, Marc was a Principal Researcher at Microsoft Research (2001-2014) and prior to that a Researcher at the DEC/Compaq Systems Research Center (1993-2001). Marc earned a Ph.D. in Computer Science from the University of Illinois. He is an ACM Fellow, an IEEE Fellow, and a member of the SIGIR Academy. His service activities include Editor-in-Chief of the ACM Transactions on the Web (2011-2014), news board co-chair of the Communications of the ACM (2008-2014), conference chair of WSDM 2008, and program co-chair of WWW 2004 and WWW 2021.
Authored Publications
    Job Type Extraction for Service Businesses
    Yaping Qi
    Hayk Zakaryan
    Yonghua Wu
    Companion Proceedings of the ACM Web Conference 2023
    Preview abstract Google My Business (GMB) is a platform that allows business owners to manage their business profiles, which will be displayed when a user issues a relevant query on Google Search or Maps. Many GMB businesses provide diverse services, from home cleaning and plumbing to legal services and education. However, the exact service content, which we call job types, is often missing in their profiles. This leaves the burden of finding such content to the users, either through the tedious work of scanning through business websites or time-consuming calls to the owners. In the present paper, we describe how we build a pipeline to automatically extract the job types from the websites of business owners and how we solve scalability issues for deployment. Rather than focusing on developing novel and sophisticated machine learning models, we share the various challenges we have faced and practical experiences of building such a pipeline, including the cold start problem of dataset collection with limited human annotation resources, scalability, reaching a launch bar of high precision, and building a general pipeline with reasonable coverage of any free-text web pages without relying on the Document Object Model (DOM) structure. Given these challenges, standard approaches for information extraction do not directly apply or are not scalable enough to be served. In this paper, we show how we address these challenges in different stages of the extraction pipeline, including: (1) utilizing structured content like tables and lists to tackle the cold start problem of dataset collection; (2) exploiting various context information to improve model performance without hurting scalability; and (3) formulating the extraction problem as a retrieval task to improve generalizability, efficiency, and coverage. The pipeline has been successfully deployed, and is scalable enough to be refreshed every few days to extract the latest online information. The extracted job types are serving millions of users of Google Search and Google Maps with at least three use cases: (1) job types of a place are directly displayed on mobile devices; (2) job types provide an explanation as to why a place shows up for a given query; and (3) job types are used as a signal to rank business places. According to a user survey, the displayed job types have greatly increased the probability of a user hiring a service provider. View details
    Creator Context for Tweet Recommendation
    Matt Colen
    Sergey Levi
    Vladimir Ofitserov
    Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing: Industry Track
    Preview abstract When discussing a tweet, people usually not only refer to the content it delivers, but also to the person behind the tweet. In other words, grounding the interpretation of the tweet in the context of its creator plays an important role in deciphering the true intent and the importance of the tweet. In this paper, we attempt to answer the question of how creator context should be used to advance tweet understanding. Specifically, we investigate the usefulness of different types of creator context, and examine different model structures for incorporating creator context in tweet modeling. We evaluate our tweet understanding models on a practical use case -- recommending relevant tweets to news articles. This use case already exists in popular news apps, and can also serve as a useful assistive tool for journalists. We discover that creator context is essential for tweet understanding, and can improve application metrics by a large margin. However, we also observe that not all creator contexts are equal. Creator context can be time sensitive and noisy. Careful creator context selection and deliberate model structure design play an important role in creator context effectiveness. View details
    Regression Compatible Listwise Objectives for Calibrated Ranking with Binary Relevance
    Pratyush Kar
    Bing-Rong Lin
    Proceedings of the 32nd ACM International Conference on Information and Knowledge Management (2023)
    Preview abstract As Learning-to-Rank (LTR) approaches primarily seek to improve ranking quality, their output scores are not scale-calibrated by design. This fundamentally limits LTR usage in score-sensitive applications. Though a simple multi-objective approach that combines a regression and a ranking objective can effectively learn scale-calibrated scores, we argue that the two objectives are not necessarily compatible, which makes the trade-off less ideal for either of them. In this paper, we propose a practical regression compatible ranking (RCR) approach that achieves a better trade-off, where the two ranking and regression components are proved to be mutually aligned. Although the same idea applies to ranking with both binary and graded relevance, we mainly focus on binary labels in this paper. We evaluate the proposed approach on several public LTR benchmarks and show that it consistently achieves either best or competitive result in terms of both regression and ranking metrics, and significantly improves the Pareto frontiers in the context of multi-objective optimization. Furthermore, we evaluated the proposed approach on YouTube Search and found that it not only improved the ranking quality of the production pCTR model, but also brought gains to the click prediction accuracy. The proposed approach has been successfully deployed in the YouTube production system. View details
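    The abstract above does not spell out the loss construction, so the following is only a minimal numpy sketch of the general idea it describes -- combining a pointwise regression objective (to keep scores calibrated against binary labels) with a listwise ranking objective over one query's candidate list. The equal weighting and the particular component losses are illustrative assumptions, not the paper's RCR formulation.

      import numpy as np

      def sigmoid(x):
          return 1.0 / (1.0 + np.exp(-x))

      def calibrated_ranking_loss(scores, labels, alpha=0.5):
          # Pointwise regression term: binary cross-entropy ties the score
          # scale to the binary relevance/click labels (calibration).
          p = sigmoid(scores)
          regression = -np.mean(labels * np.log(p) + (1 - labels) * np.log(1 - p))
          # Listwise ranking term: softmax cross-entropy optimizes the ordering.
          log_softmax = scores - np.log(np.sum(np.exp(scores)))
          ranking = -np.sum(labels * log_softmax) / max(np.sum(labels), 1.0)
          return alpha * regression + (1 - alpha) * ranking

      scores = np.array([0.2, -1.1, 1.5])   # model outputs for one query
      labels = np.array([0.0, 0.0, 1.0])    # binary relevance labels
      print(calibrated_ranking_loss(scores, labels))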
    Preview abstract Automatic headline generation enables users to comprehend ongoing news events promptly and has recently become an important task in web mining and natural language processing. With the growing need for news headline generation, we argue that the hallucination issue, namely the generated headlines not being supported by the original news stories, is a critical challenge for the deployment of this feature in web-scale systems. Meanwhile, due to the infrequency of hallucination cases and the careful reading required for raters to reach the correct consensus, it is difficult to acquire a large dataset for training a model to detect such hallucinations through human curation. In this work, we present a new framework named ExHalder to address this challenge for headline hallucination detection. ExHalder adapts the knowledge from public natural language inference datasets into the news domain and learns to generate natural language sentences to explain the hallucination detection results. To evaluate the model performance, we carefully collect a dataset with more than six thousand labeled "article, headline" pairs. Extensive experiments on this dataset and another six public ones demonstrate that ExHalder can identify hallucinated headlines accurately and justify its predictions with human-readable natural language explanations. View details
    Preview abstract Unbiased learning to rank (ULTR) studies the problem of mitigating various biases from implicit user feedback data such as clicks, and has been receiving considerable attention recently. A popular ULTR approach for real-world applications uses a two-tower architecture, where click modeling is factorized into a relevance tower with regular input features, and a bias tower with bias-relevant inputs such as the position of a document. A successful factorization will allow the relevance tower to be exempt from biases. In this work, we identify a critical issue that existing ULTR methods ignored - the bias tower can be confounded with the relevance tower via the underlying true relevance. In particular, the positions were determined by the logging policy, i.e., the previous production model, which would possess relevance information. We give both theoretical analysis and empirical results to show the negative effects on relevance tower due to such a correlation. We then propose two methods to mitigate the negative confounding effects by better disentangling relevance and bias. Offline empirical results on both controlled public datasets and a large-scale industry dataset show the effectiveness of the proposed approaches. We conduct a live experiment on a popular web store for four weeks, and find a significant improvement in user clicks over the baseline, which ignores the negative confounding effect. View details
    Preview abstract Historically, information retrieval systems have all followed the same paradigm: information seekers frame their needs in the form of a short query, the system selects a small set of relevant results from a corpus of available documents, rank-orders the results by decreasing relevance, possibly excerpts a responsive passage for each result, and returns a list of references and excerpts to the user. Retrieval systems typically did not attempt fusing information from multiple documents into an answer and displaying that answer directly. This was largely due to available technology: at the core of each retrieval system is an index that maps lexical tokens or semantic embeddings to document identifiers. Indices are designed for retrieving responsive documents; they do not support integrating these documents into a holistic answer. More recently, the coming-of-age of deep neural networks has dramatically improved the capabilities of large language models (LLMs). Trained on a large corpus of documents, these models not only memorize the vocabulary, morphology and syntax of human languages, but have shown to be able to memorize facts and relations [2]. Generative language models, when provided with a prompt, will extend the prompt with likely completions – an ability that can be used to extract answers to questions from the model. Two years ago, Metzler et al. argued that this ability of LLMs will allow us to rethink the search paradigm: to answer information needs directly rather that directing users to responsive primary sources [1]. Their vision was not without controversy; the following year Shaw and Bender argued that such a system is neither feasible nor desirable [3]. Nonetheless, the past year has seen the emergence of such systems, with offerings from established search engines and multiple new entrants to the industry. The keynote will summarize the short history of these generative information retrieval systems, and focus on the many open challenges in this emerging field: ensuring that answers are grounded, attributing answer passages to a primary source, providing nuanced answers to non-factoid-seeking questions, avoiding bias, and going beyond simple regurgitation of memorized facts. It will also touch on the changing nature of the content ecosystem. LLMs are starting to be used to generate web content. Should search engines treat such derived content equal to human-authored content? Is it possible to distinguish generated from original content? How should we view hybrid authorship where humans contribute ideas and LLMs shape these ideas into prose? And how will this parallel technical evolution of search engines and content ecosystems affect their respective business models? View details
    Preview abstract Comparative decisions, such as picking between two cars or deciding between two hiking trails, require the users to visit multiple webpages and contrast the choices along relevant aspects. Given the impressive capabilities of pre-trained large language models, we ask whether they can help automate such analysis. We refer to this task as extractive aspect-based contrastive summarization which involves constructing a structured summary that compares the choices along relevant aspects. In this paper, we propose a novel method called STRUM for this task that can generalize across domains without requiring any human-written summaries or fixed aspect list as supervision. Given a set of relevant input webpages, STRUM solves this problem using two pre-trained T5-based large language models: first one fine-tuned for aspect and value extraction, and second one fine-tuned for natural language inference. We showcase the abilities of our method across different domains, identify shortcomings, and discuss questions that we believe will be critical in this new line of research. View details
    DSI++: Updating Transformer Memory with New Documents
    Yi Tay
    Jinfeng Rao
    Emma Strubell
    Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing
    Preview abstract Differentiable Search Indices (DSIs) encode a corpus of documents in model parameters and use the same model to answer user queries directly. Despite the strong performance of DSI models, deploying them in situations where the corpus changes over time is computationally expensive because reindexing the corpus requires re-training the model. In this work, we introduce DSI++, a continual learning challenge for DSI to incrementally index new documents while being able to answer queries related to both previously and newly indexed documents. Across different model scales and document identifier representations, we show that continual indexing of new documents leads to considerable forgetting of previously indexed documents. We also hypothesize and verify that the model experiences forgetting events during training, leading to unstable learning. To mitigate these issues, we investigate two approaches. The first focuses on modifying the training dynamics. Flatter minima implicitly alleviate forgetting, so we optimize for flatter loss basins and show that the model stably memorizes more documents (+12%). Next, we introduce a generative memory to sample pseudo-queries for documents and supplement them during continual indexing to prevent forgetting for the retrieval task. Extensive experiments on novel continual indexing benchmarks based on Natural Questions (NQ) and MS MARCO demonstrate that our proposed solution mitigates forgetting significantly. Concretely, it improves the average Hits@10 by +21.1% over competitive baselines for NQ and requires 6 times fewer model updates compared to re-training the DSI model for incrementally indexing five corpora in a sequence. View details
    Preview abstract Query-document relevance prediction is a critical problem in Information Retrieval systems. This problem has increasingly been tackled using (pretrained) transformer-based models which are finetuned using large collections of labeled data. However, in specialized domains such as e-commerce and healthcare, the viability of this approach is limited by the dearth of large in-domain data. To address this paucity, recent methods leverage these powerful models to generate high-quality task and domain-specific synthetic data. Prior work has largely explored synthetic data generation or query generation (QGen) for Question-Answering (QA) and binary (yes/no) relevance prediction, where for instance, the QGen models are given a document, and trained to generate a query relevant to that document. However, in many problems, we have a more fine-grained notion of relevance than a simple yes/no label. Thus, in this work, we conduct a detailed study into how QGen approaches can be leveraged for nuanced relevance prediction. We demonstrate that – contrary to claims from prior works – current QGen approaches fall short of the more conventional cross-domain transfer-learning approaches. Via empirical studies spanning three public e-commerce benchmarks, we identify new shortcomings of existing QGen approaches – including their inability to distinguish between different grades of relevance. To address this, we introduce label-conditioned QGen models which incorporate knowledge about the different grades of relevance. While our experiments demonstrate that these modifications help improve the performance of QGen techniques, we also find that QGen approaches struggle to capture the full nuance of the relevance label space and, as a result, the generated queries are not faithful to the desired relevance label. View details
    Generative Information Retrieval (abstract)
    Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval (2023), pp. 1
    Preview abstract Historically, information retrieval systems have all followed the same paradigm: information seekers frame their needs in the form of a short query, the system selects a small set of relevant results from a corpus of available documents, rank-orders the results by decreasing relevance, possibly excerpts a responsive passage for each result, and returns a list of references and excerpts to the user. Retrieval systems typically did not attempt fusing information from multiple documents into an answer and displaying that answer directly. This was largely due to available technology: at the core of each retrieval system is an index that maps lexical tokens or semantic embeddings to document identifiers. Indices are designed for retrieving responsive documents; they do not support integrating these documents into a holistic answer. More recently, the coming-of-age of deep neural networks has dramatically improved the capabilities of large language models (LLMs). Trained on a large corpus of documents, these models not only memorize the vocabulary, morphology and syntax of human languages, but have shown to be able to memorize facts and relations [HowMuchKnowledge]. Generative language models, when provided with a prompt, will extend the prompt with likely completions -- an ability that can be used to extract answers to questions from the model. Two years ago, Metzler et al. argued that this ability of LLMs will allow us to rethink the search paradigm: to answer information needs directly rather than directing users to responsive primary sources [RethinkingSearch]. Their vision was not without controversy; the following year Shaw and Bender argued that such a system is neither feasible nor desirable [SituatingSearch]. Nonetheless, the past year has seen the emergence of such systems, with offerings from established search engines and multiple new entrants to the industry. The keynote will briefly summarize the short history of these generative information retrieval systems, and focus on the many open challenges in this emerging field: ensuring that answers are grounded, attributing answer passages to a primary source, providing nuanced answers to non-factoid-seeking questions, avoiding bias, and going beyond simple regurgitation of memorized facts. It will also touch on the changing nature of the content ecosystem. LLMs are starting to be used to generate web content. Should search engines treat such derived content equal to human-authored content? Is it possible to distinguish generated from original content? How should we view hybrid authorship where humans contribute ideas and LLMs shape these ideas into prose? And how will this parallel technical evolution of search engines and content ecosystems affect their respective business models? View details
    End-to-End Query Term Weighting
    Karan Samel
    Swaraj Khadanga
    Wensong Xu
    Xingyu Wang
    Kashyap Kolipaka
    Proceedings of the 29th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '23) (2023)
    Preview abstract Bag-of-words based lexical retrieval systems are still the most commonly used methods for real-world search applications. Recently deep learning methods have shown promising results to improve this retrieval performance but are expensive to run in an online fashion, non-trivial to integrate into existing production systems, and might not generalize well in out-of-domain retrieval scenarios. Instead, we build on top of lexical retrievers by proposing a Term Weighting BERT (TW-BERT) model. TW-BERT learns to predict the weight for individual n-gram (e.g., uni-grams and bi-grams) query input terms. These inferred weights and terms can be used directly by a retrieval system to perform a query search. To optimize these term weights, TW-BERT incorporates the scoring function used by the search engine, such as BM25, to score query-document pairs. Given sample query-document pairs we can compute a ranking loss over these matching scores, optimizing the learned query term weights in an end-to-end fashion. Aligning TW-BERT with search engine scorers minimizes the changes needed to integrate it into existing production applications, whereas existing deep learning based search methods would require further infrastructure optimization and hardware requirements. The learned weights can be easily utilized by standard lexical retrievers and by other retrieval techniques such as query expansion. We show that TW-BERT improves retrieval performance over strong term weighting baselines within MSMARCO and in out-of-domain retrieval on TREC datasets. View details
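    As a rough illustration of how such learned term weights plug into a lexical scorer, the sketch below scores a document with a BM25-style function in which each query term contributes in proportion to a weight that a TW-BERT-style model would predict. The weights, corpus statistics, and simplified scoring here are hypothetical; the paper integrates with the production scoring function directly.

      import math
      from collections import Counter

      def weighted_bm25(query_weights, doc_tokens, doc_len, avg_doc_len,
                        df, n_docs, k1=1.2, b=0.75):
          # Standard BM25 term saturation and IDF, but each query term is
          # scaled by its learned weight instead of an implicit weight of 1.
          tf = Counter(doc_tokens)
          score = 0.0
          for term, weight in query_weights.items():
              if tf[term] == 0:
                  continue
              idf = math.log(1 + (n_docs - df.get(term, 0) + 0.5) / (df.get(term, 0) + 0.5))
              sat = tf[term] * (k1 + 1) / (tf[term] + k1 * (1 - b + b * doc_len / avg_doc_len))
              score += weight * idf * sat
          return score

      # Hypothetical predicted weights for the query "running shoes women".
      query_weights = {"running": 0.9, "shoes": 1.4, "women": 0.6}
      doc = "women running shoes on sale lightweight shoes".split()
      print(weighted_bm25(query_weights, doc, len(doc), avg_doc_len=8.0,
                          df={"running": 120, "shoes": 300, "women": 500},
                          n_docs=10000))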
    Search and Discovery in Personal Email Collections (Tutorial Proposal)
    Proceedings of the 15th ACM International Conference on Web Search and Data Mining (2022), 1617–1619
    Preview abstract Email has been an essential communication medium for many years. As a result, the information accumulated in our mailboxes has become valuable for all of our personal and professional activities. For years, researchers have developed interfaces, models, and algorithms to facilitate email search, discovery, and organization. This tutorial brings together these diverse research directions and provides both a historical background, as well as a high-level overview of the recent advances in the field. In particular, we lay out all of the components needed in the design of email search engines, including user interfaces, indexing, document and query understanding, retrieval, ranking, evaluation, and data privacy. The tutorial also goes beyond search, presenting recent work on intelligent task assistance in email and a number of interesting future directions. View details
    Scale Calibration of Deep Ranking Models
    28TH ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD) (2022), pp. 4300-4309
    Preview abstract Learning-to-Rank (LTR) systems are ubiquitous in web applications nowadays. The existing literature mainly focuses on improving ranking performance by trying to generate the optimal order of candidate items. However, virtually all advanced ranking functions are not scale calibrated. For example, rankers have the freedom to add a constant to all item scores without changing their relative order. This property has resulted in several limitations in deploying advanced ranking methods in practice. On the one hand, it limits the use of effective ranking functions in important applications. For example, in ads ranking, predicted Click-Through Rate (pCTR) is used for ranking and is required to be calibrated for the downstream ads auction. This is a major reason that existing ads ranking methods use scale calibrated pointwise loss functions that may sacrifice ranking performance. On the other hand, popular ranking losses are translation-invariant. We rigorously show that, both theoretically and empirically, this property leads to training instability that may cause severe practical issues. In this paper, we study how to perform scale calibration of deep ranking models to address the above concerns. We design three different formulations to calibrate ranking models through calibrated ranking losses. Unlike existing post-processing methods, our calibration is performed during training, which can resolve the training instability issue without any additional processing. We conduct experiments on the standard LTR benchmark datasets and one of the largest sponsored search ads dataset from Google. Our results show that our proposed calibrated ranking losses can achieve nearly optimal results in terms of both ranking quality and score scale calibration. View details
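    The translation-invariance issue mentioned above is easy to see numerically. The snippet below uses the listwise softmax cross-entropy as a representative ranking loss (assumed here for illustration; the paper studies several losses) and shows that shifting every score by a constant leaves the loss unchanged, which is exactly why ranking quality alone does not pin down a calibrated score scale.

      import numpy as np

      def softmax_xent_ranking_loss(scores, labels):
          log_softmax = scores - np.log(np.sum(np.exp(scores)))
          return -np.sum(labels * log_softmax)

      scores = np.array([1.0, 0.2, -0.7])
      labels = np.array([1.0, 0.0, 0.0])

      # Adding a constant to all scores changes neither the ordering nor
      # the loss value -- the score scale is unconstrained.
      print(softmax_xent_ranking_loss(scores, labels))
      print(softmax_xent_ranking_loss(scores + 100.0, labels))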
    Preview abstract Pre-trained language model (e.g., BERT) based deep retrieval models have achieved superior performance over lexical retrieval models (e.g., BM25) in many passage retrieval tasks. However, limited work has been done to generalize a deep retrieval model to other tasks and domains. In this work, we carefully select five datasets, including two in-domain datasets and three out-of-domain datasets with different levels of domain shift, and study the generalization of a deep model in a zero-shot setting. Our findings show that the performance of a deep retrieval model deteriorates significantly when the target domain is very different from the source domain that the model was trained on. On the contrary, lexical models are more robust across domains. We thus propose a simple yet effective framework to integrate lexical and deep retrieval models. Our experiments demonstrate that these two models are complementary, even when the deep model is weaker in the out-of-domain setting. The combined model obtains an average of 20.4% relative gain over the deep retrieval model, and an average of 9.54% over the lexical model on three out-of-domain datasets. View details
    Preview abstract Multiclass classification (MCC) is a fundamental machine learning problem of classifying each instance into one of a predefined set of classes. Given an instance, an MCC model computes a score for each class, all of which are used to sort the classes. The performance of a model is usually measured by Top-K Accuracy/Error (e.g., K=1 or 5). In this paper, we do not aim to propose new neural network architectures as most recent works do, but to show that it is promising to boost MCC performance with a novel formulation through the lens of ranking. In particular, by viewing MCC as an instance-class ranking problem, we first argue that ranking metrics, such as Normalized Discounted Cumulative Gain, can be more informative than the existing Top-K metrics. We further demonstrate that the dominant neural MCC recipe can be transformed into a neural ranking pipeline. Based on such a generalization, we show that it is intuitive to leverage techniques from the rich information retrieval literature to improve MCC performance out of the box. Extensive empirical results on both text and image classification tasks with diverse datasets and backbone neural models show the value of our proposed framework. View details
    Preview abstract We explore a novel perspective of knowledge distillation (KD) for learning to rank (LTR), and introduce Self-Distilled neural Rankers (SDR), where student rankers are parameterized identically to their teachers. Unlike the existing ranking distillation work which pursues a good trade-off between performance and efficiency, SDR is able to significantly improve ranking performance of students over the teacher rankers without increasing model capacity. The key success factors of SDR, which differs from common distillation techniques for classification are: (1) an appropriate teacher score transformation function, and (2) a novel listwise distillation framework. Both techniques are specifically designed for ranking problems and are rarely studied in the existing knowledge distillation literature. Building upon the state-of-the-art neural ranking structure, SDR is able to push the limits of neural ranking performance above a recent rigorous benchmark study and significantly outperforms traditionally strong gradient boosted decision tree based models on 7 out of 9 key metrics, the first time in the literature. In addition to the strong empirical results, we give theoretical explanations on why listwise distillation is effective for neural rankers, and provide ablation studies to verify the necessity of the key factors in the SDR framework. View details
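    The abstract names a teacher score transformation and a listwise distillation framework without giving formulas. The sketch below shows the generic listwise idea -- matching the student's per-query distribution over documents to the teacher's -- with the transformation reduced to a simple temperature; this is an assumption for illustration, not the SDR recipe itself.

      import numpy as np

      def softmax(x):
          x = x - np.max(x)
          e = np.exp(x)
          return e / np.sum(e)

      def listwise_distillation_loss(student_scores, teacher_scores, temperature=2.0):
          # Cross-entropy between teacher and student listwise distributions
          # over the documents retrieved for one query.
          teacher_dist = softmax(np.asarray(teacher_scores) / temperature)
          student_log_dist = np.log(softmax(np.asarray(student_scores)))
          return -np.sum(teacher_dist * student_log_dist)

      teacher = [3.1, 0.4, 1.2]   # scores from the identically parameterized teacher
      student = [2.0, 0.1, 0.9]
      print(listwise_distillation_loss(student, teacher))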
    Rax: Composable Learning-to-Rank using JAX
    Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (2022), 3051–3060
    Preview abstract Rax is a library for composable Learning-to-Rank (LTR) written entirely in JAX. The goal of Rax is to facilitate easy prototyping of LTR systems by leveraging the flexibility and simplicity of JAX. Rax provides a diverse set of popular ranking metrics and losses that integrate well with the rest of the JAX ecosystem. Furthermore, Rax implements a system of ranking-specific function transformations which allows fine-grained customization of ranking losses and metrics. Most notably Rax provides approx_t12n: a function transformation (t12n) that can transform any of our ranking metrics into an approximate and differentiable form that can be optimized. This provides a systematic way to directly optimize neural ranking models for ranking metrics that are not easily optimizable in other libraries. We empirically demonstrate the effectiveness of Rax by benchmarking neural models implemented using Flax and trained using Rax on two popular LTR benchmarks: WEB30K and Istella. Furthermore, we show that integrating ranking losses with T5, a large language model, can improve overall ranking performance on the MS MARCO passage ranking task. We are sharing the Rax library with the open source community as part of the larger JAX ecosystem at https://github.com/google/rax. View details
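    A minimal usage example, adapted from the library's public repository (https://github.com/google/rax); exact function signatures may differ across releases, so treat this as a sketch rather than reference documentation.

      import jax.numpy as jnp
      import rax

      scores = jnp.asarray([2.2, 1.3, 5.4])   # model scores for one query
      labels = jnp.asarray([1.0, 0.0, 0.0])   # graded relevance labels

      loss = rax.softmax_loss(scores, labels)    # a standard listwise loss
      metric = rax.ndcg_metric(scores, labels)   # a standard ranking metric

      # approx_t12n transforms the (non-differentiable) metric into an
      # approximate, differentiable loss that can be optimized directly.
      approx_ndcg_loss_fn = rax.approx_t12n(rax.ndcg_metric)
      approx_loss = approx_ndcg_loss_fn(scores, labels)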
    Revisiting two tower models for unbiased learning to rank
    Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval (2022), 2410–2414
    Preview abstract The two-tower architecture (with one tower to factorize out position-related bias) has now become a common technique in neural network ranking models for Unbiased Learning To Rank (ULTR). In these models, a neural network tower taking in all position-related features is designed to model the biases, which are equivalent to the propensity scores used to define the unbiased ranking metrics. It works based on the assumptions that the user interaction (click) is conditioned on the user's observation of a ranked item, and that only the observation probability depends on the position. So if we factorize out the observation probability, we can then rank the items without bias by their click rate conditioned on observation. The assumption appears sensible, and the additive two-tower models based on it have been widely implemented in ULTR. However, two-tower models may not always work, and sometimes work even worse than the biased models, as users may not always follow the same pattern. In this work, we stick to the plausible assumption about the user interaction, but we also consider the spectrum of different user behaviors. In this case, the position-related observation probability may not be explicitly factorizable. We also study generic methods to treat this complexity and show that these methods can outperform the simple additive debiasing models in offline experiments. View details
    On Optimizing Top-K Metrics for Neural Ranking Models
    Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval (2022), 2303–2307
    Preview abstract Top-K metrics such as NDCG@K are frequently used to evaluate ranking performance. The traditional tree-based models such as LambdaMART, which are based on Gradient Boosted Decision Trees (GBDT), are designed to optimize NDCG@K using the LambdaRank losses. Recently, there is a good amount of research interest on neural ranking models for learning-to-rank tasks. These models are fundamentally different from the decision tree models and behave differently with respect to different loss functions. For example, the most popular ranking losses used in neural models are the Softmax loss and the GumbelApproxNDCG loss. These losses do not connect to top-K metrics such as NDCG@K naturally. It remains a question on how to effectively optimize NDCG@K for neural ranking models. In this paper, we follow the LambdaLoss framework and design novel and theoretically sound losses for NDCG@K metrics, while the original LambdaLoss paper can only do so using an unsound heuristic. We study the new losses on the LETOR benchmark datasets and show that the new losses work better than other losses for neural ranking models. View details
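    For reference, the target metric can be computed as follows; this uses the common exponential-gain, log-discount formulation of NDCG@K, sketched in plain numpy.

      import numpy as np

      def dcg_at_k(relevances, k):
          rel = np.asarray(relevances, dtype=float)[:k]
          discounts = np.log2(np.arange(2, rel.size + 2))
          return np.sum((2.0 ** rel - 1.0) / discounts)

      def ndcg_at_k(scores, relevances, k):
          # Sort documents by predicted score, compute DCG over the top K,
          # and normalize by the DCG of the ideal (label-sorted) ordering.
          order = np.argsort(scores)[::-1]
          ranked_rel = np.asarray(relevances, dtype=float)[order]
          ideal = dcg_at_k(np.sort(relevances)[::-1], k)
          return dcg_at_k(ranked_rel, k) / ideal if ideal > 0 else 0.0

      print(ndcg_at_k(scores=[0.3, 2.1, -0.5, 1.0], relevances=[2, 0, 1, 3], k=2))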
    Preview abstract Extracting structured information from templatic documents is an important problem with the potential to automate many real-world business workflows such as payment, procurement, and payroll. The core challenge is that such documents can be laid out in virtually infinitely different ways. A good solution to this problem is one that generalizes well not only to known templates such as invoices from a known vendor, but also to unseen ones. We developed a system called Glean to tackle this problem. Given a target schema for a document type and some labeled documents of that type, Glean uses machine learning to automatically extract structured information from other documents of that type. In this paper, we describe the overall architecture of Glean, and discuss three key data management challenges : 1) managing the quality of ground truth data, 2) generating training data for the machine learning model using labeled documents, and 3) building tools that help a developer rapidly build and improve a model for a given document type. Through empirical studies on a real-world dataset, we show that these data management techniques allow us to train a model that is over 5 F1 points better than the exact same model architecture without the techniques we describe. We argue that for such information-extraction problems, designing abstractions that carefully manage the training data is at least as important as choosing a good model architecture. View details
    Ensemble Distillation for BERT-Based Ranking Models
    Shuguang Han
    Proceedings of the 2021 ACM SIGIR International Conference on the Theory of Information Retrieval (ICTIR ’21)
    Preview abstract Over the past two years, large pretrained language models such as BERT have been applied to text ranking problems and showed superior performance on multiple public benchmark data sets. Prior work demonstrated that an ensemble of multiple BERT-based ranking models can not only boost the performance, but also reduce the performance variance. However, an ensemble of models is more costly because it needs computing resource and/or inference time proportional to the number of models. In this paper, we study how to retain the performance of an ensemble of models at the inference cost of a single model by distilling the ensemble into a single BERT-based student ranking model. Specifically, we study different designs of teacher labels, various distillation strategies, as well as multiple distillation losses tailored for ranking problems. We conduct experiments on the MS MARCO passage ranking and the TREC-COVID data set. Our results show that even with these simple distillation techniques, the distilled model can effectively retain the performance gain of the ensemble of multiple models. More interestingly, the performances of distilled models are also more stable than models fine-tuned on original labeled data. The results reveal a promising direction to capitalize on the gains achieved by an ensemble of BERT-based ranking models. View details
    Preview abstract The content on the web is in a constant state of flux. New entities, issues, and ideas continuously emerge, while the semantics of the existing conversation topics gradually shift. In recent years, pretrained language models like BERT greatly improved the state-of-the-art for a large spectrum of content understanding tasks. Therefore, in this paper, we aim to study how these language models can be adapted to better handle continuously evolving web content. In our study, we first analyze the evolution of 2013 – 2019 Twitter data, and unequivocally confirm that a BERT model trained on past tweets would heavily deteriorate when directly applied to data from later years. Then, we investigate two possible sources of the deterioration: the semantic shift of existing tokens and the sub-optimal or failed understanding of new tokens. To this end, we both explore two different vocabulary composition methods, as well as propose three sampling methods which help in efficient incremental training for BERT-like models. Compared to a new model trained from scratch offline, our incremental training (a) reduces the training costs, (b) achieves better performance on evolving content, and (c) is suitable for online deployment. The superiority of our methods is validated using two downstream tasks. We demonstrate significant improvements when incrementally evolving the model from a particular base year, on the task of Country Hashtag Prediction, as well as on the OffensEval 2019 task. View details
    Preview abstract When experiencing an information need, users want to engage with a domain expert, but often turn to an information retrieval system, such as a search engine, instead. Classical information retrieval systems do not answer information needs directly, but instead provide references to (hopefully authoritative) answers. Successful question answering systems offer a limited corpus created on-demand by human experts, which is neither timely nor scalable. Pre-trained language models, by contrast, are capable of directly generating prose that may be responsive to an information need, but at present they are dilettantes rather than domain experts -- they do not have a true understanding of the world, they are prone to hallucinating, and crucially they are incapable of justifying their utterances by referring to supporting documents in the corpus they were trained over. This paper examines how ideas from classical information retrieval and pre-trained language models can be synthesized and evolved into systems that truly deliver on the promise of domain expert advice. View details
    WIT: Wikipedia-based Image Text Dataset for Multimodal Multilingual Machine Learning
    Jiecao Chen
    Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '21) (2021)
    Preview abstract The milestone improvements brought about by deep representation learning and pre-training techniques have led to large performance gains across downstream NLP, IR and Vision tasks. Multimodal modeling techniques aim to leverage high-quality visio-linguistic datasets for learning complementary information (across image and text modalities). In this paper, we introduce the Wikipedia-based Image Text (WIT) Dataset to better facilitate multimodal, multilingual learning. WIT is composed of 11 million+ unique images with over 37 million entity rich text descriptions associated with these images in Wikipedia from over 100 languages. Its size enables WIT to be used as a pretraining dataset for multimodal models, as we show when applied to downstream tasks such as image-text retrieval. WIT has four main and unique advantages. First, WIT is the largest multimodal dataset (at the time of writing). Second, it is massively multilingual (first of its kind) with coverage over 100+ languages (each of which has at least 10K examples) and provides cross-lingual texts for many images. Third, it represents a more diverse set of concepts and real world entities relative to what previous datasets cover. Lastly, as we demonstrate empirically, WIT provides a very challenging real-world test set that empirically highlights the need for learning improvements in tasks such as Retrieval and Captioning. View details
    Preview abstract Despite the success of neural models in many major machine learning problems and recently published neural learning to rank (LTR) papers in top venues, the effectiveness of neural models on traditional LTR problems is still not widely acknowledged. We first validate the concern by showing that most recent neural LTR models are, by a large margin, inferior to the best publicly available tree-based implementation, which is sometimes ignored in recent neural LTR papers. We then investigate why existing neural LTR suffers by identifying several of its weaknesses. To that end, we propose a new neural LTR framework that mitigates these weaknesses, by borrowing ideas from several research fields. Our models are able to perform comparatively with the strong tree-based baseline, while outperforming recently published neural learning to rank methods by a large margin. Our results also serve as a benchmark for neural learning to rank models. View details
    Preview abstract We describe how we built three recommendation products from scratch at the Google Chrome Web Store, namely context-based recommendations, related extension recommendations, and personalized recommendations. Unlike most existing papers that focus on novel algorithms, this paper focuses on sharing practical experiences building large scale recommender systems under various real-world constraints, such as privacy constraints, data sparsity issues, highly skewed data distribution, and product design choices, such as user interface. We show how these constraints make standard approaches difficult to succeed in practice. We share success stories that turn very negative live metrics into very positive ones, by introducing 1) how we use interpretable neural models to bootstrap the systems, which helps identify pipeline issues and paves the way for more advanced models; 2) a new item-item based recommendation algorithm that works under highly skewed data distributions; and 3) how two products can help bootstrap the third one, which significantly reduces development cycles and bypasses various real-world difficulties. All the explorations in this work are verified in live traffic on millions of users. We believe the findings in this work can help practitioners to bootstrap and build large-scale recommender systems. View details
    Preview abstract Automating information extraction from form-like documents at scale is a pressing need due to its potential impact on automating business workflows across many industries like financial services, insurance, and healthcare. The key challenge is that form-like documents in these business workflows can be laid out in virtually infinitely many ways; hence, a good solution to this problem should generalize to documents with unseen layouts and languages. A solution to this problem requires a holistic understanding of both the textual segments and the visual cues within a document, which is non-trivial. While the natural language processing and computer vision communities are starting to tackle this problem, there has not been much focus on (1) data-efficiency, and (2) ability to generalize across different document types and languages. In this paper, we show that when we have only a small number of labeled documents for training (~50), a straightforward transfer learning approach from a considerably structurally-different larger labeled corpus yields up to a 27 F1 point improvement over simply training on the small corpus in the target domain. We improve on this with a simple multi-domain transfer learning approach, that is currently in production use, and show that this yields up to a further 8 F1 point improvement. We make the case that data efficiency is critical to enable information extraction systems to scale to handle hundreds of different document-types, and learning good representations is critical to accomplishing this. View details
    Natural Language Understanding with Privacy-Preserving BERT
    Proceedings of the 30th ACM International Conference on Information and Knowledge Management, ACM (2021)
    Preview abstract Privacy preservation remains a key challenge in data mining and Natural Language Understanding (NLU). Previous research shows that the input text or even text embeddings can leak private information. This concern motivates our research on effective privacy preservation approaches for pretrained Language Models (LMs). We investigate the privacy and utility implications of applying dχ-privacy, a variant of Local Differential Privacy, to BERT fine-tuning in NLU applications. More importantly, we further propose privacy-adaptive LM pretraining methods and show that our approach can boost the utility of BERT dramatically while retaining the same level of privacy protection. We also quantify the level of privacy preservation and provide guidance on privacy configuration. Our experiments and findings lay the groundwork for future explorations of privacy-preserving NLU with pretrained LMs. View details
    Improving Cloud Storage Search with User Activity
    Proceedings of the 14th International Conference on Web Search and Data Mining (WSDM '21), ACM (2021)
    Preview abstract Cloud-based file storage platforms such as Google Drive are widely used as a means for storing, editing and sharing personal and organizational documents. In this paper, we improve search ranking quality for cloud storage platforms by utilizing user activity logs. Different from search logs, activity logs capture general document usage activity beyond search, such as opening, editing and sharing documents. We propose to automatically learn text embeddings that are effective for search ranking from activity logs. We develop a novel co-access signal, i.e., whether two documents were accessed by a user around the same time, to train deep semantic matching models that are useful for improving the search ranking quality. We confirm that activity-trained semantic matching models can improve ranking by conducting extensive offline experimentation using Google Drive search and activity logs. To the best of our knowledge, this is the first work to examine the benefits of leveraging document usage activity at large scale for cloud storage search; as such it can shed light on using such activity in scenarios where direct collection of search-specific interactions (e.g., query and click logs) may be expensive or infeasible. View details
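    The co-access signal lends itself to a very simple extraction step. The sketch below counts, for every document pair, how many users accessed both documents within a time window; the log schema, field names, and window size are hypothetical, chosen only to illustrate the idea.

      from collections import defaultdict
      from itertools import combinations

      # Hypothetical activity log entries: (user_id, doc_id, timestamp_seconds).
      activity_log = [
          ("u1", "docA", 1000), ("u1", "docB", 1200), ("u1", "docC", 9000),
          ("u2", "docA", 1100), ("u2", "docB", 1150),
      ]

      def co_access_pairs(log, window=3600):
          # A pair (d1, d2) is co-accessed by a user if that user touched
          # both documents within `window` seconds; counts across users can
          # serve as weak relatedness labels for training a matching model.
          by_user = defaultdict(list)
          for user, doc, ts in log:
              by_user[user].append((ts, doc))
          counts = defaultdict(int)
          for events in by_user.values():
              events.sort()
              for (t1, d1), (t2, d2) in combinations(events, 2):
                  if d1 != d2 and abs(t2 - t1) <= window:
                      counts[tuple(sorted((d1, d2)))] += 1
          return dict(counts)

      print(co_access_pairs(activity_log))   # {('docA', 'docB'): 2}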
    Scalable Hierarchical Agglomerative Clustering
    Nick Monath
    Avinava Dubey
    Guru Prashanth Guruganesh
    Andrew McCallum
    Gokhan Mergen
    Mert Terzihan
    Bryon Tjanaka
    Yuchen Wu
    Proceedings of the 27th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (2021), 1245–1255
    Preview abstract The applicability of agglomerative clustering, for inferring both hierarchical and flat clustering, is limited by its scalability. Existing scalable hierarchical clustering methods sacrifice quality for speed and often lead to over-merging of clusters. In this paper, we present a scalable, agglomerative method for hierarchical clustering that does not sacrifice quality and scales to billions of data points. We perform a detailed theoretical analysis, showing that under mild separability conditions our algorithm can not only recover the optimal flat partition but also provide a two-approximation to non-parametric DP-Means objective [32]. This introduces a novel application of hierarchical clustering as an approximation algorithm for the non-parametric clustering objective. We additionally relate our algorithm to the classic hierarchical agglomerative clustering method. We perform extensive empirical experiments in both hierarchical and flat clustering settings and show that our proposed approach achieves state-of-the-art results on publicly available clustering benchmarks. Finally, we demonstrate our method’s scalability by applying it to a dataset of 30 billion queries. Human evaluation of the discovered clusters show that our method finds better quality of clusters than the current state-of-the-art. View details
    Preview abstract How to leverage cross-document interactions to improve ranking performance is an important topic in information retrieval research. The recent developments in deep learning show strength in modeling complex relationships across sequences and sets. It thus motivates us to study how to leverage cross-document interactions for learning-to-rank in the deep learning framework. In this paper, we formally define the permutation equivariance requirement for a scoring function that captures cross-document interactions. We then propose a self-attention based document interaction network that extends any univariate scoring function with contextual features capturing cross-document interactions. We show that it satisfies the permutation equivariance requirement, and can generate scores for document sets of varying sizes. Our proposed methods can automatically learn to capture document interactions without any auxiliary information, and can scale across large document sets. We conduct experiments on four ranking datasets: the public benchmarks WEB30K and Istella, as well as Gmail search and Google Drive Quick Access datasets. Experimental results show that our proposed methods lead to significant quality improvements over state-of-the-art neural ranking models, and are competitive with state-of-the-art gradient boosted decision tree (GBDT) based models on the WEB30K dataset. View details
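    The permutation-equivariance property is the key requirement here, and a single shared self-attention layer satisfies it: permuting the input documents permutes the output scores identically. The toy numpy check below illustrates only that property, not the paper's full document interaction network.

      import numpy as np

      rng = np.random.default_rng(0)
      d = 8
      Wq, Wk, Wv = (rng.normal(size=(d, d)) for _ in range(3))
      w_out = rng.normal(size=d)

      def softmax(x, axis=-1):
          x = x - x.max(axis=axis, keepdims=True)
          e = np.exp(x)
          return e / e.sum(axis=axis, keepdims=True)

      def score_documents(X):
          # X: (n_docs, d) feature vectors for one query's candidate list.
          # Self-attention mixes in cross-document context; a shared linear
          # head then produces one score per document.
          Q, K, V = X @ Wq, X @ Wk, X @ Wv
          attn = softmax(Q @ K.T / np.sqrt(d))
          contextual = X + attn @ V
          return contextual @ w_out

      X = rng.normal(size=(5, d))
      perm = rng.permutation(5)
      # Permuting the inputs permutes the scores the same way (equivariance).
      print(np.allclose(score_documents(X)[perm], score_documents(X[perm])))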
    Preview abstract Consider a sequential active learning problem where, at each round, an agent selects a batch of unlabeled data points, queries their labels and updates a binary classifier. While there exists a rich body of work on active learning in this general form, in this paper, we focus on problems with two distinguishing characteristics: severe class imbalance (skew) and small amounts of training data. Both of these problems occur with surprising frequency in many web applications. For instance, detecting offensive or sensitive content in online communities (pornography, violence, and hate-speech) is receiving enormous attention from industry as well as research communities. Such problems have both the characteristics we describe -- the vast majority of content is not offensive, so the number of positive examples for such content is orders of magnitude smaller than the number of negative examples. Further, there is usually only a small amount of initial training data available when building machine-learned models to solve such problems. To address both these issues, we propose a hybrid active learning algorithm (HAL) that balances exploiting the knowledge available through the currently labeled training examples with exploring the large amount of unlabeled data available. Through simulation results, we show that HAL makes significantly better choices for what points to label when compared to strong baselines like margin-sampling. Classifiers trained on the examples selected for labeling by HAL easily outperform the baselines on target metrics (like recall at a high precision threshold and area under the precision-recall curve) given the same budget for labeling examples. We believe HAL offers a simple, intuitive, and computationally tractable way to structure active learning that can significantly amplify the impact (or alternately, reduce the cost) of human labeling for a wide range of web applications. View details
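    The abstract describes HAL as balancing exploitation (uncertainty around the current decision boundary) with exploration of the unlabeled pool, but does not give the mixing rule. The sketch below is one plausible instantiation -- part of each batch chosen by margin sampling, part uniformly at random -- offered purely as an illustration of the exploit/explore split, not as the paper's algorithm.

      import numpy as np

      def select_batch(unlabeled_probs, batch_size, explore_frac=0.5, seed=0):
          # Exploit: points whose predicted P(positive) is closest to 0.5
          # (margin sampling). Explore: uniform random picks from the rest.
          rng = np.random.default_rng(seed)
          probs = np.asarray(unlabeled_probs)
          margins = np.abs(probs - 0.5)
          n_exploit = int(batch_size * (1 - explore_frac))
          exploit = np.argsort(margins)[:n_exploit]
          rest = np.setdiff1d(np.arange(len(probs)), exploit)
          explore = rng.choice(rest, size=batch_size - n_exploit, replace=False)
          return np.concatenate([exploit, explore])

      probs = np.array([0.01, 0.48, 0.97, 0.55, 0.03, 0.02, 0.51, 0.04])
      print(select_batch(probs, batch_size=4))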
    Feature Transformation for Neural Ranking Models
    Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2020), pp. 1649-1652
    Preview abstract Although neural network models enjoy tremendous advantages in handling image and text data, tree-based models still remain competitive for learning-to-rank tasks with numerical data. A major strength of tree-based ranking models is the insensitivity to different feature scales, while neural ranking models may suffer from features with varying scales or skewed distributions. Feature transformation or normalization is a simple technique which preprocesses input features to mitigate their potential adverse impact on neural models. However, due to lack of studies, it is unclear to what extent feature transformation can benefit neural ranking models. In this paper, we aim to answer this question by providing empirical evidence for learning-to-rank tasks. First, we present a list of commonly used feature transformation techniques and perform a comparative study on multiple learning-to-rank data sets. Then we propose a mixture feature transformation mechanism which can automatically derive a mixture of basic feature transformation functions to achieve the optimal performance. Our experiments show that applying feature transformation can substantially improve the performance of neural ranking models compared to directly using the raw features. In addition, the proposed mixture transformation method can further improve the performance of the ranking model without any additional human effort. View details
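    Two of the basic transformations such a study typically compares can be sketched in a few lines; the mixture mechanism proposed in the paper (a learned combination of basic transforms) is not reproduced here.

      import numpy as np

      def symmetric_log1p(x):
          # Compresses heavy-tailed numerical ranking features.
          return np.sign(x) * np.log1p(np.abs(x))

      def standardize(x, mean, std):
          return (x - mean) / (std + 1e-12)

      raw = np.array([0.0, 3.0, 250.0, 12000.0])   # e.g. raw count features
      print(symmetric_log1p(raw))
      print(standardize(raw, raw.mean(), raw.std()))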
    Preview abstract Search engines often follow a two-phase paradigm where in the first step an initial set of documents is retrieved (the retrieval step) and in the second step the documents are ranked so as to obtain the final result list (the re-ranking step). The focus of this paper is on improving the retrieval step (measured mainly by recall) using deep neural network-based approaches. While deep neural networks were shown to improve the performance of the re-ranking step, there is little literature about using deep neural networks to improve the retrieval step. Previous works on deep neural networks for IR usually apply a simple lexical retrieval model for the retrieval step (e.g., BM25) and emphasize the re-ranking step. In this paper, we propose and study a hybrid retrieval approach, which leverages both semantic (deep neural network based) and lexical (keyword matching based, like BM25) matching techniques. The main idea is to perform semantic and lexical retrieval in parallel, and then to combine the result lists to generate the initial result set for re-ranking. An empirical evaluation, using a public TREC collection, shows that result lists generated by the semantic retrieval model often contain a substantial number of relevant documents not covered by the lexical-based lists. Further analysis of these relevant documents shows that they often also exhibit different characteristics than the lexical-based documents, attesting to the complementary nature of the two approaches. Finally, the experiments show that by combining the two result lists, the recall of the result list can increase significantly, the retrieval step can be greatly improved, and these improvements are highly robust. View details
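    The abstract leaves the list-combination step unspecified, so the sketch below uses reciprocal rank fusion, a common and deliberately simple choice, to merge a lexical result list with a semantic one into a single candidate set for re-ranking; the fusion method and example ids are assumptions, not the paper's combination scheme.

      from collections import defaultdict

      def reciprocal_rank_fusion(result_lists, k=60, top_n=10):
          # Each list contributes 1 / (k + rank) to a document's fused score.
          fused = defaultdict(float)
          for ranking in result_lists:
              for rank, doc_id in enumerate(ranking, start=1):
                  fused[doc_id] += 1.0 / (k + rank)
          return sorted(fused, key=fused.get, reverse=True)[:top_n]

      lexical = ["d3", "d1", "d7", "d2"]    # e.g. BM25 results
      semantic = ["d9", "d3", "d2", "d5"]   # e.g. dual-encoder results
      print(reciprocal_rank_fusion([lexical, semantic]))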
    A Stochastic Treatment of Learning to Rank Scoring Functions
    Sebastian Bruch
    Shuguang Han
    Proceedings of the 13th ACM International Conference on Web Search and Data Mining (WSDM 2020), pp. 61-69
    Preview abstract Learning to Rank, a central problem in information retrieval, is a class of machine learning algorithms that formulate ranking as an optimization task. The objective is to learn a function that produces an ordering of a set of objects in such a way that the utility of the entire ordered list is maximized. Learning-to-rank methods do so by constructing a function that computes a score for each object in the set. A ranked list is then compiled by sorting objects according to their scores. While such a deterministic mapping of scores to permutations makes sense during inference where stability of ranked lists is required, we argue that its greedy nature during training leads to less robust models. This is particularly problematic when the loss function under optimization---in agreement with ranking metrics---only penalizes incorrect rankings and does not take into account the distribution of raw scores. In this work, we present a stochastic framework where, instead of a deterministic derivation of permutations from raw scores, permutations are sampled from a distribution defined by raw scores. Our proposed sampling method is differentiable and works well with gradient descent optimizers. We analytically study our proposed method and demonstrate when and why it leads to model robustness. We also show empirically, through experiments on publicly available learning-to-rank datasets, that the application of our proposed method to a class of ranking loss functions leads to significant model quality improvements. View details
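    One standard way to sample permutations from a distribution defined by raw scores is to perturb each score with Gumbel noise and sort; whether this matches the paper's exact sampler is an assumption, but it illustrates the stochastic treatment described above in a few lines.

      import numpy as np

      def sample_permutation(scores, rng):
          # Gumbel perturbation + argsort draws a ranking from the
          # Plackett-Luce distribution induced by the scores.
          gumbel = -np.log(-np.log(rng.uniform(size=scores.shape)))
          return np.argsort(-(scores + gumbel))

      rng = np.random.default_rng(0)
      scores = np.array([2.0, 0.5, 1.0])
      samples = [tuple(int(i) for i in sample_permutation(scores, rng)) for _ in range(5)]
      print(samples)   # mostly (0, 2, 1); other orders appear with lower probability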
    Representation Learning for Information Extraction from Form-like Documents
    Bodhisattwa Majumder
    Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics (ACL 2020), pp. 6495-6504
    Preview abstract We propose a novel approach using representation learning for tackling the problem of extracting structured information from form-like document images. We propose an extraction system that uses knowledge of the types of the target fields to generate extraction candidates, and a neural network architecture that learns a dense representation of each candidate based on neighboring words in the document. These learned representations are not only useful in solving the extraction task for unseen document templates from two different domains, but are also interpretable, as we show using loss cases. View details
    Preview abstract This paper describes a machine learning algorithm for document (re)ranking, in which queries and documents are first encoded using BERT [1], and, on top of that, a learning-to-rank (LTR) model constructed with TF-Ranking (TFR) [2] is applied to further optimize the ranking performance. This approach is shown to be effective on the public MS MARCO benchmark [3]. Our submissions achieve the best performance for the passage re-ranking task as of March 30, 2020 [4], and the second best performance for the passage full-ranking task as of April 10, 2020 [5], demonstrating the effectiveness of combining ranking losses with BERT representations for document ranking. View details
    Adversarial Bandits Policy for Crawling Commercial Web Content
    Shuguang Han
    Przemek Gajda
    Sergey Novikov
    Alexandrin Popescul
    Proceedings of the Web Conference 2020 (WWW 2020), pp. 407-417
    Preview abstract The rapid growth of commercial web content has driven the development of shopping search services to help users find product offers. Due to the dynamic nature of commercial content, an effective recrawl policy is a key component of a shopping search service; it ensures that users have access to up-to-date product details. Most existing strategies either rely on simple heuristics or overlook resource budgets. To address this, Azar et al. [5] recently proposed an optimization strategy, LambdaCrawl, aiming to maximize content freshness within a given resource budget. In this paper, we demonstrate that the effectiveness of LambdaCrawl is governed in large part by how well the future content change rate can be estimated. By adopting state-of-the-art deep learning models for change rate prediction, we obtain a substantial increase in content freshness over the common LambdaCrawl implementation with change rate estimated from past history. Moreover, we demonstrate that while LambdaCrawl is a significant advancement upon existing recrawl strategies, it can be further improved upon by a unified multi-strategy recrawl policy. To this end, we adopt the $K$-armed adversarial bandits algorithm that can provably optimize the overall freshness by combining multiple strategies. Empirical results over a large-scale production dataset confirm its superiority to LambdaCrawl, especially under tight resource budgets. View details
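    The abstract refers to a $K$-armed adversarial bandits algorithm for combining recrawl strategies; EXP3 is the textbook instance of that family, so the sketch below shows an EXP3-style update over hypothetical strategies. It is an illustration under that assumption, not the production system.

    import numpy as np

    def exp3_step(weights, rewards_fn, rng, gamma=0.1):
        # One EXP3 round: sample a recrawl strategy, observe its reward (e.g.,
        # freshness gained per unit of crawl budget, scaled to [0, 1]), and
        # update its weight multiplicatively with an importance-weighted estimate.
        k = len(weights)
        probs = (1 - gamma) * weights / weights.sum() + gamma / k
        arm = rng.choice(k, p=probs)
        reward = rewards_fn(arm)
        estimate = reward / probs[arm]
        weights[arm] *= np.exp(gamma * estimate / k)
        return weights, arm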
    Learning to Cluster Documents into Workspaces Using Large Scale Activity Logs
    Proceedings of the 26th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD ’20), ACM (2020), 2416–2424
    Preview abstract Google Drive is widely used for managing personal and work-related documents in the cloud. To help users organize their documents in Google Drive, we developed a new feature, called workspace, that lets users create a set of working files for ongoing easy access. A workspace is a cluster of documents, but unlike a typical document cluster, it contains documents that are not only topically coherent, but are also useful in the user's ongoing tasks. To alleviate the burden of creating workspaces manually, we automatically cluster documents into suggested workspaces. We go beyond the textual similarity-based unsupervised clustering paradigm and instead directly learn from users' activity for document clustering. More specifically, we extract co-access signals (i.e., whether a user accessed two documents around the same time) to measure document relatedness. We then use a neural document similarity model that incorporates text, metadata, as well as co-access features. Since human labels are often difficult or expensive to collect, we extract weak labels based on co-access data at large scale for model training. Our offline and online experiments based on Google Drive show that (a) co-access features are very effective for document clustering; (b) our weakly supervised clustering achieves comparable or even better performance than models trained with human labels; and (c) the weakly supervised method leads to workspace suggestions that users accept more often in the production system than those of baseline approaches. View details
    Migrating a Privacy-Safe Information Extraction System to a Software 2.0 Design
    Nguyen Ha Vo
    Proceedings of the 10th Annual Conference on Innovative Data Systems Research (2020)
    Preview abstract This paper presents a case study of migrating a privacy-safe information extraction system for Gmail from a traditional rule-based architecture to a machine-learned Software 2.0 architecture. The key idea is to use the extractions from the existing rule-based system as training data to learn ML models that in turn replace all the machinery for the rule-based system. The resulting system a) delivers better precision and recall, b) is significantly smaller in terms of lines of code, c) has been easier to maintain and improve, and d) has opened up the possibility of leveraging ML advances to build a cross-language extraction system even though our original training data was only in English. We describe challenges encountered during this migration around generation and management of training data, evaluation of models, and report on many traditional ``Software 1.0'' components we built to address them. View details
    Preview abstract Pre-trained models like BERT have dominated NLP / IR applications such as single sentence classification, text pair classification, and question answering. However, deploying these models in real systems is highly non-trivial due to their exorbitant computational costs. A common remedy to this is knowledge distillation, leading to faster inference. However, as we show here, existing works are not optimized for dealing with pairs (or tuples) of texts. Consequently, they are either not scalable or demonstrate subpar performance. In this work, we propose DiPair, a novel framework for distilling fast and accurate models on text pair tasks. Coupled with an end-to-end training strategy, DiPair is both highly scalable and offers improved quality-speed tradeoffs. Empirical studies conducted on both academic and real-world e-commerce benchmarks demonstrate the efficacy of the proposed approach, with speedups of over 350x and minimal quality drop relative to the cross-attention teacher BERT model. View details
    Preview abstract Many natural language processing and information retrieval problems can be formalized as the task of semantic matching. Existing work in this area has been largely focused on matching between short texts (e.g., question answering), or between a short and a long text (e.g., ad-hoc retrieval). Semantic matching between long-form documents, which has many important applications like news recommendation, related article recommendation and document clustering, is relatively less explored and needs more research effort. In recent years, self-attention based models like Transformers and BERT have achieved state-of-the-art performance in the task of text matching. These models, however, are still limited to short text like a few sentences or one paragraph due to the quadratic computational complexity of self-attention with respect to input text length. In this paper, we address the issue by proposing the Siamese Multi-depth Transformer-based Hierarchical (SMITH) Encoder for long-form document matching. Our model contains several innovations to adapt self-attention models for longer text input. We propose a transformer based hierarchical encoder to capture the document structure information. In order to better capture sentence level semantic relations within a document, we pre-train the model with a novel masked sentence block language modeling task in addition to the masked word language modeling task used by BERT. Our experimental results on several benchmark datasets for long-form document matching show that our proposed SMITH model outperforms the previous state-of-the-art models including hierarchical attention, the multi-depth attention-based hierarchical recurrent neural network, and BERT. Compared to BERT-based baselines, our model is able to increase the maximum input text length from 512 to 2048. We will open-source a Wikipedia-based benchmark dataset, code and a pre-trained checkpoint to accelerate future research on long-form document matching. View details
    Multi-view Embedding-based Synonyms for Personal Search
    Hongbo Deng
    Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '19) (2019), pp. 575-584
    Preview abstract Synonym expansion is a technique that adds related words to search queries, which may lead to more relevant documents being retrieved, thus improving recall. There is extensive prior work on synonym expansion for web search; however, very few studies have tackled its application to email search. Synonym expansion for private corpora like emails poses several unique research challenges. First, the emails are not shared across users, which precludes us from directly employing query-document bipartite graphs, which are standard in web search synonym expansion. Second, user search queries are of a personal nature, and may not generalize well across users. Third, the size of the underlying corpora from which the synonyms may be mined is relatively small (i.e., a user's private email inbox) compared to the size of the web corpus. Therefore, in this paper, we propose a solution tailored to the challenges of synonym expansion for email search. We formulate it as a multi-view learning problem, and propose a novel embedding-based model that joins information from multiple sources to obtain the optimal synonym candidates. To demonstrate the effectiveness of the proposed technique, we evaluate our model using both explicit human ratings as well as a live experiment using the Gmail Search service, one of the world's largest email search engines. View details
    An Analysis of the Softmax Cross Entropy Loss for Learning-to-Rank with Binary Relevance
    Sebastian Bruch
    Proceedings of the 2019 ACM SIGIR International Conference on the Theory of Information Retrieval (ICTIR 2019), pp. 75-78
    Preview abstract One of the challenges of learning-to-rank for information retrieval is that ranking metrics are not smooth and as such cannot be optimized directly with gradient descent optimization methods. This gap has given rise to a large body of research that reformulates the problem to fit into existing machine learning frameworks or defines a surrogate, ranking-appropriate loss function. One such loss is ListNet's, which measures the cross entropy between a distribution over documents obtained from scores and another from ground-truth labels. This loss was designed to capture permutation probabilities and as such is considered to be only loosely related to ranking metrics. In this work, however, we show that the above statement is not entirely accurate. In fact, we establish an analytical connection between softmax cross entropy and two popular ranking metrics in a learning-to-rank setup with binary relevance labels. In particular, we show that ListNet's loss bounds Mean Reciprocal Rank as well as Normalized Discounted Cumulative Gain. Our analysis sheds light on the behavior of that loss function and explains its superior performance on binary labeled data over data with graded relevance. View details
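    For reference, ListNet's softmax cross-entropy loss for a list with scores $s_i$ and binary relevance labels $y_i \in \{0,1\}$ is commonly written as below (the notation is ours); the bounds on MRR and NDCG are the paper's contribution and are not reproduced here.

    $$\ell(\mathbf{y}, \mathbf{s}) \;=\; -\sum_{i \,:\, y_i = 1} \frac{1}{\sum_j y_j}\,\log \frac{e^{s_i}}{\sum_j e^{s_j}}$$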
    Preview abstract Most consumer email in the world is machine-generated communication from a business to a human. Understanding the underlying templates that are used to instantiate these emails is a key step to enabling a variety of intelligent experiences. In this paper, we present the first description of the template-induction problem in an online setting for a planet-scale email system. While previous work has addressed the problem of discovering these templates using an offline batch job (perhaps architected as a MapReduce), discovering these templates online has several advantages. In this paper, we present the design of an online template induction system and describe the design choices we had to make. The resulting system handles online template induction over a stream of several billion emails a day. With the new system, new incoming email can be identified as belonging to a known template within minutes of discovering the template, compared to several days' worth of delay with the previous batch approach. Further, the online system has a resource consumption footprint that is 10x smaller than the batch approach. We also report on the surprising lesson we learned that conventional stream processing systems did not present a good framework on which to build this system. We hope that the lessons from this system help designers of future stream processing systems accommodate a broader range of applications like online template induction. View details
    TF-Ranking: Scalable TensorFlow Library for Learning-to-Rank
    Sebastian Bruch
    Rohan Anil
    Stephan Wolf
    Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD) (2019), pp. 2970-2978
    Preview abstract Learning-to-Rank deals with maximizing the utility of a list of examples presented to the user, with items of higher relevance being prioritized. It has several practical applications such as large-scale search, recommender systems, document summarization and question answering. While there is widespread support for classification and regression based learning, support for learning-to-rank in deep learning has been limited. We propose TensorFlow Ranking, the first open source library for solving large-scale ranking problems in a deep learning framework. It is highly configurable and provides easy-to-use APIs to support different scoring mechanisms, loss functions and evaluation metrics in the learning-to-rank setting. Our library is developed on top of TensorFlow and can thus fully leverage the advantages of this platform. For example, it is highly scalable, both in training and in inference, and can be used to learn ranking models over massive amounts of user activity data, which can include heterogeneous dense and sparse features. We empirically demonstrate the effectiveness of our library in learning ranking functions for large-scale search and recommendation applications in Gmail and Google Drive. We also show that ranking models built using our library scale well for distributed training, without significant impact on metrics. The proposed library is available to the open source community, with the hope that it facilitates further academic research and industrial applications in the field of learning-to-rank. View details
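    As a small illustration of the kind of evaluation metric such a library exposes (this is not TF-Ranking's actual API, just a plain NumPy sketch), the snippet below computes NDCG@k from graded relevance labels ordered by model score.

    import numpy as np

    def dcg(labels, k):
        # Discounted cumulative gain of the first k labels, using the 2^rel - 1 gain.
        labels = np.asarray(labels, dtype=float)[:k]
        gains = 2.0 ** labels - 1.0
        discounts = 1.0 / np.log2(np.arange(2, labels.size + 2))
        return float((gains * discounts).sum())

    def ndcg(scores, labels, k=10):
        # Rank labels by predicted score, then normalize by the ideal ordering.
        order = np.argsort(-np.asarray(scores))
        ranked = np.asarray(labels)[order]
        ideal = np.sort(labels)[::-1]
        denom = dcg(ideal, k)
        return dcg(ranked, k) / denom if denom > 0 else 0.0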
    Preview abstract Existing unbiased learning-to-rank models use counterfactual inference, notably Inverse Propensity Scoring (IPS), to learn a ranking function from biased click data. They handle the click incompleteness bias, but usually assume that the clicks are noise-free, i.e., a clicked document is always assumed to be relevant. In this paper, we relax this unrealistic assumption and study click noise explicitly in the unbiased learning-to-rank setting. Specifically, we model the noise as the position-dependent trust bias and propose a noise-aware Position-Based Model, named TrustPBM, to better capture user click behavior. We propose an Expectation-Maximization algorithm to estimate both examination and trust bias from click data in TrustPBM. Furthermore, we show that it is difficult to use a pure IPS method to incorporate click noise and thus propose a novel method that combines a Bayes rule application with IPS for unbiased learning-to-rank. We evaluate our proposed methods on three personal search data sets and demonstrate that our proposed model can significantly outperform the existing unbiased learning-to-rank methods. View details
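    One way to write a noise-aware position-based click model of the kind described here is shown below, where $\theta_k$ is the examination probability at position $k$, $\gamma_{q,d}$ the relevance probability, and $\epsilon_k^{+}, \epsilon_k^{-}$ position-dependent trust-bias parameters; the exact parameterization used in the paper may differ.

    $$P(C=1 \mid q, d, k) \;=\; \theta_k\left(\epsilon_k^{+}\,\gamma_{q,d} \;+\; \epsilon_k^{-}\,(1-\gamma_{q,d})\right)$$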
    Uncovering Hidden Structure in Sequence Data via Threading Recurrent Models
    Daniel Silva
    Yuchen Wu
    Shibani Sanan
    Surojit Chatterjee
    Proceedings of the 12th ACM International Conference on Web Search and Data Mining (2019), pp. 186-194
    Preview abstract Long Short-Term Memory (LSTM) is one of the most powerful sequence models for user browsing history [17, 22] or natural language text [19]. Despite the strong performance, it has not gained popularity for user-facing applications, mainly owing to a large number of parameters and lack of interpretability. Recently Zaheer et al. [25] introduced Latent LSTM Allocation (LLA) to address these problems by incorporating topic models with LSTM, where the topic model maps observed words in each sequence to topics that evolve using an LSTM model. In our experiments, we found the resulting model, although powerful and interpretable, to show shortcomings when applied to sequence data that exhibit multiple modes of behavior with abrupt dynamic changes. To address this problem we introduce thLLA: a threading LLA model. thLLA has the ability to break each sequence into a set of segments and then model the dynamics in each segment using an LSTM mixture. In that way, thLLA can model abrupt changes in sequence dynamics and provides a better fit for sequence data while still being interpretable and requiring fewer parameters. In addition, thLLA uncovers hidden themes in the data via its dynamic mixture components. However, such generalization and interpretability come at the cost of a complex dependence structure, for which inference would be extremely non-trivial. To remedy this, we present an efficient sampler based on a particle MCMC method for inference that can draw from the joint posterior directly. Experimental results confirm the superiority of thLLA and the stability of the new inference algorithm on a variety of domains. View details
    Estimating Position Bias without Intrusive Interventions
    Aman Agarwal
    Ivan Zaitsev
    Thorsten Joachims
    Proceedings of the 12th ACM International Conference on Web Search and Data Mining (2019), pp. 474-482
    Preview abstract Presentation bias is one of the key challenges when learning from implicit feedback in search engines, as it confounds the relevance signal. While it was recently shown how counterfactual learning-to-rank (LTR) approaches can provably overcome presentation bias when observation propensities are known, it remains to show how to effectively estimate these propensities. In this paper, we propose the first method for producing consistent propensity estimates without manual relevance judgments, disruptive interventions, or restrictive relevance modeling assumptions. First, we show how to harvest a specific type of intervention data from historic feedback logs of multiple different ranking functions, and show that this data is sufficient for consistent propensity estimation in the position-based model. Second, we propose a new extremum estimator that makes effective use of this data. In an empirical evaluation, we find that the new estimator provides superior propensity estimates in two real-world systems -- Arxiv Full-text Search and Google Drive Search. Beyond these two points, we find that the method is robust to a wide range of settings in simulation studies. View details
    Learning Groupwise Scoring Functions Using Deep Neural Networks
    Qingyao Ai
    Proceedings of the First International Workshop On Deep Matching In Practical Applications (2019)
    Preview abstract While in a classification or a regression setting a label or a value is assigned to each individual document, in a ranking setting we determine the relevance ordering of the entire input document list. This difference leads to the notion of relative relevance between documents in ranking. The majority of the existing learning-to-rank algorithms model such relativity at the loss level using pairwise or listwise loss functions. However, they are restricted to pointwise scoring functions, i.e., the relevance score of a document is computed based on the document itself, regardless of the other documents in the list. In this paper, we overcome this limitation by proposing generalized groupwise scoring functions (GSFs), in which the relevance score of a document is determined jointly by groups of documents in the list. We learn GSFs with a deep neural network architecture, and demonstrate that several representative learning-to-rank algorithms can be modeled as special cases in our framework. We conduct evaluation using the public MSLR-WEB30K dataset, and our experiments show that GSFs lead to significant performance improvements both in a standalone deep learning architecture and when combined with a state-of-the-art tree-based learning-to-rank algorithm. View details
    Learning Groupwise Multivariate Scoring Functions Using Deep Neural Networks
    Qingyao Ai
    Sebastian Bruch
    Proceedings of the 5th ACM SIGIR International Conference on the Theory of Information Retrieval (ICTIR) (2019), pp. 85-92
    Preview abstract While in a classification or a regression setting a label or a value is assigned to each individual document, in a ranking setting we determine the relevance ordering of the entire input document list. This difference leads to the notion of relative relevance between documents in ranking. The majority of the existing learning-to-rank algorithms model such relativity at the loss level using pairwise or listwise loss functions. However, they are restricted to univariate scoring functions, i.e., the relevance score of a document is computed based on the document itself, regardless of other documents in the list. To overcome this limitation, we propose a new framework for multivariate scoring functions, in which the relevance score of a document is determined jointly by multiple documents in the list. We refer to this framework as GSFs---groupwise scoring functions. We learn GSFs with a deep neural network architecture, and demonstrate that several representative learning-to-rank algorithms can be modeled as special cases in our framework. We conduct evaluation using click logs from one of the largest commercial email search engines, as well as a public benchmark dataset. In both cases, GSFs lead to significant performance improvements, especially in the presence of sparse textual features. View details
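    A minimal sketch of a groupwise (multivariate) scoring function, under the assumption of a plain feed-forward architecture: a group of m documents is scored jointly, and each document's final score averages its scores over the groups it appears in. All names are illustrative, not the paper's implementation.

    import numpy as np

    def group_score(group_features, w1, b1, w2, b2):
        # Jointly score a group of m documents: concatenate their feature vectors,
        # pass them through a small MLP, and emit m scores at once.
        x = np.concatenate(group_features)
        h = np.maximum(x @ w1 + b1, 0.0)     # ReLU hidden layer
        return h @ w2 + b2                   # one score per document in the group

    def gsf_scores(doc_features, groups, params):
        # Average each document's score across all sampled groups containing it.
        totals = np.zeros(len(doc_features))
        counts = np.zeros(len(doc_features))
        for group in groups:                 # e.g., sliding windows of size m over the list
            scores = group_score([doc_features[i] for i in group], *params)
            for slot, doc in enumerate(group):
                totals[doc] += scores[slot]
                counts[doc] += 1
        return totals / np.maximum(counts, 1)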
    Revisiting Approximate Metric Optimization in the Age of Deep Neural Networks
    Sebastian Bruch
    Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '19) (2019), pp. 1241-1244
    Preview abstract Learning-to-Rank is a branch of supervised machine learning that seeks to produce an ordering of a list of items such that the utility of the ranked list is maximized. Unlike most machine learning techniques, however, the objective cannot be directly optimized using gradient descent methods as it is either discontinuous or flat everywhere. As such, learning-to-rank methods often optimize a loss function that either is loosely related to or upper-bounds a ranking utility instead. A notable exception is the approximation framework originally proposed by Qin et al. that facilitates a more direct approach to ranking metric optimization. We revisit that framework almost a decade later in light of recent advances in neural networks and demonstrate its superiority empirically. Through this study, we hope to show that the ideas from that work are more relevant than ever and can lay the foundation of learning-to-rank research in the age of deep neural networks. View details
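    The approximation framework referenced here replaces the hard rank with a smooth surrogate. In the usual formulation (the exact notation is an assumption on our part), the rank of item $i$ with score $s_i$ and temperature $T$ is approximated by a sum of sigmoids, which is then substituted into the metric, e.g., NDCG:

    $$\widehat{\operatorname{rank}}(i) \;=\; 1 + \sum_{j \neq i} \frac{1}{1 + e^{-(s_j - s_i)/T}}$$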
    Preview abstract Semantic text matching is one of the most important research problems in many domains, including, but not limited to, information retrieval, question answering, and recommendation. Among the different types of semantic text matching, long-document-to-long-document text matching has many applications, but has rarely been studied. Most existing approaches for semantic text matching have limited success in this setting, due to their inability to capture and distill the main ideas and topics from long-form text. In this paper, we propose a novel Siamese multi-depth attention-based hierarchical recurrent neural network (SMASH RNN) that learns the long-form semantics, and enables long-form document based semantic text matching. In addition to word information, SMASH RNN uses the document structure to improve the representation of long-form documents. Specifically, SMASH RNN synthesizes information from different document structure levels, including paragraphs, sentences, and words. An attention-based hierarchical RNN derives a representation for each document structure level. Then, the representations learned from the different levels are aggregated to learn a more comprehensive semantic representation of the entire document. For semantic text matching, a Siamese structure couples the representations of a pair of documents, and infers a probabilistic score as their similarity. We conduct an extensive empirical evaluation of SMASH RNN with three practical applications, including email attachment suggestion, related article recommendation, and citation recommendation. Experimental results on public data sets demonstrate that SMASH RNN significantly outperforms competitive baseline methods across various classification and ranking scenarios in the context of semantic matching of long-form documents. View details
    RiSER: Learning Better Representations for Richly Structured Emails
    Furkan Kocayusufoğlu
    Nguyen Ha Vo
    Proceedings of the 2019 World Wide Web Conference, pp. 886-895
    Preview abstract Recent studies show that an overwhelming majority of emails are machine-generated and sent by businesses to consumers. Many large email services are interested in extracting structured data from such emails to enable intelligent assistants. This allows experiences like being able to answer questions such as ``What is the address of my hotel in New York?'' or ``When does my flight leave?''. A high-quality email classifier is a critical piece in such a system. In this paper, we argue that the rich formatting used in business-to-consumer emails contains valuable information that can be used to learn better representations. Most existing methods focus only on textual content and ignore the rich HTML structure of emails. We introduce RiSER (Richly Structured Email Representation) -- an approach for incorporating both the structure and content of emails. RiSER projects the email into a vector representation by jointly encoding the HTML structure and the words in the email. We then use this representation to train a classifier. To our knowledge, this is the first description of a neural technique for combining formatting information along with the content to learn improved representations for richly formatted emails. Experimenting with a large corpus of emails received by users of Gmail, we show that RiSER outperforms strong attention-based LSTM baselines. We expect that these benefits will extend to other corpora with richly formatted documents. We also demonstrate with examples where leveraging HTML structure leads to better predictions. View details
    Predictive Crawling for Commercial Web Content
    Shuguang Han
    Przemek Gajda
    Sergey Novikov
    Robin Dua
    Alexandrin Popescul
    Proceedings of the 2019 World Wide Web Conference, pp. 627-637
    Preview abstract Web crawlers spend significant resources to maintain the freshness of their crawled data. This paper describes the optimization of resources to ensure that product prices shown in ads in the context of a shopping sponsored search service are synchronized with current merchant prices. We are able to use the predictability of price changes to build a machine-learned system leading to considerable resource savings for both the merchants and the crawler. We describe our solution to technical challenges due to partial observability of price history, feedback loops arising from applying machine-learned models, and offers in a cold-start state. Empirical evaluation over large-scale product crawl data demonstrates the effectiveness of our model and confirms its robustness towards unseen data. We argue that our approach is applicable in more general data pull settings. View details
    Learning with Sparse and Biased Feedback for Personal Search
    Proceedings of the 27th International Joint Conference on Artificial Intelligence (IJCAI) (2018), pp. 5219-5223
    Preview abstract Personal search, including email, on-device, and personal media search, has recently attracted considerable attention from the information retrieval community. In this paper, we provide an overview of the challenges and opportunities of learning with implicit user feedback (e.g., click data) in personal search. Implicit user feedback provides a convenient source of supervision for ranking models in personal search. This feedback, however, has two major drawbacks: it is highly sparse and biased due to the personal nature of queries and documents. We demonstrate how these drawbacks can be overcome, and empirically demonstrate the benefits of learning with implicit feedback in the context of a large-scale email search engine. View details
    Learning Effective Embeddings for Machine Generated Emails with Applications to Email Category Prediction
    Yu Sun
    Luis Garcia Pueyo
    Proceedings of the IEEE International Conference on Big Data (2018), pp. 1846-1855
    Preview abstract Machine-generated business-to-consumer (B2C) emails such as receipts, newsletters, and promotions constitute a large portion of users' inboxes today. These emails reflect the users' interests and are often sequentially correlated, e.g., users interested in relocating may receive a sequence of messages on housing, moving, job availability, etc. We aim to infer (and eventually serve) the users' future interests by predicting the categories of their future emails. There are many good methods, such as recurrent neural networks, that can be applied for such predictions, but in all cases the key to better performance is an effective representation of emails and users. To this end, we propose a general framework for embedding learning for emails and users, using as input only the sequence of B2C templates users receive and open. (A template is a B2C email stripped of all transient information related to specific users.) These learned embeddings allow us to identify both sequentially correlated emails and users with similar sequential interests. We can also use the learned embeddings either as input features or as embedding initializers for email category prediction. Extensive experiments with millions of fully anonymized B2C emails demonstrate that the learned embeddings can significantly improve the prediction accuracy for future email categories. We hope that this effective yet simple embedding learning framework will inspire new machine intelligence applications that will improve the users' email experience. View details
    Preview abstract Extracting structured data from emails can enable several assistive experiences, such as reminding the user when a bill payment is due, answering queries about the departure time of a booked flight, or proactively surfacing an emailed discount coupon while the user is at that store. This paper presents Juicer, a system for extracting information from email that is serving over a billion Gmail users daily. We describe how the design of the system was informed by three key principles: scaling to a planet-wide email service, isolating the complexity to provide a simple experience for the developer, and safeguarding the privacy of users (our team and the developers we support are not allowed to view any single email). We describe the design tradeoffs made in building this system, the challenges faced and the approaches used to tackle them. We present case studies of three extraction tasks implemented on this platform—bill reminders, commercial offers, and hotel reservations—to illustrate the effectiveness of the platform despite challenges unique to each task. Finally, we outline several areas of ongoing research in large-scale machine-learned information extraction from email. View details
    Training On-Device Ranking Models from Cross-User Interactions in a Privacy-Preserving Fashion
    Proc. of the First Biennial Conference on Design of Experimental Search & Information Retrieval Systems (DESIRES) (2018), pp. 108
    Preview abstract (See the attached PDF -- a one-page abstract for the upcoming DESIRES 2018 workshop) View details
    Position Bias Estimation for Unbiased Learning to Rank in Personal Search
    Proceedings of the 11th ACM International Conference on Web Search and Data Mining (WSDM), ACM (2018), pp. 610-618
    Preview abstract A well-known challenge in learning from click data is its inherent bias and most notably position bias. Traditional click models aim to extract the (query, document) relevance and the estimated bias is usually discarded after relevance is extracted. In contrast, the most recent work on unbiased learning-to-rank can effectively leverage the bias and thus focuses on estimating bias rather than relevance. Existing approaches use search result randomization over a small percentage of production traffic to estimate the position bias. This is not desired because result randomization can negatively impact users' search experience. In this paper, we compare different schemes for result randomization (i.e., RandTopN and RandPair) and show their negative effect in personal search. Then we study how to infer such bias from regular click data without relying on randomization. We propose a regression-based Expectation-Maximization (EM) algorithm that is based on a position bias click model and that can handle highly sparse clicks in personal search. We evaluate our EM algorithm and the extracted bias in the learning-to-rank setting. Our results show that it is promising to extract position bias from regular clicks without result randomization. The extracted bias can improve the learning-to-rank algorithms significantly. In addition, we compare the pointwise and pairwise learning-to-rank models. Our results show that pairwise models are more effective in leveraging the estimated bias. View details
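    The position-based click model underlying this approach factorizes a click into examination and relevance, where $\theta_k$ is the probability of examining position $k$ and $\gamma_{q,d}$ is the probability that document $d$ is relevant to query $q$; the regression-based EM algorithm alternates between estimating these two sets of parameters (the concrete regression step is the paper's contribution and is not reproduced here):

    $$P(C = 1 \mid q, d, k) \;=\; P(E = 1 \mid k)\,P(R = 1 \mid q, d) \;=\; \theta_k\,\gamma_{q,d}$$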
    Preview abstract Ranking functions are used to return ranked lists of items for users to interact with. How to evaluate ranking functions using historical user interaction logs, also known as off-policy evaluation, is an important but challenging problem. The commonly used Inverse Propensity Scores (IPS) approaches work better in the single-item case, but suffer from extremely low data efficiency in the ranked-list case. In this paper, we study how to improve the data efficiency of IPS approaches in the offline comparison setting. We propose two approaches, Trunc-match and Rand-interleaving, for offline comparison using uniformly randomized data. We show that these methods can improve the data efficiency and also the comparison sensitivity based on one of the largest email search engines. View details
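    For context, the standard single-item IPS estimator of a policy's value from logged interactions $(x_i, a_i, r_i)$ with logging propensities $p_i$ is shown below (the notation is ours, not the paper's); the paper's Trunc-match and Rand-interleaving variants address how such estimates can be formed efficiently for whole ranked lists.

    $$\hat{V}_{\mathrm{IPS}}(\pi) \;=\; \frac{1}{n}\sum_{i=1}^{n} \frac{\mathbb{1}\!\left[\pi(x_i) = a_i\right]}{p_i}\, r_i$$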
    The LambdaLoss Framework for Ranking Metric Optimization
    Proceedings of The 27th ACM International Conference on Information and Knowledge Management (CIKM '18), ACM (2018), pp. 1313-1322
    Preview abstract How to optimize ranking metrics such as Normalized Discounted Cumulative Gain (NDCG) is an important but challenging problem, because ranking metrics are either flat or discontinuous everywhere, which makes them hard to optimize directly. Among existing approaches, LambdaRank is a novel algorithm that incorporates ranking metrics into its learning procedure. Though empirically effective, it still lacks theoretical justification. For example, the underlying loss that LambdaRank optimizes for has remained unknown until now. Due to this, there is no principled way to advance the LambdaRank algorithm further. In this paper, we present LambdaLoss, a probabilistic framework for ranking metric optimization. We show that LambdaRank is a special configuration with a well-defined loss in the LambdaLoss framework, and thus provide theoretical justification for it. More importantly, the LambdaLoss framework allows us to define metric-driven loss functions that have a clear connection to different ranking metrics. We show a few cases in this paper and evaluate them on three publicly available data sets. Experimental results show that our metric-driven loss functions can significantly improve the state-of-the-art learning-to-rank algorithms. View details
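    For reference, LambdaRank's gradient for a pair of documents $(i, j)$ with scores $s_i, s_j$ is commonly written as below (notation follows the usual presentation, not necessarily the paper's); the paper's contribution is to exhibit a well-defined loss whose gradients take this form and to generalize it to other metric-driven losses.

    $$\lambda_{ij} \;=\; \frac{-\sigma}{1 + e^{\sigma (s_i - s_j)}}\,\left|\Delta \mathrm{NDCG}_{ij}\right|$$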
    Preview abstract TensorFlow Ranking is the first open source library for solving large-scale ranking problems in a deep learning framework. It is highly configurable and provides easy-to-use APIs to support different scoring mechanisms, loss functions and evaluation metrics in the learning-to-rank setting. Our library is developed on top of TensorFlow and can thus fully leverage the advantages of this platform. For example, it is highly scalable, both in training and in inference, and can be used to learn ranking models over massive amounts of user activity data. We empirically demonstrate the effectiveness of our library in learning ranking functions for large-scale search and recommendation applications in Gmail and Google Drive. View details
    Semantic Location in Email Query Suggestion
    John Foley
    Proceedings of the 41st International ACM SIGIR Conference on Research & Development in Information Retrieval (2018), pp. 977-980
    Preview abstract Mobile devices are pervasive, which means that users have access to web content and their personal documents at all locations, not just their home or office. Existing work has studied how locations can influence information needs, focusing on web queries. We explore whether or not location information can be helpful to users who are searching their own personal documents. We wish to study whether a user's location can predict their queries over their own personal data, so we focus on the task of query suggestion. While we find that using location directly can be helpful, it does not generalize well to novel locations. To improve this situation, we explore using semantic location: that is, rather than memorizing location-query associations, we generalize our location information to the names of the closest points of interest. By using short, semantic descriptions of locations, we find that we can more robustly improve query completion and observe that users are already using locations to extend their own queries in this domain. We present a simple but effective model that can use location to predict queries for a user even before they type anything into a search box, and which learns effectively even when not all queries have location information. View details
    Preview abstract A vast majority of the emails received by people today are machine-generated by businesses communicating with consumers. While some emails originate as a result of a transaction (e.g., hotel or restaurant reservation confirmations, online purchase receipts, shipping notifications, etc.), a large fraction are commercial emails promoting an offer (a special sale, free shipping, available for a limited time, etc.). The sheer number of these promotional emails makes it difficult for users to read all these emails and decide which ones are actually interesting and actionable. In this paper, we tackle the problem of extracting information from commercial emails promoting an offer to the user. This information enables an email platform to build several new experiences that can unlock the value in these emails without the user having to navigate and read all of them. For instance, we can highlight offers that are expiring soon, or display a notification when there's an unexpired offer from a merchant if your phone recognizes that you are at that merchant's store. A key challenge in extracting information from such commercial emails is that they are often image-rich and contain very little text. Training a machine learning (ML) model on a rendered image-rich email and applying it to each incoming email can be prohibitively expensive. In this paper, we describe a cost-effective approach for extracting signals from both the text and image content of commercial emails in the context of a free email platform that serves over a billion users around the world. The key insight is to leverage the template structure of emails, and use off-the-shelf OCR techniques to obtain the text from images to augment the existing text features offline. Compared to a text-only approach, we show that we are able to identify 9.12% more email templates corresponding to ~5% more emails being identified as offers. Interestingly, our analysis shows that this 5% improvement in coverage is across the board, irrespective of whether the emails were sent by large merchants or small local merchants, allowing us to deliver an improved experience for everyone. View details
    Quick Access: Building a Smart Experience for Google Drive
    Alexandrin Popescul
    Julian Gibbons
    Alan Green
    Michael James Smith
    Cayden Meyer
    Reuben Kan
    Proc. of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (2017), pp. 1643-1651
    Preview abstract Google Drive is a cloud storage and collaboration service used by hundreds of millions of users around the world. Quick Access is a new feature in Google Drive that surfaces the relevant documents to the user on the home page. We describe the development of a machine-learned service behind this feature. Our metrics show that this feature cuts the time it takes for users to locate their documents in half. The development of this product feature is an illustration of a number of more general challenges and constraints associated with machine learning product deployment such as dealing with private corpora and protecting user privacy, working with data services that are not designed with machine-learning in mind and may be owned and operated by different teams with different constraints, and evolving product definitions which inform the metric being optimized. We believe that the lessons learned from this experience will be useful to practitioners tackling a wide range of applied machine-learning problems. View details
    Learning from User Interactions in Personal Search via Attribute Parameterization
    Proceedings of the 10th ACM International Conference on Web Search and Data Mining (WSDM), ACM (2017), pp. 791-800
    Preview abstract User interaction data (e.g., click data) has proven to be a powerful signal for learning-to-rank models in web search. However, such models require observing multiple interactions across many users for the same query-document pair to achieve statistically meaningful gains. Therefore, utilizing user interaction data for improving search over personal, rather than public, content is a challenging problem. First, the documents (e.g., emails or private files) are not shared across users. Second, user search queries are of a personal nature (e.g., [alice's address]) and may not generalize well across users. In this paper, we propose a solution to these challenges, by projecting user queries and documents into a multi-dimensional space of fine-grained and semantically coherent attributes. We then introduce a novel parameterization technique to overcome sparsity in the multi-dimensional attribute space. Attribute parameterization enables effective usage of cross-user interactions for improving personal search quality -- which is a first such published result, to the best of our knowledge. Experiments with a dataset derived from interactions of users of one of the world's largest personal search engines demonstrate the effectiveness of the proposed attribute parameterization technique. View details
    Email Category Prediction
    Aston Zhang
    Luis Garcia Pueyo
    Companion Proc. of the 26th International World Wide Web Conference (2017), pp. 495-503
    Preview abstract According to recent estimates, about 90% of the emails received by consumers are machine-generated. Such messages include shopping receipts, promotional campaigns, newsletters, booking confirmations, etc. Most such messages are created by populating a fixed template with a small amount of personalized information, such as name, salutation, reservation numbers, dates, etc. Web mail providers (Gmail, Hotmail, Yahoo) are leveraging the structured nature of such emails to extract salient information and use it to improve the user experience: e.g. by automatically entering reservation data into a user's calendar, or by sending alerts about upcoming shipments. To facilitate these extraction tasks it is helpful to classify templates according to their category, e.g. restaurant reservations or bill reminders, since each category triggers a particular user experience. Recent research has focused on discovering the causal thread of templates, e.g. inferring that a shopping order is usually followed by a shipping confirmation, an airline booking is followed by a confirmation and then by a “ready to check in” message, etc. Gamzu et al. took this idea one step further by implementing a method to predict the template category of future emails for a given user based on previously received templates. The motivation is that predicting future emails has a wide range of potential applications, including better user experiences (e.g. warning users of items ordered but not shipped), targeted advertising (e.g. users that recently made a flight reservation may be interested in hotel reservations), and spam classification (a message that is part of a legitimate causal thread is unlikely to be spam). The gist of the Gamzu et al. approach is modeling the problem as a Markov chain, where the nodes are templates or temporal events (e.g. the first day of the month). This paper expands on their work by investigating the use of neural networks for predicting the category of emails that will arrive during a fixed-sized time window in the future. We consider two types of neural networks: multi-layer perceptrons (MLP), a type of feedforward neural network; and long short-term memory (LSTM), a type of recurrent neural network. For each type of neural network, we explore the effects of varying their configuration (e.g. number of layers or number of neurons) and hyper-parameters (e.g. drop-out ratio). We find that neural networks vastly outperform the Markov chain approach in prediction accuracy, and that LSTMs perform slightly better than MLPs. We offer some qualitative interpretation of our findings and identify some promising future directions. View details
    Learning to Rank with Selection Bias in Personal Search
    Proc. of the 39th International ACM SIGIR Conference on Research and Development in Information Retrieval, ACM (2016), pp. 115-124
    Preview abstract Click-through data has proven to be a critical resource for improving search ranking quality. Though a large amount of click data can be easily collected by search engines, various biases make it difficult to fully leverage this type of data. In the past, many click models have been proposed and successfully used to estimate the relevance for individual query-document pairs in the context of web search. These click models typically require a large quantity of clicks for each individual pair and this makes them difficult to apply in systems where click data is highly sparse due to personalized corpora and information needs, e.g., personal search. In this paper, we study the problem of how to leverage sparse click data in personal search and introduce a novel selection bias problem and address it in the learning-to-rank framework. This paper proposes a few bias estimation methods, including a novel query-dependent one that captures queries with similar results and can successfully deal with sparse data. We empirically demonstrate that learning-to-rank that accounts for query-dependent selection bias yields significant improvements in search effectiveness through online experiments with one of the world's largest personal search engines. View details
    Using Machine Learning to Improve the Email Experience
    Proc. of the 25th ACM International Conference on Information and Knowledge Management, ACM (2016), pp. 891
    Preview abstract Email is an essential communication medium for billions of people, with most users relying on web-based email services. Two recent trends are changing the email experience: smartphones have become the primary tool for accessing online services including email, and machine learning has come of age. Smartphones have a number of compelling properties (they are location-aware, usually with us, and allow us to record and share photos and videos), but they also have a few limitations, notably limited screen size and small and tedious virtual keyboards. Over the past few years, Google researchers and engineers have leveraged machine learning to ameliorate these weaknesses, and in the process created novel experiences. In this talk, I will give three examples of machine learning improving the email experience. The first example describes how we are improving email search. Displaying the most relevant results as the query is being typed is particularly useful on smartphones due to the aforementioned limitations. Combining hand-crafted and machine-learned rankers is powerful, but training learned rankers requires a relevance-labeled training set. User privacy prohibits us from employing raters to produce relevance labels. Instead, we leverage implicit feedback (namely clicks) provided by the users themselves. Using click logs as training data in a learning-to-rank setting is intriguing, since there is a vast and continuous supply of fresh training data. However, the click stream is biased towards queries that receive more clicks -- e.g. queries for which we already return the best result in the top-ranked position. I will summarize our work~\cite{Pointer} on neutralizing that bias. The second example describes how we extract key information from appointment and reservation emails and surface it at the appropriate time as a reminder on the user's smartphone. Our basic approach~\cite{Juicer} is to learn the templates that were used to generate these emails, use these templates to extract key information such as times and dates, store the extracted records in a personal information store, and surface them at the right time (taking contextual information such as estimated transit time into account). The third example describes Smart Reply~\cite{SmartReply}, a system that offers a set of three short responses to those incoming emails for which a short response is appropriate, allowing users to respond quickly with just a few taps, without typing or involving voice-to-text transcription. The basic approach is to learn a model of likely short responses to original emails from the corpus, and then to apply the model whenever a new message arrives. Other considerations include offering a set of responses that are all appropriate and yet diverse, and triggering only when sufficiently confident that each response is of high quality and appropriate. View details
    Debugging a Crowdsourced Task with Low Inter-Rater Agreement
    Omar Alonso
    Catherine C. Marshall
    Joint Conference on Digital Libraries (2015)
    Social Search
    14th International Conference on Web Engineering (2014)
    A Human-Centered Framework for Ensuring Reliability on Crowdsourced Labeling Tasks
    Omar Alonso
    Catherine C Marshall
    First AAAI Conference on Human Computation and Crowdsourcing, AAAI (2013)
    Robust query rewriting using anchor data
    Nick Craswell
    Bodo Billerbeck
    6th ACM Intl. Conference on Web Search and Data Mining (WSDM), ACM (2013), pp. 335-344
    Are Some Tweets More Interesting Than Others? #HardQuestion
    Omar Alonso
    Catherine C Marshall
    7th Annual Symposium on Human-Computer Interaction and Information Retrieval, ACM (2013)
    Boot-Strapping Language Identifiers for Short Colloquial Postings
    Moisés Goldszmidt
    Stelios Paparizos
    European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases (ECMLPKDD), Springer (2013), pp. 95-111
    Editorial
    Helen Ashman
    Arun Iyengar
    TWEB, vol. 6 (2012), pp. 5
    How user behavior is related to social affinity
    Rina Panigrahy
    Yinglian Xie
    WSDM (2012), pp. 713-722
    Detecting quilted web pages at scale
    SIGIR (2012), pp. 385-394
    Of hammers and nails: an empirical comparison of three paradigms for processing large graphs
    Alan Halverson
    Krishnaram Kenthapadi
    Sreenivas Gollapudi
    WSDM (2012), pp. 103-112
    The Power of Peers
    Nick Craswell
    ECIR (2011), pp. 497-502
    Microsoft Research at TREC 2011 Web Track
    Bodo Billerbeck
    Nick Craswell
    TREC (2011)
    Querying the Web Graph - (Invited Talk)
    SPIRE (2010), pp. 1-12
    A Sketch-Based Distance Oracle for Web-Scale Graphs
    Atish Das Sarma
    Sreenivas Gollapudi
    Rina Panigrahy
    Web Search and Data Mining (WSDM) (2010)
    Microsoft Research at TREC 2010 Web Track
    Nick Craswell
    TREC (2010)
    Web Crawling
    Christopher Olston
    Foundations and Trends in Information Retrieval, vol. 4 (2010), pp. 175-246
    Web Spam Detection
    Encyclopedia of Database Systems (2009), pp. 3520-3523
    Web Crawler Architecture
    Encyclopedia of Database Systems (2009), pp. 3462-3465
    Microsoft Research at TREC 2009: Web and Relevance Feedback Track
    Nick Craswell
    Stephen Robertson
    Emine Yilmaz
    TREC (2009)
    Web Search Relevance Ranking
    Hugo Zaragoza
    Encyclopedia of Database Systems (2009), pp. 3497-3501
    Less is more: sampling the neighborhood graph makes SALSA better and faster
    Sreenivas Gollapudi
    Rina Panigrahy
    WSDM (2009), pp. 242-251
    The scalable hyperlink store
    Hypertext (2009), pp. 89-98
    Efficient and effective link analysis with precomputed salsa maps
    Nick Craswell
    CIKM (2008), pp. 53-62
    Introduction to special section on adversarial issues in Web search
    Brian D. Davison
    TWEB, vol. 2 (2008)
    Computing Information Retrieval Performance Measures Efficiently in the Presence of Tied Scores
    Frank McSherry
    ECIR (2008), pp. 414-421
    Hits on the web: how does it compare?
    Hugo Zaragoza
    Michael J. Taylor
    SIGIR (2007), pp. 471-478
    Using Bloom Filters to Speed Up HITS-Like Ranking Algorithms
    Sreenivas Gollapudi
    Rina Panigrahy
    WAW (2007), pp. 195-201
    Comparing the effectiveness of hits and salsa
    CIKM (2007), pp. 157-164
    Adversarial information retrieval on the web (AIRWeb 2006)
    Brian D. Davison
    Tim Converse
    SIGIR Forum, vol. 40 (2006), pp. 27-30
    Detecting spam web pages through content analysis
    Alexandros Ntoulas
    Mark Manasse
    WWW (2006), pp. 83-92
    How search engines shape the web
    Byron Dom
    Krishna Bharat
    Jan O. Pedersen
    Yoshinobu Tonomura
    WWW (Special interest tracks and posters) (2005), pp. 879
    Detecting phrase-level duplication on the world wide web
    Mark Manasse
    SIGIR (2005), pp. 170-177
    Boxwood: Abstractions as the Foundation for Storage Infrastructure
    John MacCormick
    Nick Murphy
    Chandramohan A. Thekkath
    Lidong Zhou
    OSDI (2004), pp. 105-120
    On The Evolution of Clusters of Near-Duplicate Web Pages
    Mark Manasse
    J. Web Eng., vol. 2 (2004), pp. 228-246
    Spam, Damn Spam, and Statistics: Using Statistical Analysis to Locate Spam Web Pages
    Mark Manasse
    WebDB (2004), pp. 1-6
    A large-scale study of the evolution of Web pages
    Mark Manasse
    Janet L. Wiener
    Softw., Pract. Exper., vol. 34 (2004), pp. 213-237
    A large-scale study of the evolution of web pages
    Mark Manasse
    Janet L. Wiener
    WWW (2003), pp. 669-678
    Efficient URL caching for world wide web crawling
    Janet L. Wiener
    WWW (2003), pp. 679-689
    On the Evolution of Clusters of Near-Duplicate Web Pages
    Mark Manasse
    LA-WEB (2003), pp. 37-45
    Breadth-first crawling yields high-quality pages
    Janet L. Wiener
    WWW (2001), pp. 114-118
    Web-based Algorithm Animation
    DAC (2001), pp. 506-511
    Performance limitations of the Java core libraries
    Allan Heydon
    Concurrency - Practice and Experience, vol. 12 (2000), pp. 363-373
    Performance Limitations of the Java Core Libraries
    Allan Heydon
    Java Grande (1999), pp. 35-41
    Mercator: A Scalable, Extensible Web Crawler
    Allan Heydon
    World Wide Web, vol. 2 (1999), pp. 219-229
    Distributed Applets
    Marc H. Brown
    CHI Extended Abstracts (1997), pp. 204-205
    Collaborative Active Textbooks
    Marc H. Brown
    J. Vis. Lang. Comput., vol. 8 (1997), pp. 453-486
    A Java-Based Implementation of Collaborative Active Textbooks
    Marc H. Brown
    Roope Raisamo
    VL (1997), pp. 376-383
    Programming in Three Dimensions
    J. Vis. Lang. Comput., vol. 7 (1996), pp. 219-242
    Distributed Active Objects
    Marc H. Brown
    Computer Networks, vol. 28 (1996), pp. 1037-1052
    Collaborative Active Textbooks: A Web-Based Algorithm Animation System for an Electronic Classroom
    Marc H. Brown
    VL (1996), pp. 266-275
    Obliq-3D: A High-Level, Fast-Turnaround 3D Animation System
    Marc H. Brown
    IEEE Trans. Vis. Comput. Graph., vol. 1 (1995), pp. 175-193
    A Library for Visualizing Combinatorial Structures
    Marc H. Brown
    IEEE Visualization (1994), pp. 164-171
    Cube: Eine dreidimensionale visuelle Programmiersprache (Cube: A Three-Dimensional Visual Programming Language)
    Simon M. Kaplan
    GI Jahrestagung (1993), pp. 340-345
    Algorithm Animation Using 3D Interactive Graphics
    Marc H. Brown
    ACM Symposium on User Interface Software and Technology (1993), pp. 93-100
    Specifying Visual Languages with Conditional Set Rewrite Systems
    Simon M. Kaplan
    VL (1993), pp. 12-18
    A Prototype Implementation of the Cube Language
    Simon M. Kaplan
    VL (1992), pp. 270-272
    The CUBE Language
    Simon M. Kaplan
    VL (1991), pp. 218-224
    Enhancing Show-and-Tell with a polymorphic type system and higher-order functions
    Eric J. Golin
    VL (1990), pp. 215-220
    Roles and their role in posing recursive queries
    Sharon Kuck
    Roland John
    Arnd Lewe
    Inf. Syst., vol. 15 (1990), pp. 173-186