Jump to Content
Michael Bendersky

Michael Bendersky

For a full list of pre-Google publication see my personal page .
Authored Publications
Google Publications
Other Publications
Sort By
  • Title
  • Title, desc
  • Year
  • Year, desc
    Preview abstract Facilitated by large language models (LLMs), personalized text generation has become a rapidly growing research direction. Most existing studies focus on designing specialized models for a particular domain, or they require fine-tuning the LLMs to generate personalized text. We consider a typical scenario in which the large language model, which generates personalized output, is frozen and can only be accessed through APIs. Under this constraint, all one can do is to improve the input text (i.e., text prompts) sent to the LLM, a procedure that is usually done manually. In this paper, we propose a novel method to automatically revise prompts for personalized text generation. The proposed method takes the initial prompts generated by a state-of-the-art, multistage framework for personalized generation and rewrites a few critical components that summarize and synthesize the personal context. The prompt rewriter employs a training paradigm that chains together supervised learning (SL) and reinforcement learning (RL), where SL reduces the search space of RL and RL facilitates end-to-end training of the rewriter. Using datasets from three representative domains, we demonstrate that the rewritten prompts outperform both the original prompts and the prompts optimized via supervised learning or reinforcement learning alone. In-depth analysis of the rewritten prompts shows that they are not only human readable, but also able to guide manual revision of prompts when there is limited resource to employ reinforcement learning to train the prompt rewriter, or when it is costly to deploy an automatic prompt rewriter for inference. View details
    End-to-End Query Term Weighting
    Karan Samel
    Swaraj Khadanga
    Wensong Xu
    Xingyu Wang
    Kashyap Kolipaka
    Proceedings of the 29th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '23) (2023)
    Preview abstract Bag-of-words based lexical retrieval systems are still the most commonly used methods for real-world search applications. Recently deep learning methods have shown promising results to improve this retrieval performance but are expensive to run in an online fashion, non-trivial to integrate into existing production systems, and might not generalize well in out-of-domain retrieval scenarios. Instead, we build on top of lexical retrievers by proposing a Term Weighting BERT (TW-BERT) model. TW-BERT learns to predict the weight for individual n-gram (e.g., uni-grams and bi-grams) query input terms. These inferred weights and terms can be used directly by a retrieval system to perform a query search. To optimize these term weights, TW-BERT incorporates the scoring function used by the search engine, such as BM25, to score query-document pairs. Given sample query-document pairs we can compute a ranking loss over these matching scores, optimizing the learned query term weights in an end-to-end fashion. Aligning TW-BERT with search engine scorers minimizes the changes needed to integrate it into existing production applications, whereas existing deep learning based search methods would require further infrastructure optimization and hardware requirements. The learned weights can be easily utilized by standard lexical retrievers and by other retrieval techniques such as query expansion. We show that TW-BERT improves retrieval performance over strong term weighting baselines within MSMARCO and in out-of-domain retrieval on TREC datasets. View details
    Preview abstract Automatic headline generation enables users to comprehend ongoing news events promptly and has recently become an important task in web mining and natural language processing. With the growing need for news headline generation, we argue that the hallucination issue, namely the generated headlines being not supported by the original news stories, is a critical challenge for the deployment of this feature in web-scale systems Meanwhile, due to the infrequency of hallucination cases and the requirement of careful reading for raters to reach the correct consensus, it is difficult to acquire a large dataset for training a model to detect such hallucinations through human curation. In this work, we present a new framework named ExHalder to address this challenge for headline hallucination detection. ExHalder adapts the knowledge from public natural language inference datasets into the news domain and learns to generate natural language sentences to explain the hallucination detection results. To evaluate the model performance, we carefully collect a dataset with more than six thousand labeled "article, headline" pairs. Extensive experiments on this dataset and another six public ones demonstrate that ExHalder can identify hallucinated headlines accurately and justifies its predictions with human-readable natural language explanations. View details
    Preview abstract Ranking is at the core of Information Retrieval. Classic ranking optimization studies often treat ranking as a sorting problem with the assumption that the best performance of ranking would be achieved if we rank items according to their individual utility. Accordingly, considerable ranking metrics have been developed and learning-to-rank algorithms that have been designed to optimize these simple performance metrics have been widely used in modern IR systems. As applications evolve, however, people's need for information retrieval have shifted from simply retrieving relevant documents to more advanced information services that satisfy their complex working and entertainment needs. Thus, more complicated and user-centric objectives such as user satisfaction and engagement have been adopted to evaluate modern IR systems today. Those objectives, unfortunately, are difficult to be optimized under existing learning-to-rank frameworks as they are subject to great variance and complicated structures that cannot be explicitly explained or formulated with math equations like those simple performance metrics. This leads to the following research question -- how to optimize result ranking for complex ranking metrics without knowing their internal structures? To address this question, we conduct formal analysis on the limitation of existing ranking optimization techniques and describe three research tasks in Metric-agnostic Ranking Optimization: (1) develop surrogate metric models to simulate complex online ranking metrics on offline data; (2) develop differentiable ranking optimization frameworks for list or session level performance metrics without fine-grained supervision signals; and (3) develop efficient parameter exploration and exploitation techniques for ranking optimization in metric-agnostic scenarios. Through the discussion of potential solutions to these tasks, we hope to encourage more people to look into the problem of ranking optimization in complex search and recommendation scenarios. View details
    Regression Compatible Listwise Objectives for Calibrated Ranking with Binary Relevance
    Pratyush Kar
    Bing-Rong Lin
    Proceedings of the 32nd ACM International Conference on Information and Knowledge Management (2023)
    Preview abstract As Learning-to-Rank (LTR) approaches primarily seek to improve ranking quality, their output scores are not scale-calibrated by design. This fundamentally limits LTR usage in score-sensitive applications. Though a simple multi-objective approach that combines a regression and a ranking objective can effectively learn scale-calibrated scores, we argue that the two objectives are not necessarily compatible, which makes the trade-off less ideal for either of them. In this paper, we propose a practical regression compatible ranking (RCR) approach that achieves a better trade-off, where the two ranking and regression components are proved to be mutually aligned. Although the same idea applies to ranking with both binary and graded relevance, we mainly focus on binary labels in this paper. We evaluate the proposed approach on several public LTR benchmarks and show that it consistently achieves either best or competitive result in terms of both regression and ranking metrics, and significantly improves the Pareto frontiers in the context of multi-objective optimization. Furthermore, we evaluated the proposed approach on YouTube Search and found that it not only improved the ranking quality of the production pCTR model, but also brought gains to the click prediction accuracy. The proposed approach has been successfully deployed in the YouTube production system. View details
    Learning List-Level Domain-Invariant Representations for Ranking
    Ruicheng Xian
    Hamed Zamani
    Han Zhao
    37th Conference on Neural Information Processing Systems (NeurIPS 2023)
    Preview abstract Domain adaptation aims to transfer the knowledge learned on (data-rich) source domains to (low-resource) target domains, and a popular method is invariant representation learning, which matches and aligns the data distributions on the feature space. Although this method is studied extensively and applied on classification and regression problems, its adoption on ranking problems is sporadic, and the few existing implementations lack theoretical justifications. This paper revisits invariant representation learning for ranking. Upon reviewing prior work, we found that they implement what we call item-level alignment, which aligns the distributions of the items being ranked from all lists in aggregate but ignores their list structure. However, the list structure should be leveraged, because it is intrinsic to ranking problems where the data and the metrics are defined and computed on lists, not the items by themselves. To close this discrepancy, we propose list-level alignment—learning domain-invariant representations at the higher level of lists. The benefits are twofold: it leads to the first domain adaptation generalization bound for ranking, in turn providing theoretical support for the proposed method, and it achieves better empirical transfer performance for unsupervised domain adaptation on ranking tasks, including passage reranking. View details
    Job Type Extraction for Service Businesses
    Yaping Qi
    Hayk Zakaryan
    Yonghua Wu
    Companion Proceedings of the ACM Web Conference 2023
    Preview abstract Google My Business (GMB) is a platform that allows business owners to manage their business profiles, which will be displayed when a user issues a relevant query on Google Search or Maps. Many GMB businesses provide diverse services from home cleaning and plumbing to legal services and education. However the exact service content, which we call job types, is often missing in their profiles. This leaves the burden of finding such content to the users, either by the tedious work of scanning through business websites or time-consuming calling of the owners. In the present paper, we describe how we build a pipeline to automatically extract the job types from websites of business owners and how we solve scalability issues for deployment. Rather than focusing on developing novel and sophisticated machine learning models, we share various challenges we have faced and practical experiences of building such a pipeline, including the cold start problem of dataset collection with limited human annotation resource, scalability, reaching a launch bar of high precision, and building a general pipeline with reasonable coverage of any free-text web pages without relying on the Document Object Model (DOM) structure. With these challenges, standard approaches for information extraction do not directly apply or are not scalable to be served. In this paper, we show how we address these challenges in different stages of the extraction pipeline, including: (1) utilizing structured content like tables and lists to tackle the cold start problem of dataset collection; (2) exploitation of various context information to improve model performance without hurting scalability; and (3) formulating the extraction problem as a retrieval task to improve generalizability, efficiency as well as coverage. The pipeline has been successfully deployed, and is scalable enough to be refreshed every few days to extract the latest online information. The extracted job types are serving millions of users of Google Search and Google Maps with at least three use cases: (1) job types of a place are directly displayed on mobile devices; (2) job types provide explanation as to why a place shows up given a query; (3) job types are used as a signal to rank business places. According to a user survey, the displayed job types has greatly enhanced the probability of a user hiring a service provider. View details
    Preview abstract Query-document relevance prediction is a critical problem in Information Retrieval systems. This problem has increasingly been tackled using (pretrained) transformer-based models which are finetuned using large collections of labeled data. However, in specialized domains such as e-commerce and healthcare, the viability of this approach is limited by the dearth of large in-domain data. To address this paucity, recent methods leverage these powerful models to generate high-quality task and domain-specific synthetic data. Prior work has largely explored synthetic data generation or query generation (QGen) for Question-Answering (QA) and binary (yes/no) relevance prediction, where for instance, the QGen models are given a document, and trained to generate a query relevant to that document. However in many problems, we have a more fine-grained notion of relevance than a simple yes/no label. Thus, in this work, we conduct a detailed study into how QGen approaches can be leveraged for nuanced relevance prediction. We demonstrate that – contrary to claims from prior works – current QGen approaches fall short of the more conventional cross-domain transfer-learning approaches. Via empirical studies spanning three public e-commerce benchmarks, we identify new shortcomings of existing QGen approaches – including their inability to distinguish between different grades of relevance. To address this, we introduce label-conditioned QGen models which incorporates knowledge about the different relevance. While our experiments demonstrate that these modifications help improve performance of QGen techniques, we also find that QGen approaches struggle to capture the full nuance of the relevance label space and as a result the generated queries are not faithful to the desired relevance label. View details
    Creator Context for Tweet Recommendation
    Matt Colen
    Sergey Levi
    Vladimir Ofitserov
    Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing: Industry Track
    Preview abstract When discussing a tweet, people usually not only refer to the content it delivers, but also to the person behind the tweet. In other words, grounding the interpretation of the tweet in the context of its creator plays an important role in deciphering the true intent and the importance of the tweet. In this paper, we attempt to answer the question of how creator context should be used to advance tweet understanding. Specifically, we investigate the usefulness of different types of creator context, and examine different model structures for incorporating creator context in tweet modeling. We evaluate our tweet understanding models on a practical use case -- recommending relevant tweets to news articles. This use case already exists in popular news apps, and can also serve as a useful assistive tool for journalists. We discover that creator context is essential for tweet understanding, and can improve application metrics by a large margin. However, we also observe that not all creator contexts are equal. Creator context can be time sensitive and noisy. Careful creator context selection and deliberate model structure design play an important role in creator context effectiveness. View details
    RankT5: Fine-Tuning T5 for Text Ranking with Ranking Losses
    Jianmo Ni
    Proc. of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR) (2023)
    Preview abstract Pretrained language models such as BERT have been shown to be exceptionally effective for text ranking. However, there are limited studies on how to leverage more powerful sequence-to-sequence models such as T5. Existing attempts usually formulate text ranking as a classification problem and rely on postprocessing to obtain a ranked list. In this paper, we propose RankT5 and study two T5-based ranking model structures, an encoder-decoder and an encoder-only one, so that they not only can directly output ranking scores for each query-document pair, but also can be fine-tuned with "pairwise" or "listwise" ranking losses to optimize ranking performance. Our experiments show that the proposed models with ranking losses can achieve substantial ranking performance gains on different public text ranking data sets. Moreover, ranking models fine-tuned with listwise ranking losses have better zero-shot ranking performance on out-of-domain data than models fine-tuned with classification losses. View details
    SparseEmbed: Learning Sparse Lexical Representations with Contextual Embeddings for Retrieval
    Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '23), ACM (2023) (to appear)
    Preview abstract In dense retrieval, prior work has largely improved retrieval effectiveness using multi-vector dense representations, exemplified by ColBERT. In sparse retrieval, more recent work, such as SPLADE, demonstrated that one can also learn sparse lexical representations to achieve comparable effectiveness while enjoying better interpretability. In this work, we combine the strengths of both the sparse and dense representations for first-stage retrieval. Specifically, we propose SparseEmbed – a novel retrieval model that learns sparse lexical representations with contextual embeddings. Compared with SPLADE, our model leverages the contextual embeddings to improve model expressiveness. Compared with ColBERT, our sparse representations are trained end-to-end to optimize both efficiency and effectiveness. View details
    SMILE: Evaluation and Domain Adaptation for Social Media Language Understanding
    Vasilisa Bashlovkina
    Riley Matthews
    Charles Kwong
    29TH ACM SIGKDD CONFERENCE ON KNOWLEDGE DISCOVERY AND DATA MINING (KDD) (2023)
    Preview abstract We study the ability of transformer-based language models (LMs) to understand social media language. Social media (SM) language is distinct from standard written language, yet existing benchmarks fall short of capturing LM performance in this socially, economically, and politically important domain. We quantify the degree to which social media language differs from conventional language and conclude that the difference is significant both in terms of token distribution and rate of linguistic shift. Next, we introduce a new benchmark for Social MedIa Language Evaluation (SMILE) that covers four SM platforms and eleven tasks. Finally, we show that learning a tokenizer and pretraining on a mix of social media and conventional language yields an LM that outperforms the best similar-sized alternative by 4.2 points on the overall SMILE score. View details
    Preview abstract Unbiased learning to rank (ULTR) studies the problem of mitigating various biases from implicit user feedback data such as clicks, and has been receiving considerable attention recently. A popular ULTR approach for real-world applications uses a two-tower architecture, where click modeling is factorized into a relevance tower with regular input features, and a bias tower with bias-relevant inputs such as the position of a document. A successful factorization will allow the relevance tower to be exempt from biases. In this work, we identify a critical issue that existing ULTR methods ignored - the bias tower can be confounded with the relevance tower via the underlying true relevance. In particular, the positions were determined by the logging policy, i.e., the previous production model, which would possess relevance information. We give both theoretical analysis and empirical results to show the negative effects on relevance tower due to such a correlation. We then propose two methods to mitigate the negative confounding effects by better disentangling relevance and bias. Offline empirical results on both controlled public datasets and a large-scale industry dataset show the effectiveness of the proposed approaches. We conduct a live experiment on a popular web store for four weeks, and find a significant improvement in user clicks over the baseline, which ignores the negative confounding effect. View details
    Preview abstract Market sentiment analysis on social media content requires knowledge of both financial markets and social media slang, which makes it a challenging task for human raters. The resulting lack of high-quality labeled data stands in the way of conventional supervised learning methods. Instead, we approach this problem using semi-supervised learning with a large language model (LLM). Our pipeline generates weak financial sentiment labels for Reddit posts with an LLM and then uses that data to train a small model that can be served in production. We find that prompting the LLM to produce chain-of-thought summaries and forcing it through several reasoning paths helps generate more stable and accurate labels, while using a regression loss further improves distillation quality. With only a handful of prompts, the final model performs on par with existing supervised models. Though production applications of our model are limited by ethical considerations, the model's competitive performance points to the great potential of using LLMs for tasks that otherwise require skill-intensive annotation. View details
    Learning Sparse Lexical Representations Over Expanded Vocabularies for Retrieval
    Proceedings of the 32nd ACM International Conference on Information and Knowledge Management (CIKM '23) (2023)
    Preview abstract A recent line of work in first-stage Neural Information Retrieval has focused on learning sparse lexical representations instead of dense embeddings. One such work is SPLADE, which has been shown to lead to state-of-the-art results in both the in-domain and zero-shot settings, can leverage inverted indices for efficient retrieval, and offers enhanced interpretability. However, existing SPLADE models are fundamentally limited to learning a sparse representation based on the native BERT WordPiece vocabulary. In this work, we extend SPLADE to support learning sparse representations over arbitrary sets of tokens to improve flexibility and aid integration with existing retrieval systems. As an illustrative example, we focus on learning a sparse representation over a large (300k) set of unigrams. We add an unsupervised pretraining task on C4 to learn internal representations for new tokens. Our experiments show that our Expanded-SPLADE model maintains the performance of WordPiece-SPLADE on both in-domain and zero-shot retrieval while allowing for custom output vocabularies. View details
    Preview abstract The pre-trained language model (eg, BERT) based deep retrieval models achieved superior performance over lexical retrieval models (eg, BM25) in many passage retrieval tasks. However, limited work has been done to generalize a deep retrieval model to other tasks and domains. In this work, we carefully select five datasets, including two in-domain datasets and three out-of-domain datasets with different levels of domain shift, and study the generalization of a deep model in a zero-shot setting. Our findings show that the performance of a deep retrieval model is significantly deteriorated when the target domain is very different from the source domain that the model was trained on. On the contrary, lexical models are more robust across domains. We thus propose a simple yet effective framework to integrate lexical and deep retrieval models. Our experiments demonstrate that these two models are complementary, even when the deep model is weaker in the out-of-domain setting. The combined model obtains an average of 20.4% relative gain over the deep retrieval model, and an average of 9.54% over the lexical model in three out-of-domain datasets. View details
    Preview abstract Multiclass classification (MCC) is a fundamental machine learning problem of classifying each instance into one of a predefined set of classes. Given an instance, an MCC model computes a score for each class, all of which are used to sort the classes. The performance of a model is usually measured by Top-K Accuracy/Error (e.g. K=1 or 5). In this paper, we do not aim to propose new neural network architectures as most recent works do, but to show that it is promising to boost MCC performance with a novel formulation through the lens of ranking. In particular, by viewing MCC as \emph{an instance class ranking problem}, we first argue that ranking metrics, such as Normalized Discounted Cumulative Gain, can be more informative than the existing Top-K metrics. We further demonstrate that the dominant neural MCC recipe can be transformed to a neural ranking pipeline. Based on such generalization, we show that it is intuitive to leverage techniques from the rich information retrieval literature to improve the MCC performance out of the box. Extensive empirical results on both text and image classification tasks with diverse datasets and backbone neural models show the value of our proposed framework. View details
    Preview abstract We explore a novel perspective of knowledge distillation (KD) for learning to rank (LTR), and introduce Self-Distilled neural Rankers (SDR), where student rankers are parameterized identically to their teachers. Unlike the existing ranking distillation work which pursues a good trade-off between performance and efficiency, SDR is able to significantly improve ranking performance of students over the teacher rankers without increasing model capacity. The key success factors of SDR, which differs from common distillation techniques for classification are: (1) an appropriate teacher score transformation function, and (2) a novel listwise distillation framework. Both techniques are specifically designed for ranking problems and are rarely studied in the existing knowledge distillation literature. Building upon the state-of-the-art neural ranking structure, SDR is able to push the limits of neural ranking performance above a recent rigorous benchmark study and significantly outperforms traditionally strong gradient boosted decision tree based models on 7 out of 9 key metrics, the first time in the literature. In addition to the strong empirical results, we give theoretical explanations on why listwise distillation is effective for neural rankers, and provide ablation studies to verify the necessity of the key factors in the SDR framework. View details
    Preview abstract Large Language Models (LLMs) have shown impressive results on a variety of text understanding tasks. Search queries though pose a unique challenge, given their short-length and lack of nuance or context. Complicated feature engineering efforts do not always lead to downstream improvements as their performance benefits may be offset by increased complexity of knowledge distillation. Thus, in this paper we make the following contributions: (1) We demonstrate that Retrieval Augmentation of queries provides LLMs with valuable additional context enabling improved understanding. While Retrieval Augmentation typically increases latency of LMs (thus hurting distillation efficacy), (2) we provide a practical and effective way of distilling Retrieval Augmentation LLMs. Specifically, we use a novel two-stage distillation approach that allows us to carry over the gains of retrieval augmentation, without suffering the increased compute typically associated with it. (3) We demonstrate the benefits of the proposed approach on a billion-scale, real-world query understanding system resulting in an X\% improvement. Via extensive experiments, including on public benchmarks, we believe this work offers a recipe for practical use of retrieval-augmented query understanding. View details
    Stochastic Retrieval-Conditioned Reranking
    Hamed Zamani
    The ACM SIGIR International Conference on the Theory of Information Retrieval (ICTIR) 2022
    Preview abstract The multi-stage cascaded architecture has been adopted by many search engines for efficient and effective retrieval. This architecture consists of a stack of retrieval and reranking models in which efficient retrieval models are followed by effective (neural) learning to rank models. The optimization of these learning to rank models is loosely connected to the early stage retrieval models. In many cases these learning to rank models are often trained in isolation of the early stage retrieval models. This paper draws theoretical connections between the early stage retrieval and late stage reranking models by deriving expected reranking performance conditioned on the early stage retrieval results. Our findings shed light on optimization of both retrieval and reranking models. As a result, we also introduce a novel loss function for training reranking models that leads to significant improvement in multiple public benchmarks. View details
    Multi-Aspect Dense Retrieval
    Swaraj Khadanga
    Wensong Xu
    Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining, ACM (2022)
    Preview abstract Prior work in Dense Retrieval usually encodes queries and documents using single-vector representations (also called embeddings) and performs retrieval in the embedding space using approximate nearest neighbor search. This paradigm enables efficient semantic retrieval. However, the single-vector representations can be ineffective at capturing different aspects of the queries and documents in relevance matching, especially for some vertical domains. For example, in e-commerce search, these aspects could be category, brand and color. Given a query "white nike socks", a Dense Retrieval model may mistakenly retrieve some "white adidas socks" while missing out the intended brand. We propose to explicitly represent multiple aspects using one embedding per aspect. We introduce an aspect prediction task to teach the model to capture aspect information with particular aspect embeddings. We design a lightweight network to fuse the aspect embeddings for representing queries and documents. Our evaluation using an e-commerce dataset shows impressive improvements over strong Dense Retrieval baselines. We also discover that the proposed aspect embeddings can enhance the interpretability of Dense Retrieval models as a byproduct. View details
    Scale Calibration of Deep Ranking Models
    28TH ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD) (2022), pp. 4300-4309
    Preview abstract Learning-to-Rank (LTR) systems are ubiquitous in web applications nowadays. The existing literature mainly focuses on improving ranking performance by trying to generate the optimal order of candidate items. However, virtually all advanced ranking functions are not scale calibrated. For example, rankers have the freedom to add a constant to all item scores without changing their relative order. This property has resulted in several limitations in deploying advanced ranking methods in practice. On the one hand, it limits the use of effective ranking functions in important applications. For example, in ads ranking, predicted Click-Through Rate (pCTR) is used for ranking and is required to be calibrated for the downstream ads auction. This is a major reason that existing ads ranking methods use scale calibrated pointwise loss functions that may sacrifice ranking performance. On the other hand, popular ranking losses are translation-invariant. We rigorously show that, both theoretically and empirically, this property leads to training instability that may cause severe practical issues. In this paper, we study how to perform scale calibration of deep ranking models to address the above concerns. We design three different formulations to calibrate ranking models through calibrated ranking losses. Unlike existing post-processing methods, our calibration is performed during training, which can resolve the training instability issue without any additional processing. We conduct experiments on the standard LTR benchmark datasets and one of the largest sponsored search ads dataset from Google. Our results show that our proposed calibrated ranking losses can achieve nearly optimal results in terms of both ranking quality and score scale calibration. View details
    Revisiting two tower models for unbiased learning to rank
    Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval (2022), 2410–2414
    Preview abstract Two-tower architecture (one tower to factorize out position-related bias) has now become a common technique in neural network ranking models for Unbiased Learning To Rank (ULTR). In these models, a neural network tower taking in all position related features is designed to model the biases, which are equivalent to the propensity scores used to define the unbiased ranking metrics. It works based on the assumptions that the user interaction (click) is conditioned on the user observation of a ranked item, and only the observation probability depends on the position. So if we factorize out the observation probability, we can then unbiased rank the items by their click rate conditioned on observation. The assumption appears sensible, and the additive two-tower models based on it have been widely implemented in ULTR. However, two-tower models may not always work and sometimes work even worse than the biased models, as the user may not always follow the same pattern. In this work, we stick to the plausible assumption about the user interaction, but we also consider the spectrum of different user behaviors. In this case, the assumption that the position related observation probability may not be able to get explicitly factorized out. We also study generic methods to treat this complexity and show these methods could outperform the simple additive debias models in offline experiments. View details
    Rax: Composable Learning-to-Rank using JAX
    Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (2022), 3051–3060
    Preview abstract Rax is a library for composable Learning-to-Rank (LTR) written entirely in JAX. The goal of Rax is to facilitate easy prototyping of LTR systems by leveraging the flexibility and simplicity of JAX. Rax provides a diverse set of popular ranking metrics and losses that integrate well with the rest of the JAX ecosystem. Furthermore, Rax implements a system of ranking-specific function transformations which allows fine-grained customization of ranking losses and metrics. Most notably Rax provides approx_t12n: a function transformation (t12n) that can transform any of our ranking metrics into an approximate and differentiable form that can be optimized. This provides a systematic way to directly optimize neural ranking models for ranking metrics that are not easily optimizable in other libraries. We empirically demonstrate the effectiveness of Rax by benchmarking neural models implemented using Flax and trained using Rax on two popular LTR benchmarks: WEB30K and Istella. Furthermore, we show that integrating ranking losses with T5, a large language model, can improve overall ranking performance on the MS MARCO passage ranking task. We are sharing the Rax library with the open source community as part of the larger JAX ecosystem at https://github.com/google/rax. View details
    Search and Discovery in Personal Email Collections (Tutorial Proposal)
    Proceedings of the 15th ACM International Conference on Web Search and Data Mining (2022), 1617–1619
    Preview abstract Email has been an essential communication medium for many years. As a result, the information accumulated in our mailboxes has become valuable for all of our personal and professional activities. For years, researchers have developed interfaces, models, and algorithms to facilitate email search, discovery, and organization. This tutorial brings together these diverse research directions and provides both a historical background, as well as a high-level overview of the recent advances in the field. In particular, we lay out all of the components needed in the design of email search engines, including user interfaces, indexing, document and query understanding, retrieval, ranking, evaluation, and data privacy. The tutorial also goes beyond search, presenting recent work on intelligent task assistance in email and a number of interesting future directions. View details
    Retrieval Enhanced Machine Learning
    Hamed Zamani
    SIGIR 2022: Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval (Perspectives Track)
    Preview abstract Information access systems have supported people during tasks across a variety of domains. In this perspective paper, we advocate for broadening the scope of information access research to include machines. We believe that machine learning can be substantially advanced by developing a research program around retrieval as a core algorithmic method. This paper describes how core principles of indexing, representation, retrieval, and relevance can extend supervised learning algorithms. It proposes a generic retrieval-enhanced machine learning (REML) framework and describes challenges in and opportunities introduced by implementing REML. We also discuss different optimization approaches for training REML models and review a number of case studies that are simplified and special implementations of the proposed framework. The research agenda introduced in this paper will smooth the path towards developing machine learning models with better scalability, sustainability, effectiveness, and interpretability. View details
    On Optimizing Top-K Metrics for Neural Ranking Models
    Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval (2022), 2303–2307
    Preview abstract Top-K metrics such as NDCG@K are frequently used to evaluate ranking performance. The traditional tree-based models such as LambdaMART, which are based on Gradient Boosted Decision Trees (GBDT), are designed to optimize NDCG@K using the LambdaRank losses. Recently, there is a good amount of research interest on neural ranking models for learning-to-rank tasks. These models are fundamentally different from the decision tree models and behave differently with respect to different loss functions. For example, the most popular ranking losses used in neural models are the Softmax loss and the GumbelApproxNDCG loss. These losses do not connect to top-K metrics such as NDCG@K naturally. It remains a question on how to effectively optimize NDCG@K for neural ranking models. In this paper, we follow the LambdaLoss framework and design novel and theoretically sound losses for NDCG@K metrics, while the original LambdaLoss paper can only do so using an unsound heuristic. We study the new losses on the LETOR benchmark datasets and show that the new losses work better than other losses for neural ranking models. View details
    Preview abstract The content on the web is in a constant state of flux. New entities, issues, and ideas continuously emerge, while the semantics of the existing conversation topics gradually shift. In recent years, pretrained language models like BERT greatly improved the state-of-the-art for a large spectrum of content understanding tasks. Therefore, in this paper, we aim to study how these language models can be adapted to better handle continuously evolving web content. In our study, we first analyze the evolution of 2013 – 2019 Twitter data, and unequivocally confirm that a BERT model trained on past tweets would heavily deteriorate when directly applied to data from later years. Then, we investigate two possible sources of the deterioration: the semantic shift of existing tokens and the sub-optimal or failed understanding of new tokens. To this end, we both explore two different vocabulary composition methods, as well as propose three sampling methods which help in efficient incremental training for BERT-like models. Compared to a new model trained from scratch offline, our incremental training (a) reduces the training costs, (b) achieves better performance on evolving content, and (c) is suitable for online deployment. The superiority of our methods is validated using two downstream tasks. We demonstrate significant improvements when incrementally evolving the model from a particular base year, on the task of Country Hashtag Prediction, as well as on the OffensEval 2019 task. View details
    Preview abstract Existing work on search result diversification typically falls into the "next document" paradigm, that is, selecting the next document based on the ones already chosen. A sequential process of selecting documents one-by-one is naturally modeled in learning-based approaches. However, such a process makes the learning difficult because there are an exponential number of ranking lists to consider. Sampling is usually used to reduce the computational complexity but this makes the learning less effective. In this paper, we propose a soft version of the "next document" paradigm in which we associate each document with an approximate rank, and thus the subtopics covered prior to a document can also be estimated. We show that we can derive differentiable diversification-aware losses, which are smooth approximation of diversity metrics like alpha-NDCG, based on these estimates. We further propose to optimize the losses in the learning-to-rank setting using neural distributed representations of queries and documents. Experiments are conducted on the public benchmark TREC datasets. By comparing with an extensive list of baseline methods, we show that our Diversification-Aware LEarning-TO-Rank (DALETOR) approaches outperform them by a large margin, while being much simpler during learning and inference. View details
    Preview abstract Despite the success of neural models in many major machine learning problems and recently published neural learning to rank (LTR) papers in top venues, the effectiveness of neural models on traditional LTR problems is still not widely acknowledged. We first validate the concern by showing that most recent neural LTR models are, by a large margin, inferior to the best publicly available tree-based implementation, which is sometimes ignored in recent neural LTR papers. We then investigate why existing neural LTR suffers by identifying several of its weaknesses. To that end, we propose a new neural LTR framework that mitigates these weaknesses, by borrowing ideas from several research fields. Our models are able to perform comparatively with the strong tree-based baseline, while outperforming recently published neural learning to rank methods by a large margin. Our results also serve as a benchmark for neural learning to rank models. View details
    Preview abstract Document layout comprises both structural and visual (\eg font size) information that are vital but often ignored by machine learning models. The few existing models which do use layout information only consider \textit{textual} contents, and overlook the existence of contents in other modalities such as images. Additionally, spatial interactions of presented contents in a layout was never fully exploited. On the other hand, a series of document understanding tasks are calling out for layout information. One example is given a position in a document, which image is the best to fit in. To address current models' limitations and tackle layout-aware document understanding tasks, we first parse a document into blocks whose content can be textual, tabular, or multimedia (\eg images) using a proprietary tool. We then propose a novel hierarchical framework, LAMPreT, to encode the blocks. Our LAMPreT model encodes each block with a multimodal transformer in the lower-level, and aggregates the block-level representations and connections utilizing a specifically designed transformer at the higher-level. We design hierarchical pre-training objectives where the lower-level model is trained with the standard masked language modeling (MLM) loss and the multimodal alignment loss, and the higher-level model is trained with three layout-aware objectives: (1) block-order predictions, (2) masked block predictions, and (3) image fitting predictions. We test the proposed model on two layout-aware tasks -- image suggestions and text block filling, and show the effectiveness of our proposed hierarchical architecture as well as pre-training techniques. View details
    Natural Language Understanding with Privacy-Preserving BERT
    Proceedings of the 30th ACM International Conference on Information and Knowledge Management, ACM (2021)
    Preview abstract Privacy preservation remains a key challenge in data mining and Natural Language Understanding (NLU). Previous research shows that the input text or even text embeddings can leak private information. This concern motivates our research on effective privacy preservation approaches for pretrained Language Models (LMs). We investigate the privacy and utility implications of applying dχ-privacy, a variant of Local Differential Privacy, to BERT fine-tuning in NLU applications. More importantly, we further propose privacy-adaptive LM pretraining methods and show that our approach can boost the utility of BERT dramatically while retaining the same level of privacy protection. We also quantify the level of privacy preservation and provide guidance on privacy configuration. Our experiments and findings lay the groundwork for future explorations of privacy-preserving NLU with pretrained LMs. View details
    Preview abstract A well-known challenge in leveraging implicit user feedback like clicks to improve real-world search services and recommender systems is its inherent bias. Most existing click models are based on the examination hypothesis in user behaviors and differ in how to model such an examination bias. However, they are constrained by assuming a simple position-based bias or enforcing a sequential order in user examination behaviors. These assumptions are insufficient to capture complex real-world user behaviors and hardly generalize to modern user interfaces (UI) in web applications (e.g., results shown in a grid view). In this work, we propose a fully data-driven neural model for the examination bias, Cross-Positional Attention (XPA), which is more flexible in fitting complex user behaviors. Our model leverages the attention mechanism to effectively capture cross-positional interactions among displayed items and is applicable to arbitrary UIs. We employ XPA in a novel neural click model that can both predict clicks and estimate relevance. Our experiments on offline synthetic data sets show that XPA is robust among different click generation processes. We further apply XPA to a large-scale real-world recommender system, showing significantly better results than baselines in online A/B experiments that involve millions of users. This validates the necessity to model more complex user behaviors than those proposed in the literature. View details
    Improving Cloud Storage Search with User Activity
    Proceedings of the 14th International Conference on Web Search and Data Mining (WSDM '21), ACM (2021)
    Preview abstract Cloud-based file storage platforms such as Google Drive are widely used as a means for storing, editing and sharing personal and organizational documents. In this paper, we improve search ranking quality for cloud storage platforms by utilizing user activity logs. Different from search logs, activity logs capture general document usage activity beyond search, such as opening, editing and sharing documents. We propose to automatically learn text embeddings that are effective for search ranking from activity logs. We develop a novel co-access signal, i.e., whether two documents were accessed by a user around the same time, to train deep semantic matching models that are useful for improving the search ranking quality. We confirm that activity-trained semantic matching models can improve ranking by conducting extensive offline experimentation using Google Drive search and activity logs. To the best of our knowledge, this is the first work to examine the benefits of leveraging document usage activity at large scale for cloud storage search; as such it can shed light on using such activity in scenarios where direct collection of search-specific interactions (e.g., query and click logs) may be expensive or infeasible. View details
    Interpretable Ranking with Generalized Additive Models
    Alexander Grushetsky
    Petr Mitrichev
    Ethan Sterling
    Nathan Bell
    Walker Ravina
    Hai Qian
    Proceedings of the 14th ACM International Conference on Web Search and Data Mining (WSDM) (2021)
    Preview abstract Interpretability of ranking models is a crucial yet relatively under-examined research area. Recent progress on this area largely focuses on generating post-hoc explanations for existing black-box ranking models. Though promising, such post-hoc methods cannot provide sufficiently accurate explanations in general, which makes them infeasible in many high-stakes scenarios, especially the ones with legal or policy constraints. Thus, building an intrinsically interpretable ranking model with transparent, self-explainable structure becomes necessary, but this remains less explored in the learning-to-rank setting. In this paper, we lay the groundwork for intrinsically interpretable learning-to-rank by introducing generalized additive models (GAMs) into ranking tasks. Generalized additive models (GAMs) are intrinsically interpretable machine learning models and have been extensively studied on regression and classification tasks. We study how to extend GAMs into ranking models which can handle both item-level and list-level features and propose a novel formulation of ranking GAMs. To instantiate ranking GAMs, we employ neural networks instead of traditional splines or regression trees. We also show that our neural ranking GAMs can be distilled into a set of simple and compact piece-wise linear functions that are much more efficient to evaluate with little accuracy loss. We conduct experiments on three data sets and show that our proposed neural ranking GAMs can outperform other traditional GAM baselines while maintaining similar interpretability. View details
    Preview abstract We describe how we built three recommendation products from scratch at Google Chrome Web Store, namely context-based recommendations, related extension recommendations, and personalized recommendations. Unlike most existing papers that focus on novel algorithms, this paper focuses on sharing practical experiences building large scale recommender systems under various real-world constraints, such as privacy constraints, data sparsity issues, highly skewed data distribution, and product design choices, such as user interface. We show how these constraints make standard approaches difficult to succeed in practice. We share success stories that turn very negative live metrics to very positive, by introducing 1) how we use interpretable neural models to bootstrap the systems, helps identifying pipeline issues, and paves way for more advanced models. 2) A new item-item based recommendation algorithm that works under highly skewed data distributions, and 3) how two products can help bootstrapping the third one, which significantly reduces development cycles and bypasses various real-world difficulties. All the explorations in this work are verified in live traffic on millions of users. We believe the findings in this work can help practitioners to bootstrap and build large-scale recommender systems. View details
    WIT: Wikipedia-based Image Text Dataset for Multimodal Multilingual Machine Learning
    Jiecao Chen
    Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '21) (2021)
    Preview abstract The milestone improvements brought about by deep representation learning and pre-training techniques have led to large performance gains across downstream NLP, IR and Vision tasks. Multimodal modeling techniques aim to leverage high-quality visio-linguistic datasets for learning complementary information (across image and text modalities). In this paper, we introduce the Wikipedia-based Image Text (WIT) Dataset to better facilitate multimodal, multilingual learning. WIT is composed of 11 million+ unique images with over 37 million entity rich text descriptions associated with these images in Wikipedia from over 100 languages. Its size enables WIT to be used as a pretraining dataset for multimodal models, as we show when applied to downstream tasks such as image-text retrieval. WIT has four main and unique advantages. First, WIT is the largest multimodal dataset (at the time of writing). Second, it is massively multilingual (first of its kind) with coverage over 100+ languages (each of which has at least 10K examples) and provides cross-lingual texts for many images. Third, it represents a more diverse set of concepts and real world entities relative to what previous datasets cover. Lastly, as we demonstrate empirically, WIT provides a very challenging real-world test set that empirically highlights the need for learning improvements in tasks such as Retrieval and Captioning. View details
    Ensemble Distillation for BERT-Based Ranking Models
    Shuguang Han
    Proceedings of the 2021 ACM SIGIR International Conference on the Theory of Information Retrieval (ICTIR ’21)
    Preview abstract Over the past two years, large pretrained language models such as BERT have been applied to text ranking problems and showed superior performance on multiple public benchmark data sets. Prior work demonstrated that an ensemble of multiple BERT-based ranking models can not only boost the performance, but also reduce the performance variance. However, an ensemble of models is more costly because it needs computing resource and/or inference time proportional to the number of models. In this paper, we study how to retain the performance of an ensemble of models at the inference cost of a single model by distilling the ensemble into a single BERT-based student ranking model. Specifically, we study different designs of teacher labels, various distillation strategies, as well as multiple distillation losses tailored for ranking problems. We conduct experiments on the MS MARCO passage ranking and the TREC-COVID data set. Our results show that even with these simple distillation techniques, the distilled model can effectively retain the performance gain of the ensemble of multiple models. More interestingly, the performances of distilled models are also more stable than models fine-tuned on original labeled data. The results reveal a promising direction to capitalize on the gains achieved by an ensemble of BERT-based ranking models. View details
    Preview abstract How to leverage cross-document interactions to improve ranking performance is an important topic in information retrieval research. The recent developments in deep learning show strength in modeling complex relationships across sequences and sets. It thus motivates us to study how to leverage cross-document interactions for learning-to-rank in the deep learning framework. In this paper, we formally define the permutation equivariance requirement for a scoring function that captures cross-document interactions. We then propose a self-attention based document interaction network that extends any univariate scoring function with contextual features capturing cross-document interactions. We show that it satisfies the permutation equivariance requirement, and can generate scores for document sets of varying sizes. Our proposed methods can automatically learn to capture document interactions without any auxiliary information, and can scale across large document sets. We conduct experiments on four ranking datasets: the public benchmarks WEB30K and Istella, as well as Gmail search and Google Drive Quick Access datasets. Experimental results show that our proposed methods lead to significant quality improvements over state-of-the-art neural ranking models, and are competitive with state-of-the-art gradient boosted decision tree (GBDT) based models on the WEB30K dataset. View details
    Feature Transformation for Neural Ranking Models
    Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2020), pp. 1649-1652
    Preview abstract Although neural network models enjoy tremendous advantages in handling image and text data, tree-based models still remain competitive for learning-to-rank tasks with numerical data. A major strength of tree-based ranking models is the insensitivity to different feature scales, while neural ranking models may suffer from features with varying scales or skewed distributions. Feature transformation or normalization is a simple technique which preprocesses input features to mitigate their potential adverse impact on neural models. However, due to lack of studies, it is unclear to what extent feature transformation can benefit neural ranking models. In this paper, we aim to answer this question by providing empirical evidence for learning-to-rank tasks. First, we present a list of commonly used feature transformation techniques and perform a comparative study on multiple learning-to-rank data sets. Then we propose a mixture feature transformation mechanism which can automatically derive a mixture of basic feature transformation functions to achieve the optimal performance. Our experiments show that applying feature transformation can substantially improve the performance of neural ranking models compared to directly using the raw features. In addition, the proposed mixture transformation method can further improve the performance of the ranking model without any additional human effort. View details
    Preview abstract Recent neural ranking algorithms focus on learning semantic matching between query and document terms. However, practical learning to rank systems typically rely on a wide range of side information beyond query and document textual features, like location, user context, etc. It is common practice to concatenate all of these features and rely on deep models to learn a complex representation. We study how to effectively and efficiently combine textual information from queries and documents with other useful but less prominent side information for learning to rank. We conduct synthetic experiments to show that: 1) neural networks are inefficient at learning the interaction between two prominent features (e.g., query and document embedding features) in the presence of other less prominent features; 2) direct application of a state-of-art method for higher-order feature generation is also inefficient at learning such important interactions. Based on the above observations, we propose a simple but effective matching cross network (MCN) method for learning to rank with side information. MCN conducts an element-wise multiplication matching of query and document embeddings and leverages a technique called latent cross to effectively learn the interaction between matching output and all side information. The approach is easy to implement, adds minimal parameters and latency overhead to standard neural ranking architectures, and can be used for efficient end-to-end training. We conduct extensive experiments using two of the world's largest personal search engines, Gmail and Google Drive search, and show that each proposed component adds meaningful gains against a strong production baseline with minimal latency overhead, thereby demonstrating the practical effectiveness and efficiency of the proposed approach. View details
    Learning to Cluster Documents into Workspaces Using Large Scale Activity Logs
    Proceedings of the 26th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD ’20), ACM (2020), 2416–2424
    Preview abstract Google Drive is widely used for managing personal and work-related documents in the cloud. To help users organize their documents in Google Drive, we develop a new feature to allow users to create a set of working files for ongoing easy access, called workspace. A workspace is a cluster of documents, but unlike a typical document cluster, it contains documents that are not only topically coherent, but are also useful in the ongoing user tasks. To alleviate the burden of creating workspaces manually, we automatically cluster documents into suggested workspaces. We go beyond the textual similarity-based unsupervised clustering paradigm and instead directly learn from users’ activity for document clustering. More specifically, we extract co-access signals (i.e., whether a user accessed two documents around the same time) to measure document relatedness. We then use a neural document similarity model that incorporates text, metadata, as well as co-access features. Since human labels are often difficult or expensive to collect, we extract weak labels based on co-access data at large scale for model training. Our offline and online experiments based on Google Drive show that (a) co-access features are very effective for document clustering; (b) our weakly supervised clustering achieves comparable or even better performance compared to the models trained with human labels; and (c) the weakly supervised method leads to better workspace suggestions that the users accept more often in the production system than baseline approaches. View details
    Preview abstract Many natural language processing and information retrieval problems can be formalized as the task of semantic matching. Existing work in this area has been largely focused on matching between short texts (e.g., question answering), or between a short and a long text (e.g., ad-hoc retrieval). Semantic matching between long-form documents, which has many important applications like news recommendation, related article recommendation and document clustering, is relatively less explored and needs more research effort. In recent years, self-attention based models like Transformers and BERT have achieved state-of-the-art performance in the task of text matching. These models, however, are still limited to short text like a few sentences or one paragraph due to the quadratic computational complexity of self-attention with respect to input text length. In this paper, we address the issue by proposing the Siamese Multi-depth Transformer-based Hierarchical (SMITH) Encoder for long-form document matching. Our model contains several innovations to adapt self-attention models for longer text input. We propose a transformer based hierarchical encoder to capture the document structure information. In order to better capture sentence level semantic relations within a document, we pre-train the model with a novel masked sentence block language modeling task in addition to the masked word language modeling task used by BERT. Our experimental results on several benchmark datasets for long-form document matching show that our proposed SMITH model outperforms the previous state-of-the-art models including hierarchical attention, multi-depth attention-based hierarchical recurrent neural network, and BERT. Comparing to BERT based baselines, our model is able to increase maximum input text length from 512 to 2048. We will open source a Wikipedia based benchmark dataset, code and a pre-trained checkpoint to accelerate future research on long-form document matching. View details
    A Stochastic Treatment of Learning to Rank Scoring Functions
    Sebastian Bruch
    Shuguang Han
    Proceedings of the 13th ACM International Conference on Web Search and Data Mining (WSDM 2020), pp. 61-69
    Preview abstract Learning to Rank, a central problem in information retrieval, is a class of machine learning algorithms that formulate ranking as an optimization task. The objective is to learn a function that produces an ordering of a set of objects in such a way that the utility of the entire ordered list is maximized. Learning-to-rank methods do so by constructing a function that computes a score for each object in the set. A ranked list is then compiled by sorting objects according to their scores. While such a deterministic mapping of scores to permutations makes sense during inference where stability of ranked lists is required, we argue that its greedy nature during training leads to less robust models. This is particularly problematic when the loss function under optimization---in agreement with ranking metrics---only penalizes incorrect rankings and does not take into account the distribution of raw scores. In this work, we present a stochastic framework where, instead of a deterministic derivation of permutations from raw scores, permutations are sampled from a distribution defined by raw scores. Our proposed sampling method is differentiable and works well with gradient descent optimizers. We analytically study our proposed method and demonstrate when and why it leads to model robustness. We also show empirically, through experiments on publicly available learning-to-rank datasets, that the application of our proposed method to a class of ranking loss functions leads to significant model quality improvements. View details
    Preview abstract This paper describes a machine learning algorithm for document (re)ranking, in which queries and documents are firstly encoded using BERT [1], and on top of that a learning-to-rank (LTR) model constructed with TF-Ranking (TFR) [2] is applied to further optimize the ranking performance. This approach is proved to be effective in a public MS MARCO benchmark [3]. Our submissions achieve the best performance for the passage re-ranking task as of March 30, 2020 [4], and the second best performance for the passage full-ranking task as of April 10, 2020 [5], demonstrating the effectiveness of combining ranking losses with BERT representations for document ranking. View details
    Preview abstract Search engines often follow a 2-phase paradigm where in the first step an initial set of documents is retrieved (the \emp{retrieval} step) and in the second step the documents are ranked so as to obtain the final result list (the \emp{re-ranking} step). The focus of this paper is on improving the \emph{retrieval} step (measured mainly by recall) using deep neural network-based approaches. While deep neural networks were shown to improve the performance of the re-ranking step, there is little literature about using deep neural networks to improve the retrieval step. Previous works on deep neural networks for IR usually apply a simple lexical retrieval model for the retrieval step (e.g., BM25) and emphasize on the re-ranking step. In this paper, we propose and study a hybrid retrieval approach, which leverages both semantic (deep neural network based) and lexical (keyword matching based like BM25) matching techniques. The main idea is to perform semantic and lexical retrieval in parallel, and then to combine the result lists to generate the initial result set for re-ranking. An empirical evaluation, using a public TREC collection, shows that semantic retrieval model generated result lists often contain a substantial number of relevant documents not covered by the lexical-based generated lists. Further analysis of these relevant documents shows that they often also exhibit different characteristics than the lexical-based documents, attesting to the complementary nature of the two approaches. Finally, the experiments show that by combining the two result lists, the recall of the result list can increase significantly, the retrieval step can be greatly improved and these improvements are highly robust. View details
    Adversarial Bandits Policy for Crawling Commercial Web Content
    Shuguang Han
    Przemek Gajda
    Sergey Novikov
    Alexandrin Popescul
    Proceedings of the Web Conference 2020 (WWW 2020), pp. 407-417
    Preview abstract The rapid growth of commercial web content has driven the development of shopping search services to help users find product offers. Due to the dynamic nature of commercial content, an effective recrawl policy is a key component in a shopping search service; it ensures that users have access to the up-to-date product details. Most of the existing strategies either relied on simple heuristics, or overlooked the resource budgets. To address this, Azar et al. [5] recently proposed an optimization strategy LambdaCrawl aiming to maximize content freshness within a given resource budget. In this paper, we demonstrate that the effectiveness of LambdaCrawl is governed in large part by how well future content change rate can be estimated. By adopting the state-of-the-art deep learning models for change rate prediction, we obtain a substantial increase of content freshness over the common LambdaCrawl implementation with change rate estimated from the past history. Moreover, we demonstrate that while LambdaCrawl is a significant advancement upon existing recrawl strategies, it can be further improved upon by a unified multi-strategy recrawl policy. To this end, we adopt the $K$-armed adversarial bandits algorithm that can provably optimize the overall freshness by combining multiple strategies. Empirical results over a large-scale production dataset confirm its superiority to LambdaCrawl, especially under tight resource budgets. View details
    Preview abstract Pre-trained models like BERT have dominated NLP / IR applications such as single sentence classification, text pair classification, and question answering. However, deploying these models in real systems is highly non-trivial due to their exorbitant computational costs. A common remedy to this is knowledge distillation, leading to faster inference. However –as we show here – existing works are not optimized for dealing with pairs (or tuples) of texts. Consequently, they are either not scalable or demonstrate subpar performance. In this work,we propose DiPair— a novel framework for distilling fast and accurate models on text pair tasks. Coupled with an end-to-end training strategy, DiPair is both highly scalable and offers improved quality-speed tradeoffs. Empirical studies conducted on both academic and real-world e-commerce benchmarks demonstrate the efficacy of the proposed approach with speedups of over 350x and minimal quality drop relative to the cross-attention teacherBERT model. View details
    Learning Groupwise Multivariate Scoring Functions Using Deep Neural Networks
    Qingyao Ai
    Sebastian Bruch
    Proceedings of the 5th ACM SIGIR International Conference on the Theory of Information Retrieval (ICTIR) (2019), pp. 85-92
    Preview abstract While in a classification or a regression setting a label or a value is assigned to each individual document, in a ranking setting we determine the relevance ordering of the entire input document list. This difference leads to the notion of relative relevance between documents in ranking. The majority of the existing learning-to-rank algorithms model such relativity at the loss level using pairwise or listwise loss functions. However, they are restricted to univariate scoring functions, i.e., the relevance score of a document is computed based on the document itself, regardless of other documents in the list. To overcome this limitation, we propose a new framework for multivariate scoring functions, in which the relevance score of a document is determined jointly by multiple documents in the list. We refer to this framework as GSFs---groupwise scoring functions. We learn GSFs with a deep neural network architecture, and demonstrate that several representative learning-to-rank algorithms can be modeled as special cases in our framework. We conduct evaluation using click logs from one of the largest commercial email search engines, as well as a public benchmark dataset. In both cases, GSFs lead to significant performance improvements, especially in the presence of sparse textual features. View details
    Preview abstract Existing unbiased learning-to-rank models use counterfactual inference, notably Inverse Propensity Scoring (IPS), to learn a ranking function from biased click data. They handle the click incompleteness bias, but usually assume that the clicks are noise-free, i.e., a clicked document is always assumed to be relevant. In this paper, we relax this unrealistic assumption and study click noise explicitly in the unbiased learning-to-rank setting. Specifically, we model the noise as the position-dependent trust bias and propose a noise-aware Position-Based Model, named TrustPBM, to better capture user click behavior. We propose an Expectation-Maximization algorithm to estimate both examination and trust bias from click data in TrustPBM. Furthermore, we show that it is difficult to use a pure IPS method to incorporate click noise and thus propose a novel method that combines a Bayes rule application with IPS for unbiased learning-to-rank. We evaluate our proposed methods on three personal search data sets and demonstrate that our proposed model can significantly outperform the existing unbiased learning-to-rank methods. View details
    TF-Ranking: Scalable TensorFlow Library for Learning-to-Rank
    Sebastian Bruch
    Rohan Anil
    Stephan Wolf
    Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD) (2019), pp. 2970-2978
    Preview abstract Learning-to-Rank deals with maximizing the utility of a list of examples presented to the user, with items of higher relevance being prioritized. It has several practical applications such as large-scale search, recommender systems, document summarization and question answering. While there is widespread support for classification and regression based learning, support for learning-to-rank in deep learning has been limited. We propose TensorFlow Ranking, the first open source library for solving large-scale ranking problems in a deep learning framework. It is highly configurable and provides easy-to-use APIs to support different scoring mechanisms, loss functions and evaluation metrics in the learning-to-rank setting. Our library is developed on top of TensorFlow and can thus fully leverage the advantages of this platform. For example, it is highly scalable, both in training and in inference, and can be used to learn ranking models over massive amounts of user activity data, which can include heterogeneous dense and sparse features. We empirically demonstrate the effectiveness of our library in learning ranking functions for large-scale search and recommendation applications in Gmail and Google Drive. We also show that ranking models built using our model scale well for distributed training, without significant impact in metrics. The proposed library is available to the open source community, with the hope that it facilitates further academic research and industrial applications in the field of learning-to-rank. View details
    Predictive Crawling for Commercial Web Content
    Shuguang Han
    Przemek Gajda
    Sergey Novikov
    Robin Dua
    Alexandrin Popescul
    Proceedings of the 2019 World Wide Web Conference, pp. 627-637
    Preview abstract Web crawlers spend significant resources to maintain freshness of their crawled data. This paper describes the optimization of resources to ensure that product prices shown in ads in a context of a shopping sponsored search service are synchronized with current merchant prices. We are able to use the predictability of price changes to build a machine learned system leading to considerable resource savings for both the merchants and the crawler. We describe our solution to technical challenges due to partial observability of price history, feedback loops arising from applying machine learned models, and offers in cold start state. Empirical evaluation over large-scale product crawl data demonstrates the effectiveness of our model and confirms its robustness towards unseen data. We argue that our approach can be applicable in more general data pull settings. View details
    Domain Adaptation for Enterprise Email Search
    Brandon Tran
    Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR) (2019)
    Preview abstract In the enterprise email search setting, the same search engine often powers multiple enterprises from various industries: technology, education, manufacturing, etc. However, using the same global ranking model across different enterprises may result in suboptimal search quality, due to the corpora differences and distinct information needs. On the other hand, training an individual ranking model for each enterprise may be infeasible, especially for smaller institutions with limited data. To address this data challenge, in this paper we propose a domain adaptation approach that fine-tunes the global model to each individual enterprise. In particular, we propose a novel application of the Maximum Mean Discrepancy (MMD) approach to information retrieval, which attempts to bridge the gap between the global data distribution and the distribution arising from an individual enterprise. We conduct a comprehensive set of experiments on a large-scale email search engine, and demonstrate that the MMD approach consistently improves the search quality for multiple individual domains, both in comparison to the global ranking model, as well as several competitive domain adaptation baselines including adversarial learning methods. View details
    Revisiting Approximate Metric Optimization in the Age of Deep Neural Networks
    Sebastian Bruch
    Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '19) (2019), pp. 1241-1244
    Preview abstract Learning-to-Rank is a branch of supervised machine learning that seeks to produce an ordering of a list of items such that the utility of the ranked list is maximized. Unlike most machine learning techniques, however, the objective cannot be directly optimized using gradient descent methods as it is either discontinuous or flat everywhere. As such, learning-to-rank methods often optimize a loss function that either is loosely related to or upper-bounds a ranking utility instead. A notable exception is the approximation framework originally proposed by Qin et al. that facilitates a more direct approach to ranking metric optimization. We revisit that framework almost a decade later in light of recent advances in neural networks and demonstrate its superiority empirically. Through this study, we hope to show that the ideas from that work are more relevant than ever and can lay the foundation of learning-to-rank research in the age of deep neural networks. View details
    An Analysis of the Softmax Cross Entropy Loss for Learning-to-Rank with Binary Relevance
    Sebastian Bruch
    Proceedings of the 2019 ACM SIGIR International Conference on the Theory of Information Retrieval (ICTIR 2019), pp. 75-78
    Preview abstract One of the challenges of learning-to-rank for information retrieval is that ranking metrics are not smooth and as such cannot be optimized directly with gradient descent optimization methods. This gap has given rise to a large body of research that reformulates the problem to fit into existing machine learning frameworks or defines a surrogate, ranking-appropriate loss function. One such loss ListNet's which measures the cross entropy between a distribution over documents obtained from scores and another from ground-truth labels. This loss was designed to capture permutation probabilities and as such is considered to be only loosely related to ranking metrics. In this work, however, we show that the above statement is not entirely accurate. In fact, we establish an analytical connection between softmax cross entropy and two popular ranking metrics in a learning-to-rank setup with binary relevance labels. In particular, we show that ListNet's loss bounds Mean Reciprocal Rank as well as Normalized Discounted Cumulative Gain. Our analysis sheds light on the behavior of that loss function and explains its superior performance on binary labeled data over data with graded relevance. View details
    Preview abstract Spell correction is a must-have feature for any modern search engine in applications such as web or e-commerce search. Typical spell correction solutions used in production systems consist of large indexed lookup tables based on a global model trained across many users over a large scale web corpus or a query log. For search over personal corpora, such as email, this global solution is not sufficient, as it ignores the user's personal lexicon. Without personalization, global spelling fails to correct tail queries drawn from a user's own, often idiosyncratic, lexicon. Personalization using existing algorithms is difficult due to resource constraints and unavailability of sufficient data to build per-user models. In this work, we propose a simple and effective personalized spell correction solution that augments existing global solutions for search over private corpora. Our event driven spell correction candidate generation method is specifically designed with personalization as the key construct. Our novel spell correction and query completion algorithms do not require complex model training and is highly efficient. The proposed solution has shown over 30% click-through rate gain on affected queries when evaluated against a range of strong commercial personal search baselines - Google's Gmail, Drive, and Calendar search production systems. View details
    Multi-view Embedding-based Synonyms for Personal Search
    Hongbo Deng
    Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '19) (2019), pp. 575-584
    Preview abstract Synonym expansion is a technique that adds related words to search queries, which may lead to more relevant documents being retrieved, thus improving recall. There is extensive prior work on synonym expansion for web search, however very few studies have tackled its application for email search. Synonym expansion for private corpora like emails poses several unique research challenges. First, the emails are not shared across users, which precludes us from directly employing query-document bipartite graphs, which are standard in web search synonym expansion. Second, user search queries are of personal nature, and may not be generalizable across users. Third, the size of the underlying corpora from which the synonyms may be mined is relatively small (i.e., user's private email inbox) compared to the size of the web corpus. Therefore, in this paper, we propose a solution tailored to the challenges of synonym expansion for email search. We formulate it as a multi-view learning problem, and propose a novel embedding-based model that joins information from multiple sources to obtain the optimal synonym candidates. To demonstrate the effectiveness of the proposed technique, we evaluate our model using both explicit human ratings as well as a live experiment using the Gmail Search service, one of the world's largest email search engines. View details
    Preview abstract Semantic text matching is one of the most important research problems in many domains, including, but not limited to, information retrieval, question answering, and recommendation. Among the different types of semantic text matching, long-document-to-long-document text matching has many applications, but has rarely been studied. Most existing approaches for semantic text matching have limited success in this setting, due to their inability to capture and distill the main ideas and topics from long-form text. In this paper, we propose a novel Siamese multi-depth attention-based hierarchical recurrent neural network (SMASH RNN) that learns the long-form semantics, and enables long-form document based semantic text matching. In addition to word information, SMASH RNN is using the document structure to improve the representation of long-form documents. Specifically, SMASH RNN synthesizes information from different document structure levels, including paragraphs, sentences, and words. An attention-based hierarchical RNN derives a representation for each document structure level. Then, the representations learned from the different levels are aggregated to learn a more comprehensive semantic representation of the entire document. For semantic text matching, a Siamese structure couples the representations of a pair of documents, and infers a probabilistic score as their similarity. We conduct an extensive empirical evaluation of SMASH RNN with three practical applications, including email attachment suggestion, related article recommendation, and citation recommendation. Experimental results on public data sets demonstrate that SMASH RNN significantly outperforms competitive baseline methods across various classification and ranking scenarios in the context of semantic matching of long-form documents. View details
    Learning Groupwise Scoring Functions Using Deep Neural Networks
    Qingyao Ai
    Proceedings of the First International Workshop On Deep Matching In Practical Applications (2019)
    Preview abstract While in a classification or a regression setting a label or a value is assigned to each individual document, in a ranking setting we determine the relevance ordering of the entire input document list. This difference leads to the notion of relative relevance between documents in ranking. The majority of the existing learning-to-rank algorithms model such relativity at the loss level using pairwise or listwise loss functions. However, they are restricted to pointwise scoring functions, i.e., the relevance score of a document is computed based on the document itself, regardless of the other documents in the list. In this paper, we overcome this limitation by proposing generalized groupwise scoring functions (GSFs), in which the relevance score of a document is determined jointly by groups of documents in the list. We learn GSFs with a deep neural network architecture, and demonstrate that several representative learning-to-rank algorithms can be modeled as special cases in our framework. We conduct evaluation using the public MSLR-WEB30K dataset, and our experiments show that GSFs lead to significant performance improvements both in a standalone deep learning architecture, or when combined with a state-of-the-art tree-based learning-to-rank algorithm. View details
    Position Bias Estimation for Unbiased Learning to Rank in Personal Search
    Proceedings of the 11th ACM International Conference on Web Search and Data Mining (WSDM), ACM (2018), pp. 610-618
    Preview abstract A well-known challenge in learning from click data is its inherent bias and most notably position bias. Traditional click models aim to extract the (query, document) relevance and the estimated bias is usually discarded after relevance is extracted. In contrast, the most recent work on unbiased learning-to-rank can effectively leverage the bias and thus focuses on estimating bias rather than relevance. Existing approaches use search result randomization over a small percentage of production traffic to estimate the position bias. This is not desired because result randomization can negatively impact users' search experience. In this paper, we compare different schemes for result randomization (i.e., RandTopN and RandPair) and show their negative effect in personal search. Then we study how to infer such bias from regular click data without relying on randomization. We propose a regression-based Expectation-Maximization (EM) algorithm that is based on a position bias click model and that can handle highly sparse clicks in personal search. We evaluate our EM algorithm and the extracted bias in the learning-to-rank setting. Our results show that it is promising to extract position bias from regular clicks without result randomization. The extracted bias can improve the learning-to-rank algorithms significantly. In addition, we compare the pointwise and pairwise learning-to-rank models. Our results show that pairwise models are more effective in leveraging the estimated bias. View details
    The LambdaLoss Framework for Ranking Metric Optimization
    Proceedings of The 27th ACM International Conference on Information and Knowledge Management (CIKM '18), ACM (2018), pp. 1313-1322
    Preview abstract How to optimize ranking metrics such as Normalized Discounted Cumulative Gain (NDCG) is an important but challenging problem, because ranking metrics are either flat or discontinuous everywhere, which makes them hard to be optimized directly. Among existing approaches, LambdaRank is a novel algorithm that incorporates ranking metrics into its learning procedure. Though empirically effective, it still lacks theoretical justification. For example, the underlying loss that LambdaRank optimizes for remains unknown until now. Due to this, there is no principled way to advance the LambdaRank algorithm further. In this paper, we present LambdaLoss, a probabilistic framework for ranking metric optimization. We show that LambdaRank is a special configuration with a well-defined loss in the LambdaLoss framework, and thus provide theoretical justification for it. More importantly, the LambdaLoss framework allows us to define metric-driven loss functions that have clear connection to different ranking metrics. We show a few cases in this paper and evaluate them on three publicly available data sets. Experimental results show that our metric-driven loss functions can significantly improve the state-of-the-art learning-to-rank algorithms. View details
    Learning with Sparse and Biased Feedback for Personal Search
    Proceedings of the 27th International Joint Conference on Artificial Intelligence (IJCAI) (2018), pp. 5219-5223
    Preview abstract Personal search, including email, on-device, and personal media search, has recently attracted a considerable attention from the information retrieval community. In this paper, we provide an overview of challenges and opportunities of learning with implicit user feedback (e.g., click data) in personal search. Implicit user feedback provides a convenient source of supervision for ranking models in personal search. This feedback, however, has two major drawbacks: it is highly sparse and biased due to the personal nature of queries and documents. We demonstrate how these drawbacks can be overcome, and empirically demonstrate the benefits of learning with implicit feedback in the context of a large-scale email search engine. View details
    Preview abstract TensorFlow Ranking is the first open source library for solving large-scale ranking problems in a deep learning framework. It is highly configurable and provides easy-to-use APIs to support different scoring mechanisms, loss functions and evaluation metrics in the learning-to-rank setting. Our library is developed on top of TensorFlow and can thus fully leverage the advantages of this platform. For example, it is highly scalable, both in training and in inference, and can be used to learn ranking models over massive amounts of user activity data. We empirically demonstrate the effectiveness of our library in learning ranking functions for large-scale search and recommendation applications in Gmail and Google Drive. View details
    Preview abstract Ranking functions are used to return ranked lists of items for users to interact. How to evaluate ranking functions using historical user interaction logs, as known as off-policy evaluation, is an important but challenging problem. The commonly used Inverse Propensity Scores (IPS) approaches works better for the single item case, but suffer from extremely low data efficiency for the ranked list case. In this paper, we study how to improve the data efficiency of IPS approaches in the offline comparison setting. We propose two approaches Trunc-match and Rand-interleaving for offline comparison using uniformly randomized data. We show that these methods can improve the data efficiency and also the comparison sensitivity based on one of the largest email search engines. View details
    Semantic Location in Email Query Suggestion
    John Foley
    Proceedings of the 41st International ACM SIGIR Conference on Research & Development in Information Retrieval (2018), pp. 977-980
    Preview abstract Mobile devices are pervasive, which means that users have access to web content and their personal documents at all locations, not just their home or office. Existing work has studied how locations can influence information needs, focusing on web queries. We explore whether or not location information can be helpful to users who are searching their own personal documents. We wish to study whether a users’ location can predict their queries over their own personal data, so we focus on the task of query suggestion. While we find that using location directly can be helpful, it does not generalize well to novel locations. To improve this situation, we explore using semantic location: that is, rather than memorizing location-query associations, we generalize our location information to names of the closest point of interest. By using short, semantic descriptions of locations, we find that we can more robustly improve query completion and observe that users are already using locations to extend their own queries in this domain. We present a simple but effective model that can use location to predict queries for a user even before they type anything into a search box, and which learns effectively even when not all queries have location information. View details
    Multi-Task Learning for Personal Search Ranking with Query Clustering
    Jiaming Shen
    Proceedings of ACM Conference on Information and Knowledge Management (CIKM) (2018)
    Preview abstract User needs vary significantly across different tasks, and therefore their queries will also vary significantly in their expressiveness and semantics. Many studies have been proposed to model such query diversity by obtaining query types and building query-dependent ranking models. To obtain query types, these studies typically require either a labeled query dataset or clicks from multiple users aggregated over the same document. These techniques, however, are not applicable when manual query labeling is not viable, and aggregated clicks are unavailable due to the private nature of the document collection, e.g., in personal search scenarios. Therefore, in this paper, we study the problem of how to obtain query type in an unsupervised fashion and how to leverage this information using query-dependent ranking models in personal search. We first develop a hierarchical clustering algorithm based on truncated SVD and varimax rotation to obtain coarse-to-fine query types. Then, we propose three query-dependent ranking models, including two neural models that leverage query type information as additional features, and one novel multi-task neural model that is trained to simultaneously rank documents and predict query types. We evaluate our ranking models using the click data collected from one of the world’s largest personal search engines. The experiments demonstrate that the proposed multi-task model can significantly outperform the baseline neural models, which either do not incorporate query type information or just simply feed query type as an additional feature. To the best of our knowledge, this is the first successful application of query-dependent multi-task learning in personal search ranking. View details
    Preview abstract Modern search engines leverage a variety of sources, beyond the traditional query-document content similarity, to improve their ranking performance. Among them, query context has attracted attention in prior work. Previously, query context was mainly modeled by user search history, either long-term or short-term, to help the ranking of future queries. In this paper, we focus on situational context, i.e., the contextual features of the current search request that are independent from both query content and user history. As an example, situational context can depend on search request time and location. We propose two context-aware ranking models based on neural networks. The first model learns a low-dimensional deep representation from the combination of contextual features. The second model extends the first model by leveraging binarized contextual features in addition to the high-level abstractions learned from a deep network. The existing context-aware ranking models are mainly based on search history, especially click data that can be gathered from the search engine logs. Although context-aware models have been widely explored in web search, their influence on search scenarios where click data is highly sparse is relatively unstudied. The focus of this paper, personal search (e.g., email search or on-device search) is one of such scenarios. We evaluate our models using the click data collected from one of the world's largest personal search engines. The experiments demonstrate that the proposed models significantly outperform the baselines which do not take context into account. These results indicate the importance of situational context for personal search, and open up a venue for further exploration of situational context in other search scenarios. View details
    Related Event Discovery
    Sujith Ravi
    Vijay Garg
    Proceedings of WSDM (2017)
    Preview abstract In this paper, we consider the problem of discovering local events on the web, where events are entities extracted from pages with Schema.org annotations. Examples of such local events include small venue concerts, farmers markets, sports activities, etc. Given an event entity, we propose a graph-based framework for retrieving a ranked list of related events that a user is likely to be interested in or to attend. We demonstrate that this framework can be easily extended to the keyword search scenario as well, where the user issues a query to a search engine, hoping to find relevant events to attend. Due to the difficulty of obtaining ground-truth labels for event entities, which are temporal and are constrained by location, our general retrieval framework is unsupervised, and its graph-based formulation addresses (a) the challenge of feature sparseness and noisiness, and (b) semantic mismatch problem in a self-contained and principled manner. To validate our methods, we collect human annotations and conduct a comprehensive empirical study, analyzing the performance of our methods with regard to relevance, recall, and diversity. This study shows that our graph-based framework is significantly better than any individual feature for both entity and keyword search scenarios, and can be further improved with minimal supervision. Finally, we demonstrate that our framework can be useful in understanding the temporal and the localized nature of the events on the web. View details
    Learning from User Interactions in Personal Search via Attribute Parameterization
    Proceedings of the 10th ACM International Conference on Web Search and Data Mining (WSDM), ACM (2017), pp. 791-800
    Preview abstract User interaction data (e.g., click data) has proven to be a powerful signal for learning-to-rank models in web search. However, such models require observing multiple interactions across many users for the same query-document pair to achieve statistically meaningful gains. Therefore, utilizing user interaction data for improving search over personal, rather than public, content is a challenging problem. First, the documents (e.g., emails or private files) are not shared across users. Second, user search queries are of personal nature (e.g., [alice's address]) and may not generalize well across users. In this paper, we propose a solution to these challenges, by projecting user queries and documents into a multi-dimensional space of fine-grained and semantically coherent attributes. We then introduce a novel parameterization technique to overcome sparsity in the multi-dimensional attribute space. Attribute parameterization enables effective usage of cross-user interactions for improving personal search quality -- which is a first such published result, to the best of our knowledge. Experiments with a dataset derived from interactions of users of one of the worlds' largest personal search engines demonstrate the effectiveness of the proposed attribute parameterization technique. View details
    Learning to Rank with Selection Bias in Personal Search
    Proc. of the 39th International ACM SIGIR Conference on Research and Development in Information Retrieval, ACM (2016), pp. 115-124
    Preview abstract Click-through data has proven to be a critical resource for improving search ranking quality. Though a large amount of click data can be easily collected by search engines, various biases make it difficult to fully leverage this type of data. In the past, many click models have been proposed and successfully used to estimate the relevance for individual query-document pairs in the context of web search. These click models typically require a large quantity of clicks for each individual pair and this makes them difficult to apply in systems where click data is highly sparse due to personalized corpora and information needs, e.g., personal search. In this paper, we study the problem of how to leverage sparse click data in personal search and introduce a novel selection bias problem and address it in the learning-to-rank framework. This paper proposes a few bias estimation methods, including a novel query-dependent one that captures queries with similar results and can successfully deal with sparse data. We empirically demonstrate that learning-to-rank that accounts for query-dependent selection bias yields significant improvements in search effectiveness through online experiments with one of the world's largest personal search engines. View details
    Semantic Video Trailers
    Harrie Oosterhuis
    Sujith Ravi
    ICML 2016 Workshop on Multi-View Representation Learning
    Preview abstract Query-based video summarization is the task of creating a brief visual trailer, which captures the parts of the video (or a collection of videos) that are most relevant to the user-issued query. In this paper, we propose an unsupervised label propagation approach for this task. Our approach effectively captures the multimodal semantics of queries and videos using state-of-the-art deep neural networks and creates a summary that is both semantically coherent and visually attractive. We describe the theoretical framework of our graph-based approach and empirically evaluate its effectiveness in creating relevant and attractive trailers. Finally, we showcase example video trailers generated by our system. View details
    Hierarchical Label Propagation and Discovery for Machine Generated Email
    Lluis Garcia-Pueyo
    Vanja Josifovski
    Ivo Krka
    Amitabh Saikia
    Sujith Ravi
    Proceedings of the International Conference on Web Search and Data Mining (WSDM), ACM (2016), pp. 317-326
    Preview abstract Machine-generated documents such as email or dynamic web pages are single instantiations of a pre-defined structural template. As such, they can be viewed as a hierarchy of template and document specific content. This hierarchical template representation has several important advantages for document clustering and classification. First, templates capture common topics among the documents, while filtering out the potentially noisy variabilities such as personal information. Second, template representations scale far better than document representations since a single template captures numerous documents. Finally, since templates group together structurally similar documents, they can propagate properties between all the documents that match the template. In this paper, we use these advantages for document classification by formulating an efficient and effective hierarchical label propagation and discovery algorithm. The labels are propagated first over a template graph (constructed based on either term-based or topic-based similarities), and then to the matching documents. We evaluate the performance of the proposed algorithm using a large donated email corpus and show that the resulting template graph is significantly more compact than the corresponding document graph and the hierarchical label propagation is both efficient and effective in increasing the coverage of the baseline document classification algorithm. We demonstrate that the template label propagation achieves more than 91% precision and 93% recall, while increasing the label coverage by more than 11%. View details
    Preview abstract The goal of this work is extraction and retrieval of local events from web pages. Examples of local events include small venue concerts, theater performances, garage sales, movie screenings, etc. We collect these events in the form of retrievable calendar entries that include structured information about event name, date, time and location. Between existing information extraction techniques and the availability of information on social media and semantic web technologies, there are numerous ways to collect commercial, high-profile events. However, most extraction techniques require domain-level supervision, which is not attainable at web scale. Similarly, while the adoption of the semantic web has grown, there will always be organizations without the resources or the expertise to add machine-readable annotations to their pages. Therefore, our approach bootstraps these explicit annotations to massively scale up local event extraction. We propose a novel event extraction model that uses distant supervision to assign scores to individual event fields (event name, date, time and location) and a structural algorithm to optimally group these fields into event records. Our model integrates information from both the entire source document and its relevant sub-regions, and is highly scalable. We evaluate our extraction model on all 700 million documents in a large publicly available web corpus, ClueWeb12. Using the 217,000 unique explicitly annotated events as distant supervision, we are able to double recall with 85% precision and quadruple it with 65% precision, with no additional human supervision. We also show that our model can be bootstrapped for a fully supervised approach, which can further improve the precision by 30%. In addition, we evaluate the geographic coverage of the extracted events. We find that there is a significant increase in the geo-diversity of extracted events compared to existing explicit annotations, while maintaining high precision levels View details
    Up Next: Retrieval Methods for Large Scale Related Video Suggestion
    Lluis Garcia Pueyo
    Vanja Josifovski
    Dima Lepikhin
    Proceedings of KDD 2014, New York, NY, USA, pp. 1769-1778
    Preview abstract The explosive growth in sharing and consumption of the video content on the web creates a unique opportunity for scientific advances in video retrieval, recommendation and discovery. In this paper, we focus on the task of video suggestion, commonly found in many online applications. The current state-of-the-art video suggestion techniques are based on the collaborative filtering analysis, and suggest videos that are likely to be co-viewed with the watched video. In this paper, we propose augmenting the collaborative filtering analysis with the topical representation of the video content to suggest related videos. We propose two novel methods for topical video representation. The fi rst method uses information retrieval heuristics such as tf-idf, while the second method learns the optimal topical representations based on the implicit user feedback available in the online scenario. We conduct a large scale live experiment on YouTube traffic, and demonstrate that augmenting collaborative filtering with topical representations significantly improves the quality of the related video suggestions in a live setting, especially for categories with fresh and topically-rich video content such as news videos. In addition, we show that employing user feedback for learning the optimal topical video representations can increase the user engagement by more than 80% over the standard information retrieval representation, when compared to the collaborative filtering baseline. View details
    Effective query formulation with multiple information sources
    W. Bruce Croft
    WSDM (2012), pp. 443-452
    Parameterized concept weighting in verbose queries
    W. Bruce Croft
    SIGIR (2011), pp. 605-614
    The anatomy of an ad: structured indexing and retrieval for sponsored search
    Evgeniy Gabrilovich
    Vanja Josifovski
    WWW (2010), pp. 101-110
    Learning concept importance using a weighted dependence model
    W. Bruce Croft
    WSDM (2010), pp. 31-40