Chris Welty

Dr. Chris Welty is a Sr. Research Scientist at Google in New York. His main area of interest is the interaction between structured knowledge (e.g. knowledge graphs such as Freebase), unstructured knowledge (e.g. natural language text), and human knowledge (e.g. crowdsourcing). His latest work focuses on understanding the continuous nature of truth in the presence of a diversity of perspectives, and he has been working with the Google Maps team to better understand user contributions that often disagree. He is most active in the Crowdsourcing and Human Computation community, as well as The Web Conference, AKBC, Information and Knowledge Management, and AAAI.

His first project at Google launched as Explore in Google Docs; he then worked on improving the quality and expanding the coverage of price-level labels on Google Maps using user signals. Before Google, Dr. Welty was a member of the technical leadership team for IBM's Watson, the question-answering computer that defeated the all-time best Jeopardy! champions in a widely televised contest. He appeared on the broadcast discussing the technology behind Watson, as well as in many articles in the popular and scientific press. His proudest moment was being interviewed for StarTrek.com about the project. He is a recipient of the AAAI Feigenbaum Prize for his work.

Welty has played a seminal role in the development of the Semantic Web and ontologies, and co-developed OntoClean, the first formal methodology for evaluating ontologies. He is on the editorial boards of AI Magazine, the Journal of Applied Ontology, the Journal of Web Semantics, and the Semantic Web Journal. He currently edits the AI Magazine column "AI Bookies," which aims to foster scientific bets on the progress of AI. He published many papers before those shown below; see his Google Scholar entry.

Authored Publications
    Abstract: We tackle the problem of providing accurate, rigorous p-values for comparisons between the results of two evaluated systems whose evaluations are based on a crowdsourced “gold” reference standard. While this problem has been studied before, we argue that the null hypotheses used in previous work have been based on a common fallacy of equality of probabilities, as opposed to the standard null hypothesis that two sets are drawn from the same distribution. We propose using the standard null hypothesis, that two systems’ responses are drawn from the same distribution, and introduce a simulation-based framework for determining the true p-value for this null hypothesis. We explore how to estimate the true p-value from a single test set under different metrics, tests, and sampling methods, and call particular attention to the role of response variance, which exists in crowdsourced annotations as a product of genuine disagreement, in system predictions as a product of stochastic training regimes, and in generative models as an expected property of the outputs. We find that response variance is a powerful tool for estimating p-values, and present results for the metrics, tests, and sampling methods that make the best p-value estimates in a simple machine learning model comparison.
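
A minimal sketch of the simulation idea described above, assuming per-item annotator vote fractions are available as the reference: the observed accuracy difference is compared to a null distribution built by re-drawing each item's reference label from its response distribution and swapping the two systems' predictions item-wise. All names and data shapes are illustrative, not the paper's actual procedure:

import numpy as np

rng = np.random.default_rng(0)

def simulated_p_value(resp_dists, preds_a, preds_b, n_sim=2_000):
    """p-value for the accuracy difference between two systems scored against a
    crowdsourced reference. Each simulation re-draws the per-item reference label
    from the annotator response distribution (so response variance shows up in the
    null) and swaps the systems' predictions item-wise under the null hypothesis
    that both are drawn from the same distribution."""
    preds_a, preds_b = np.asarray(preds_a), np.asarray(preds_b)
    n_items, n_labels = resp_dists.shape
    majority = resp_dists.argmax(axis=1)
    observed = (preds_a == majority).mean() - (preds_b == majority).mean()

    null_diffs = np.empty(n_sim)
    for s in range(n_sim):
        # Sample a reference label per item from the crowd response distribution.
        ref = np.array([rng.choice(n_labels, p=d) for d in resp_dists])
        # Randomly exchange the two systems' predictions per item (the null).
        swap = rng.random(n_items) < 0.5
        pa = np.where(swap, preds_b, preds_a)
        pb = np.where(swap, preds_a, preds_b)
        null_diffs[s] = (pa == ref).mean() - (pb == ref).mean()
    return float((np.abs(null_diffs) >= abs(observed)).mean())

# Toy demo: 200 items, 3 labels, two systems that mostly agree with each other.
dists = rng.dirichlet(np.ones(3), size=200)
sys_a = dists.argmax(axis=1)
sys_b = np.where(rng.random(200) < 0.9, sys_a, rng.integers(0, 3, 200))
print(simulated_p_value(dists, sys_a, sys_b))
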
    Abstract: Search engines including Google are beginning to support local-dining queries such as "At which nearby restaurants can I order the Indonesian salad gado-gado?". Given the low coverage of online menus worldwide, with only 30% of restaurants even having a website, this remains a challenge. Here we leverage the power of the crowd: online users who are willing to answer questions about dish availability at restaurants they have visited. While motivated users are happy to contribute knowledge for free, they are much less likely to respond to "silly" or embarrassing questions (e.g., "Does Pizza Hut serve pizza?" or "Does Mike's Vegan Restaurant serve hamburgers?"). In this paper, we study the problem of Vexation-Aware Active Learning, where judiciously selected questions are targeted towards improving restaurant-dish model prediction, subject to a limit on the percentage of "unsure" answers or "dismissals" (e.g., swiping the app closed) used to measure vexation. We formalize the problem as an integer linear program and solve it efficiently using a distributed solution that scales linearly with the number of candidate questions. Since our algorithm relies on precise estimation of the unsure-dismiss rate (UDR), we give a regression model that provides accurate results compared to baselines including collaborative filtering. Finally, we demonstrate in a live system that our proposed vexation-aware strategy performs competitively against classical (margin-based) active learning approaches while not exceeding UDR bounds.
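
The selection step can be illustrated as a small integer program: choose k questions that maximize expected model improvement while keeping the average predicted unsure-dismiss rate (UDR) of the selected set under a bound. The sketch below uses scipy.optimize.milp with invented per-question value and UDR estimates; it only mirrors the general shape of the optimization described above, not the paper's distributed solver:

import numpy as np
from scipy.optimize import milp, LinearConstraint, Bounds

def select_questions(value, udr, k, max_udr):
    """Choose k candidate questions maximizing total expected value, subject to
    the average predicted UDR of the chosen set staying at or below max_udr.
    value and udr are hypothetical per-question estimates (e.g. from a
    regression model)."""
    n = len(value)
    c = -np.asarray(value, dtype=float)            # milp minimizes, so negate
    constraints = [
        # select exactly k questions
        LinearConstraint(np.ones((1, n)), lb=k, ub=k),
        # sum(udr_i * x_i) <= max_udr * sum(x_i)  <=>  sum((udr_i - max_udr) * x_i) <= 0
        LinearConstraint((np.asarray(udr, dtype=float) - max_udr).reshape(1, -1),
                         lb=-np.inf, ub=0.0),
    ]
    res = milp(c=c, constraints=constraints,
               integrality=np.ones(n), bounds=Bounds(0, 1))
    if not res.success:
        return None                                # infeasible under this UDR bound
    return np.flatnonzero(res.x > 0.5)

# Toy run: six candidate questions with made-up value and UDR estimates.
print(select_questions(value=[0.9, 0.8, 0.7, 0.6, 0.5, 0.4],
                       udr=[0.50, 0.10, 0.05, 0.40, 0.02, 0.01],
                       k=3, max_udr=0.15))
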
    Annotator Response Distributions as a Sampling Frame
    Christopher Homan
    LREC Workshop on Perspectivist NLP (2022)
    Abstract: Annotator disagreement is often dismissed as noise or the result of poor annotation process quality. Others have argued that it can be meaningful. But lacking a rigorous statistical foundation, the analysis of disagreement patterns can resemble a high-tech form of tea-leaf reading. We contribute a framework for analyzing the variation of per-item annotator response distributions in data for humans-in-the-loop machine learning. We provide visualizations for, and use the framework to analyze the variance in, a crowdsourced dataset of hard-to-classify examples from the OpenImages archive.
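
As a rough illustration of taking per-item response distributions (rather than single aggregated labels) as the unit of analysis, the sketch below computes each item's distribution over labels and a simple disagreement statistic; the items, labels, and votes are made up:

import numpy as np
from collections import Counter

def response_distribution(responses, labels):
    """Turn one item's raw annotator responses into a probability vector."""
    counts = Counter(responses)
    dist = np.array([counts.get(label, 0) for label in labels], dtype=float)
    return dist / dist.sum()

def normalized_entropy(dist):
    """0 = perfect agreement, 1 = responses spread uniformly over the labels."""
    p = dist[dist > 0]
    return float(-(p * np.log(p)).sum() / np.log(len(dist)))

# Hypothetical annotations: item id -> responses from five annotators.
annotations = {
    "img_001": ["cat", "cat", "cat", "cat", "lynx"],
    "img_002": ["cat", "lynx", "lynx", "cat", "dog"],
}
labels = ["cat", "lynx", "dog"]

for item, responses in annotations.items():
    dist = response_distribution(responses, labels)
    print(item, dist.round(2), "disagreement:", round(normalized_entropy(dist), 2))
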
    Abstract: Successful knowledge graphs (KGs) solved the historical knowledge acquisition bottleneck by supplanting an expert focus with a simple, crowd-friendly one: KG nodes represent popular people, places, organizations, etc., and the graph arcs represent common-sense relations like affiliations, locations, etc. Techniques for more general, categorical KG curation do not seem to have made the same transition: the KG research community is still largely focused on logic-based methods that belie the common-sense characteristics of successful KGs. In this paper, we propose a simple yet novel approach to acquiring class-level attributes from the crowd that represent broad common-sense associations between categories, and can be used with the classic knowledge-base default & override technique (e.g. Reiter, 1978) to address the early label sparsity problem faced by machine learning systems for problems that lack training data. We demonstrate the effectiveness of our acquisition and reasoning approach on a pair of very real industrial-scale problems: how to augment an existing KG of places and offerings (e.g. stores and products, restaurants and dishes) with associations between them indicating the availability of the offerings at those places, which would enable the KG to provide answers to questions like, "Where can I buy milk nearby?" This problem has several practical challenges, but for this paper we focus mostly on label sparsity. Less than 30% of physical places worldwide (i.e. brick & mortar stores and restaurants) have a website, and less than half of those list their product catalog or menus, leaving a large acquisition gap to be filled by methods other than information extraction (IE). Label sparsity is a general problem, not specific to these use cases, that prevents modern AI and machine learning techniques from applying to many applications for which labeled data is not readily available. As a result, the study of how to acquire the knowledge and data needed for AI to work is as much a problem today as it was in the 1970s and 80s during the advent of expert systems (e.g. MYCIN, 1975). The class-level attributes approach presented here is based on a KG-inspired intuition that a lot of the knowledge people need to understand where to go to buy a product they need, or where to find the dishes they want to eat, is categorical and part of their general common sense: everyone knows grocery stores sell milk and don't sell asphalt, Chinese restaurants serve fried rice and not hamburgers, etc. We acquired a mixture of instance- and class-level pairs (e.g. ⟨Ajay Mittal Dairy, milk⟩ and ⟨GroceryStore, milk⟩, respectively) from a novel 3-tier crowdsourcing method, and demonstrate the scalability advantages of the class-level approach. Our results show that crowdsourced class-level knowledge can provide rapid scaling of knowledge acquisition in shopping and dining domains. The acquired common-sense knowledge also has long-term value in the KG. The approach was a critical part of enabling a worldwide local search capability on Google Maps, with which users can find products and dishes that are available in most places on earth.
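
A toy rendering of the default & override pattern mentioned above: class-level associations acquired from the crowd act as defaults, and instance-level facts override them when present. All data below is illustrative:

# Class-level defaults acquired from the crowd: category -> offering -> available?
class_defaults = {
    "GroceryStore": {"milk": True, "asphalt": False},
    "ChineseRestaurant": {"fried rice": True, "hamburger": False},
}

# Instance-level facts, which override the class-level defaults when present.
instance_facts = {
    ("Ajay Mittal Dairy", "milk"): True,
    ("Corner Grocery", "milk"): False,   # e.g. confirmed not in this store's assortment
}

place_category = {
    "Ajay Mittal Dairy": "GroceryStore",
    "Corner Grocery": "GroceryStore",
    "Golden Wok": "ChineseRestaurant",
}

def offers(place, offering):
    """Instance knowledge wins; otherwise fall back to the class-level default."""
    if (place, offering) in instance_facts:
        return instance_facts[(place, offering)]
    return class_defaults.get(place_category.get(place), {}).get(offering)

print(offers("Corner Grocery", "milk"))    # instance override -> False
print(offers("Golden Wok", "fried rice"))  # class-level default -> True
print(offers("Golden Wok", "sushi"))       # unknown -> None
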
    AI Bookie: Betting on Bets
    Kurt Bollacker
    Praveen Kumar Paritosh
    AI Magazine, vol. 42(3), Fall 2021
    Abstract: The AI Bookies have spent a lot of time and energy collecting bets from AI researchers, and have met with universal approval of the idea of scientific betting but nearly universal silence in the acquisition of bets. We have collected a few in this column over the past two years: in the first column we published the "will voice interfaces become the standard" bet, as well as a set of 10 predictions from Eric Horvitz that we proposed as bets awaiting challengers. No challengers have emerged. In this article we review the methods we've used to collect bets and conclude that people need ideas for bets to make. We propose five new bets and solicit participants for them.
    Empirical methodology for crowdsourcing ground truth
    Anca Dumitrache
    Benjamin Timmermans
    Oana Inel
    Semantic Web Journal, vol. 12:3 (2021)
    Abstract: The process of gathering ground truth data through human annotation is a major bottleneck in the use of information extraction methods for populating the Semantic Web. Crowdsourcing-based approaches are gaining popularity in the attempt to solve the issues related to volume of data and lack of annotators. Typically these practices use inter-annotator agreement as a measure of quality. However, in many domains, such as event detection, there is ambiguity in the data, as well as a multitude of perspectives on the information examples. We present an empirically derived methodology for efficiently gathering ground truth data in a diverse set of use cases covering a variety of domains and annotation tasks. Central to our approach is the use of CrowdTruth metrics that capture inter-annotator disagreement. We show that measuring disagreement is essential for acquiring a high-quality ground truth. We achieve this by comparing the quality of the data aggregated with CrowdTruth metrics against majority vote, over a set of diverse crowdsourcing tasks: Medical Relation Extraction, Twitter Event Identification, News Event Extraction and Sound Interpretation. We also show that an increased number of crowd workers leads to growth and stabilization in the quality of annotations, going against the usual practice of employing a small number of annotators.
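
For contrast with majority vote, here is a heavily simplified disagreement-aware aggregation in the spirit of CrowdTruth (not the actual CrowdTruth metrics): workers are weighted by their agreement with the rest of the crowd, and each item receives graded label scores rather than a single hard label. The votes are invented:

import numpy as np

# Hypothetical votes: votes[item, worker] = label index (3 possible labels).
votes = np.array([
    [0, 0, 1, 0],
    [1, 1, 1, 2],
    [0, 2, 2, 2],
])
n_items, n_workers = votes.shape
n_labels = 3

def majority_vote(v):
    """Collapse each item to its single most frequent label."""
    return np.array([np.bincount(row, minlength=n_labels).argmax() for row in v])

# Disagreement-aware alternative: weight each worker by cosine agreement with
# the rest of the crowd, then keep graded per-label scores for each item.
one_hot = np.eye(n_labels)[votes]                      # (items, workers, labels)
quality = np.empty(n_workers)
for w in range(n_workers):
    others = one_hot[:, np.arange(n_workers) != w, :].mean(axis=1)
    cos = (one_hot[:, w, :] * others).sum(axis=1) / (np.linalg.norm(others, axis=1) + 1e-9)
    quality[w] = cos.mean()                            # worker-crowd agreement
scores = (one_hot * quality[None, :, None]).sum(axis=1)
scores /= scores.sum(axis=1, keepdims=True)            # graded label scores per item

print("majority vote:", majority_vote(votes))
print("graded scores:\n", scores.round(2))
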
    Abstract: Successful knowledge graphs (KGs) solved the historical knowledge acquisition bottleneck by supplanting an expert focus with a simple, crowd-friendly one: KG nodes represent popular people, places, organizations, etc., and the graph arcs represent common-sense relations like affiliations, locations, etc. Techniques for more general, categorical KG curation do not seem to have made the same transition: the KG research community is still largely focused on methods that belie the common-sense characteristics of successful KGs. In this paper, we propose a simple approach to acquiring and reasoning with class-level attributes from the crowd that represent broad common-sense associations between categories. We pick a very real industrial-scale data set and problem: how to augment an existing knowledge graph of places and products with associations between them indicating the availability of the products at those places, which would enable a KG to provide answers to questions like, "Where can I buy milk nearby?" This problem has several practical challenges, not least of which is that only 30% of physical stores (i.e. brick & mortar stores) have a website, and fewer list their product inventory, leaving a large acquisition gap to be filled by methods other than information extraction (IE). Based on a KG-inspired intuition that a lot of the class-level pairs are part of people's general common sense, e.g. everyone knows grocery stores sell milk and don't sell asphalt, we acquired a mixture of instance- and class-level pairs (e.g. ⟨Ajay Mittal Dairy, milk⟩ and ⟨GroceryStore, milk⟩, respectively) from a novel 3-tier crowdsourcing method, and demonstrate the scalability advantages of the class-level approach. Our results show that crowdsourced class-level knowledge can provide rapid scaling of knowledge acquisition in this and similar domains, as well as long-term value in the KG.
    Embedding Semantic Taxonomies
    Alyssa Whitlock Lees
    Jacek Korycki
    Sara Mc Carthy
    COLING 2020
    Abstract: A common step in developing an understanding of a vertical domain, e.g. shopping, dining, movies, medicine, etc., is curating a taxonomy of categories specific to the domain. These human-created artifacts have been the subject of research in embeddings that attempt to encode aspects of the partial-ordering property of taxonomies. We compare Box Embeddings, a natural containment representation of category taxonomies, to partial-order embeddings and a baseline Bayes Net, in the context of representing the Medical Subject Headings (MeSH) taxonomy given a set of 300K PubMed articles with subject labels from MeSH. We deeply explore the experimental properties of training box embeddings, including preparation of the training data, sampling ratios and class balance, and initialization strategies, and propose a fix to the original box objective. We then present first results in using these techniques for representing a bipartite learning problem (i.e. collaborative filtering) in the presence of taxonomic relations within each partition, inferring disease (anatomical) locations from their use as subject labels in journal articles. Our box model substantially outperforms all baselines for taxonomic reconstruction and bipartite relationship experiments. This performance improvement is observed both in overall accuracy and in the weighted spread by true taxonomic depth.
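
The scoring idea behind box embeddings can be shown in a few lines: each category is an axis-aligned box, and containment of category A in category B is scored by the overlap volume of A and B divided by the volume of A. The boxes below are invented; in training, smoothed overlaps are typically used so that disjoint boxes still receive gradient, but the hard version suffices to illustrate the representation:

import numpy as np

def box_volume(lo, hi):
    """Volume of an axis-aligned box given its min and max corners."""
    return float(np.prod(np.maximum(hi - lo, 0.0)))

def containment_score(lo_a, hi_a, lo_b, hi_b):
    """Score for 'A is contained in B': vol(A ∩ B) / vol(A), close to 1 when
    box A sits (almost) entirely inside box B."""
    inter_lo = np.maximum(lo_a, lo_b)
    inter_hi = np.minimum(hi_a, hi_b)
    return box_volume(inter_lo, inter_hi) / max(box_volume(lo_a, hi_a), 1e-12)

# Hypothetical 2-d boxes: the broader category should contain the narrower one.
carcinoma = (np.array([0.2, 0.2]), np.array([0.4, 0.4]))
neoplasms = (np.array([0.1, 0.1]), np.array([0.6, 0.6]))

print(containment_score(*carcinoma, *neoplasms))  # ~1.0: Carcinoma inside Neoplasms
print(containment_score(*neoplasms, *carcinoma))  # much lower: the reverse fails
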
    Abstract: We present a resource for the task of FrameNet semantic frame disambiguation over 5,000 word-sentence pairs from the Wikipedia corpus. The annotations were collected using a novel crowdsourcing approach with multiple workers per sentence to capture inter-annotator disagreement. In contrast to the typical approach of attributing the best single frame to each word, we provide a list of frames with disagreement-based scores that express the confidence with which each frame applies to the word. This is based on the idea that inter-annotator disagreement is at least partly caused by ambiguity that is inherent to the text and frames. We have found many examples where the semantics of individual frames overlap sufficiently to make them acceptable alternatives for interpreting a sentence. We have argued that ignoring this ambiguity creates an overly arbitrary target for training and evaluating natural language processing systems - if humans cannot agree, why would we expect the correct answer from a machine to be any different? To process this data we also utilized an expanded lemma-set provided by the Framester system, which merges FN with WordNet to enhance coverage. Our dataset includes annotations of 1,000 sentence-word pairs whose lemmas are not part of FN. Finally we present metrics for evaluating frame disambiguation systems that account for ambiguity.
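
One way to picture the ambiguity-aware evaluation mentioned at the end of the abstract: a system's prediction earns credit proportional to the crowd confidence of the frame it chose, rather than being scored 1/0 against a single forced answer. This is an illustrative reading with invented scores, not the dataset's official metric:

# Hypothetical disagreement-based scores for one word-sentence pair: the
# fraction of annotators who accepted each frame for the word "run".
frame_scores = {"Self_motion": 0.6, "Operating_a_system": 0.3, "Fluidic_motion": 0.1}

def ambiguity_aware_credit(predicted_frame, frame_scores):
    """Credit a prediction by the crowd confidence of the chosen frame,
    normalized by the best achievable score for the item, instead of
    scoring it 1/0 against a single forced 'correct' frame."""
    best = max(frame_scores.values())
    return frame_scores.get(predicted_frame, 0.0) / best

print(ambiguity_aware_credit("Self_motion", frame_scores))         # 1.0
print(ambiguity_aware_credit("Operating_a_system", frame_scores))  # 0.5
print(ambiguity_aware_credit("Being_in_motion", frame_scores))     # 0.0
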
    Pareto-Efficient Fairness for Skewed Subgroup Data
    Alyssa Whitlock Lees
    Ananth Balashankar
    Lakshminarayanan Subramanian
    AISG Workshop @ ICML (2019)
    Abstract: As awareness of the potential for learned models to amplify existing societal biases increases, the field of ML fairness has developed mitigation techniques. A prevalent method applies constraints, including equality of performance, with respect to subgroups defined over the intersection of sensitive attributes such as race and gender. Enforcing such constraints when the subgroup populations are considerably skewed with respect to a target can lead to unintentional degradation in performance, without benefiting any individual subgroup, counter to the United Nations Sustainable Development Goals of reducing inequalities and promoting growth. In order to avoid such performance degradation while ensuring equitable treatment of all groups, we propose Pareto-Efficient Fairness (PEF), which identifies the operating point on the Pareto curve of subgroup performances closest to the fairness hyperplane. Specifically, PEF finds a Pareto-optimal point which maximizes multiple subgroup accuracy measures. The algorithm scalarizes using the adaptive weighted metric norm by iteratively searching the Pareto region of all models enforcing the fairness constraint. PEF is backed by strong theoretical results on discoverability and provides domain practitioners finer control in navigating both convex and non-convex accuracy-fairness trade-offs. Empirically, we show that PEF increases performance of all subgroups in skewed synthetic data and UCI datasets.
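
A toy version of the PEF selection rule: among candidate models, each summarized by its vector of subgroup accuracies, keep the Pareto-optimal ones and choose the one closest to the hyperplane where all subgroup accuracies are equal. The candidate accuracies are invented, and the real algorithm searches the Pareto region rather than enumerating a fixed candidate list:

import numpy as np

def pareto_mask(points):
    """True for candidates not dominated by any other candidate (higher is better)."""
    n = len(points)
    keep = np.ones(n, dtype=bool)
    for i in range(n):
        for j in range(n):
            if i != j and np.all(points[j] >= points[i]) and np.any(points[j] > points[i]):
                keep[i] = False
                break
    return keep

def pareto_efficient_fair_choice(subgroup_accs):
    """Pick the Pareto-optimal candidate closest to the 'fairness hyperplane'
    where all subgroup accuracies are equal."""
    accs = np.asarray(subgroup_accs, dtype=float)
    frontier = np.flatnonzero(pareto_mask(accs))
    # Distance to the equal-accuracy line is the distance from each candidate to
    # its projection onto that line, i.e. to its own per-candidate mean vector.
    dists = np.linalg.norm(accs[frontier] - accs[frontier].mean(axis=1, keepdims=True), axis=1)
    return int(frontier[dists.argmin()])

# Candidate models as (accuracy on subgroup A, accuracy on subgroup B) pairs.
candidates = [(0.92, 0.70), (0.88, 0.82), (0.85, 0.84), (0.80, 0.80), (0.70, 0.69)]
print(pareto_efficient_fair_choice(candidates))  # index of the chosen model
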
    A Metrological Framework for Evaluating Crowd-powered Instruments
    Praveen Kumar Paritosh
    HCOMP-2019: AAAI Conference on Human Computation
    Abstract: In this paper we present the first steps towards hardening the science of measuring AI systems, by adopting metrology, the science of measurement and its application, and applying it to human (crowd) powered evaluations. We begin with the intuitive observation that evaluating the performance of an AI system is a form of measurement. In all other science and engineering disciplines, the devices used to measure are called instruments, and all measurements are recorded with respect to the characteristics of the instruments used. One does not report mass, speed, or length, for example, of a studied object without disclosing the precision (measurement variance) and resolution (smallest detectable change) of the instrument used. It is extremely common in the AI literature to compare the performance of two systems by using a crowd-sourced dataset as an instrument, but failing to report if the performance difference lies within the capability of that instrument to measure. To further the discussion, we focus on a single crowd-sourced dataset, the so-called WS-353, a venerable and often used gold standard for word similarity, and propose a set of metrological characteristics for it as an instrument. We then analyze several previously published experiments that use the WS-353 instrument, and show that, in the light of these proposed characteristics, the differences in performance of these systems cannot be measured with this instrument.
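
The practical takeaway can be sketched directly: before claiming one system beats another on a crowd-sourced gold standard, estimate the instrument's precision (measurement variance) by resampling the dataset and check whether the observed difference exceeds it. The bootstrap below, with synthetic scores, is only an illustration of that check, not the paper's characterization of WS-353:

import numpy as np

rng = np.random.default_rng(0)

def measurement_sd(scores, n_boot=10_000):
    """Bootstrap standard deviation of the dataset-level measurement: a rough
    stand-in for the instrument's precision (measurement variance)."""
    idx = rng.integers(0, len(scores), size=(n_boot, len(scores)))
    return scores[idx].mean(axis=1).std()

# Hypothetical per-item scores for two systems on the same 353-item
# crowd-sourced gold standard.
system_a = rng.normal(0.74, 0.20, size=353).clip(0, 1)
system_b = rng.normal(0.75, 0.20, size=353).clip(0, 1)

precision = max(measurement_sd(system_a), measurement_sd(system_b))
diff = system_b.mean() - system_a.mean()
verdict = "resolvable" if abs(diff) > 2 * precision else "not resolvable"
print(f"difference {diff:.4f} vs instrument precision {precision:.4f}: {verdict}")
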
    Discovering User Bias in Ordinal Voting Systems
    Alyssa Whitlock Lees
    SAD-2019: Workshop on Subjectivity, Ambiguity and Disagreement
    Abstract: Crowdsourcing systems increasingly rely on users to provide more subjective ground truth for intelligent systems - e.g. ratings, aspects of quality, and perspectives on how expensive or lively a place feels. We focus on the ubiquitous implementation of online user ordinal voting (e.g. 1-5, or 1-4 stars) on some aspect of an entity, to extract a relative truth, measured by a selected metric such as vote plurality or mean. We argue that this methodology can aggregate results that yield little information to the end user. In particular, ordinal user rankings often converge to an indistinguishable rating. This is demonstrated by the trend in certain cities for the majority of restaurants to all have a 4-star rating. Similarly, the rating of an establishment can be significantly affected by a few users. User bias in voting is not spam, but rather a preference that can be harnessed to provide more information to users. We explore notions of both global skew and user bias. Leveraging these bias and preference concepts, the paper suggests explicit models for better personalization and more informative ratings.
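
A minimal illustration of separating global skew from per-user bias in ordinal ratings: estimate each user's bias as the average amount by which their votes exceed the consensus on the items they rated, then subtract it before aggregating. Users, places, and ratings below are invented:

from collections import defaultdict

# Hypothetical ordinal votes (1-5 stars): (user, place, rating) triples.
votes = [
    ("u1", "cafe_a", 5), ("u1", "cafe_b", 5), ("u1", "cafe_c", 4),
    ("u2", "cafe_a", 3), ("u2", "cafe_b", 4), ("u2", "cafe_c", 2),
    ("u3", "cafe_a", 4), ("u3", "cafe_b", 4),
]

# Raw per-place means (these tend to bunch together, e.g. "everything is 4 stars").
place_votes = defaultdict(list)
for user, place, r in votes:
    place_votes[place].append(r)
place_mean = {p: sum(rs) / len(rs) for p, rs in place_votes.items()}

# Per-user bias: how far above or below the place consensus the user tends to vote.
user_resid = defaultdict(list)
for user, place, r in votes:
    user_resid[user].append(r - place_mean[place])
user_bias = {u: sum(d) / len(d) for u, d in user_resid.items()}

# Debiased place scores: subtract each user's bias before averaging their votes.
debiased = defaultdict(list)
for user, place, r in votes:
    debiased[place].append(r - user_bias[user])
print("user bias:", {u: round(b, 2) for u, b in user_bias.items()})
print("debiased means:", {p: round(sum(rs) / len(rs), 2) for p, rs in debiased.items()})
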
    What is Fair? Exploring Pareto-Efficiency for Fairness Constraint Classifiers
    Alyssa Whitlock Lees
    Ananth Balashankar
    Lakshminarayanan Subramanian
    arXiv (2019), 10 pp.
    Abstract: The potential for learned models to amplify existing societal biases has been broadly recognized. Fairness-aware classifier constraints, which apply equality metrics of performance across subgroups defined on sensitive attributes such as race and gender, seek to rectify inequity but can yield non-uniform degradation in performance for skewed datasets. In certain domains, imbalanced degradation of performance can yield another form of unintentional bias. In the spirit of constructing fairness-aware algorithms as a societal imperative, we explore an alternative: Pareto-Efficient Fairness (PEF). PEF identifies the operating point on the Pareto curve of subgroup performances closest to the fairness hyperplane, maximizing multiple subgroup accuracies. Empirically we demonstrate that PEF increases performance of all subgroups in several UCI datasets.
    Taxonomy Embeddings on PubMed Article Subject Headings
    Alyssa Whitlock Lees
    Jacek Korycki
    CEUR Workshop Proceedings (SEPDA 2019), http://semantics-powered.org/sepda2019.html#scope (2019)
    Abstract: Machine learning approaches for hierarchical partial orders, such as taxonomies, are of increasing interest in the research community, though practical applications have not yet emerged. The basic intuition of hierarchical embeddings is that some signal from taxonomic knowledge can be harnessed in broader machine learning problems; when we learn similarity of words using word embeddings, the similarity of lion and tiger is indistinguishable from the similarity of lion and animal. The ability to tease apart these two kinds of similarities in a machine learning setting yields improvements in quality as well as enabling the exploitation of the numerous human-curated taxonomies available across domains, while at the same time improving upon known taxonomic organization problems, such as partial or conditional membership. We explore some of the practical problems in learning taxonomies using Bayesian networks, partial-order embeddings, and box lattice embeddings, where box containment represents category containment. Using open data from PubMed articles with human-assigned MeSH labels, we investigate the impact of taxonomic information, negative sampling, instance sampling, and objective functions to improve performance on the taxonomy learning problem. We discovered a particular problem for learning box embeddings for taxonomies, which we call the box crossing problem, and developed strategies to overcome it. Finally we make some initial contributions to using taxonomy embeddings to improve another learning problem: inferring disease (anatomical) locations from their use as subject labels in journal articles. In most experiments, after our improvements to box models, the box models outperformed the simpler Bayes Net approach as well as Order Embeddings.
    Capturing Ambiguity in Crowdsourcing Frame Disambiguation
    Anca Dumitrache
    Lora Aroyo
    HCOMP-2018: AAAI Human Computation Conference
    Abstract: FrameNet is a computational linguistics resource composed of semantic frames, high-level concepts that represent the meanings of words. In this paper, we present an approach to gather frame disambiguation annotations in sentences using a crowdsourcing approach with multiple workers per sentence to capture inter-annotator disagreement. We perform an experiment over a set of 433 sentences annotated with frames from the FrameNet corpus, and show that the aggregated crowd annotations achieve an F1 score greater than 0.67 compared to expert linguists. We highlight cases where the crowd annotation was correct even though the expert is in disagreement, arguing for the need to have multiple annotators per sentence. Most importantly, we examine cases in which crowd workers could not agree, and demonstrate that these cases exhibit ambiguity, either in the sentence, frame, or the task itself, and argue that collapsing such cases to a single, discrete truth value (i.e. correct or incorrect) is inappropriate, creating arbitrary targets for machine learning.
    Abstract: Distant supervision is a popular method for performing relation extraction from text that is known to produce noisy labels. Most progress in relation extraction and classification has been made with crowdsourced corrections to distant-supervised labels, and there is evidence indicating that still more would be better. In this paper, we explore the problem of propagating human annotation signals gathered for open-domain relation classification through the CrowdTruth crowdsourcing method, which captures ambiguity in annotations by measuring inter-annotator disagreement. Our approach propagates annotations to sentences that are similar in a low-dimensional embedding space, expanding the number of labels by two orders of magnitude. Our experiments show significant improvement in a sentence-level multi-class relation classifier.
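
A rough sketch of the propagation step: given sentence embeddings, an unlabeled sentence inherits the similarity-weighted average of the crowd-derived relation scores of its nearest labeled neighbors, provided they are similar enough. The embedding dimensionality, threshold, and data are all made up:

import numpy as np

rng = np.random.default_rng(0)

def propagate_scores(emb_labeled, scores, emb_unlabeled, k=5, min_sim=0.7):
    """Give each unlabeled sentence the similarity-weighted average score of its
    k nearest labeled neighbors (cosine similarity), or None when nothing is
    similar enough to trust."""
    a = emb_labeled / np.linalg.norm(emb_labeled, axis=1, keepdims=True)
    b = emb_unlabeled / np.linalg.norm(emb_unlabeled, axis=1, keepdims=True)
    sims = b @ a.T                              # (n_unlabeled, n_labeled)
    propagated = []
    for row in sims:
        top = np.argsort(row)[-k:]
        top = top[row[top] >= min_sim]          # keep only sufficiently close neighbors
        if len(top) == 0:
            propagated.append(None)
        else:
            w = row[top]
            propagated.append(float((w * scores[top]).sum() / w.sum()))
    return propagated

# Toy data: 100 labeled sentence embeddings with crowd relation scores in [0, 1];
# the unlabeled embeddings are noisy copies of the first 10 labeled ones.
emb_l = rng.normal(size=(100, 64))
crowd_scores = rng.uniform(size=100)
emb_u = emb_l[:10] + 0.1 * rng.normal(size=(10, 64))
print([None if s is None else round(s, 3) for s in propagate_scores(emb_l, crowd_scores, emb_u)])
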
    Crowdsourcing Ground Truth for Medical Relation Extraction
    Anca Dumitrache
    Lora Aroyo
    ACM Transactions on Interactive Intelligent Systems, vol. 8:1 (2018)
    Abstract: Cognitive computing systems require human-labeled data for evaluation, and often for training. The standard practice used in gathering this data minimizes disagreement between annotators, and we have found this results in data that fails to account for the ambiguity inherent in language. We have proposed the CrowdTruth method for collecting ground truth through crowdsourcing, which reconsiders the role of people in machine learning based on the observation that disagreement between annotators provides a useful signal for phenomena such as ambiguity in the text. We report on using this method to build an annotated data set for medical relation extraction for the cause and treat relations, and how this data performed in a supervised training experiment. We demonstrate that by modeling ambiguity, labeled data gathered from crowd workers can (1) reach the level of quality of domain experts for this task while reducing the cost, and (2) provide better training data at scale than distant supervision. We further propose and validate new weighted measures for precision, recall, and F-measure that account for ambiguity in both human and machine performance on this task.
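
The weighted measures can be illustrated compactly: each example carries a crowd confidence in [0, 1] that the relation holds, and precision and recall weight predictions by that confidence instead of hard positives and negatives. This is a simplified reading with invented numbers, not the exact formulas from the paper:

def weighted_prf(pred, confidence):
    """Precision/recall/F1 where the 'gold' is a graded crowd confidence that
    the relation is expressed (1.0 = certain yes, 0.0 = certain no).

    pred       : list of 0/1 system decisions
    confidence : list of crowd confidence scores in [0, 1]
    """
    tp = sum(c for p, c in zip(pred, confidence) if p == 1)
    fp = sum(1 - c for p, c in zip(pred, confidence) if p == 1)
    fn = sum(c for p, c in zip(pred, confidence) if p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# An ambiguous example (confidence 0.5) costs half as much as a clear error.
pred = [1, 1, 0, 0, 1]
confidence = [0.9, 0.5, 0.8, 0.1, 0.95]
print([round(x, 3) for x in weighted_prf(pred, confidence)])
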
    False Positive and Cross-relation Signals in Distant Supervision Data
    Anca Dumitrache
    Lora Aroyo
    NIPS-2017 Workshop - Automatic Knowledge-Base Construction, http://www.akbc.ws/2017/
    Abstract: Distant supervision (DS) is a well-established method for relation extraction from text, based on the assumption that when a knowledge base contains a relation between a term pair, then sentences that contain that pair are likely to express the relation. In this paper, we use the results of a crowdsourcing relation extraction task to identify two problems with DS data quality: the widely varying degree of false positives across different relations, and the observed causal connection between relations that are not considered by the DS method. The crowdsourcing data aggregation is performed using ambiguity-aware CrowdTruth metrics, which are used to capture and interpret inter-annotator disagreement. We also present preliminary results of using the crowd to enhance DS training data for a relation classification model, without requiring the crowd to annotate the entire set.
    Crowdsourcing a Gold Standard for Medical Relation Extraction with CrowdTruth
    Anca Dumitrache
    Lora Aroyo
    Proceedings of the 2016 Collective Intelligence Conference
    Abstract: In this paper, we make the following contributions: (1) a comparison of the quality and efficacy of annotations for medical relation extraction provided by both crowd and medical experts, showing that crowd annotations are equivalent to those of experts, with appropriate processing; (2) an openly available dataset of 900 English sentences for medical relation extraction, centering primarily on the cause relation, that have been processed with disagreement analysis and by experts.