Jump to Content
Shubin Zhao

Shubin Zhao

See my page here

Publications

  • "Corroborate and learn facts from the web", Shubin Zhao, Jonathan Betz, Proceedings of the 13th ACM SIGKDD international conference on Knowledge discovery and data mining, 2007, pp. 995-1003.
    [doi.acm.org] [search]

Authored Publications
Google Publications
Other Publications
Sort By
  • Title
  • Title, descending
  • Year
  • Year, descending
    Preview abstract Successful knowledge graphs (KGs) solved the historical knowledge acquisition bottleneck by supplanting an expert focus with a simple, crowd-friendly one: KG nodes represent popular people, places, organizations, etc., and the graph arcs represent common sense relations like affiliations, locations, etc. Techniques for more general, categorical, KG curation do not seem to have made the same transition: the KG research community is still largely focused on logic-based methods that belie the common-sense characteristics of successful KGs. In this paper, we propose a simple yet novel approach to acquiring \emph{class-level attributes} from the crowd that represent broad common sense associations between categories, and can be used with the classic knowledge-base default \& override technique (e.g. \cite{reiter1978}) to address the early \textit{label sparsity problem} faced by machine learning systems for problems that lack data for training. We demonstrate the effectiveness of our acquisition and reasoning approach on a pair of very real industrial-scale problems: how to augment an existing KG of places and offerings (e.g. stores and products, restaurants and dishes) with associations between them indicating the availability of the offerings at those places, which would enable the KG to provide answers to questions like, ``Where can I buy milk nearby?'' This problem has several practical challenges but for this paper we focus mostly on the label sparsity. Less than 30\% of physical places worldwide (i.e. brick \& mortar stores and restaurants) have a website, and less than half of those list their product catalog or menus, leaving a large acquisition gap to be filled by methods other than information extraction (IE). Label sparsity is a general problem, and not specific to these use cases, that prevents modern AI and machine learning techniques from applying to many applications for which labeled data is not readily available. As a result, the study of how to acquire the knowledge and data needed for AI to work is as much a problem today as it was in the 1970s and 80s during the advent of expert systems \cite{mycin1975}. The class-level attributes approach presented here is based on a KG-inspired intuition that a lot of the knowledge people need to understand where to go to buy a product they need, or where to find the dishes they want to eat, is categorical and part of their general common sense: everyone knows grocery stores sell milk and don't sell asphalt, chinese restaurants serve fried rice and not hamburgers, etc. We acquired a mixture of instance- and class- level pairs (e.g. $\langle$\textit{Ajay Mittal Dairy}, milk$\rangle$, $\langle$GroceryStore, milk$\rangle$, resp.) from a novel 3-tier crowdsourcing method, and demonstrate the scalability advantages of the class-level approach. Our results show that crowdsourced class-level knowledge can provide rapid scaling of knowledge acquisition in shopping and dining domains. The acquired common sense knowledge also has long-term value in the KG. The approach was a critical part of enabling a worldwide \textit{local search} capability on Google Maps, with which users can find products and dishes that are available in most places on earth. View details
    Preview abstract Successful knowledge graphs (KGs) solved the historical knowledge acquisition bottleneck by supplanting an expert focus with a simple, crowd-friendly one: KG nodes represent popular people, places, organizations, etc., and the graph arcs represent common sense relations like affiliations, locations, etc. Techniques for more general, categorical, KG curation do not seem to have made the same transition: the KG research community is still largely focused on methods that belie the common-sense characteristics of successful KGs. In this paper, we propose a simple approach to acquiring and reasoning with class-level attributes from the crowd that represent broad common sense associations between categories. We pick a very real industrial-scale data set and problem: how to augment an existing knowledge graph of places and products with associations between them indicating the availability of the products at those places, which would enable a KG to provide answers to questions like, ``Where can I buy milk nearby?'' This problem has several practical challenges, not least of which is that only 30\% of physical stores (i.e. brick \& mortar stores) have a website, and fewer list their product inventory, leaving a large acquisition gap to be filled by methods other than information extraction (IE). Based on a KG-inspired intuition that a lot of the class-level pairs are part of people's general common sense, e.g. everyone knows grocery stores sell milk and don't sell asphalt, we acquired a mixture of instance- and class- level pairs (e.g. {Ajay Mittal Dairy, milk}, {GroceryStore, milk}, resp.) from a novel 3-tier crowdsourcing method, and demonstrate the scalability advantages of the class-level approach. Our results show that crowdsourced class-level knowledge can provide rapid scaling of knowledge acquisition in this and similar domains, as well as long-term value in the KG. View details
    Embedding Semantic Taxonomies
    Alyssa Whitlock Lees
    Jacek Korycki
    Sara Mc Carthy
    CoLing 2020
    Preview abstract A common step in developing an understanding of a vertical domain, e.g. shopping, dining, movies, medicine, etc., is curating a taxonomy of categories specific to the domain. These human created artifacts have been the subject of research in embeddings that attempt to encode aspects of the partial ordering property of taxonomies. We compare Box Embeddings, a natural containment representation of category taxonomies, to partial-order embeddings and a baseline Bayes Net, in the context of representing the Medical Subject Headings (MeSH) taxonomy given a set of 300K PubMed articles with subject labels from MeSH. We deeply explore the experimental properties of training box embeddings, including preparation of the training data, sampling ratios and class balance, initialization strategies, and propose a fix to the original box objective. We then present first results in using these techniques for representing a bipartite learning problem (i.e. collaborative filtering) in the presence of taxonomic relations within each partition, inferring disease (anatomical) locations from their use as subject labels in journal articles. Our box model substantially outperforms all baselines for taxonomic reconstruction and bipartite relationship experiments. This performance improvement is observed both in overall accuracy and the weighted spread by true taxonomic depth. View details
    Taxonomy Embeddings on PubMed Article Subject Headings
    Alyssa Whitlock Lees
    Jacek Korycki
    Taxonomy Embeddings on PubMed Article Subject Headings, CEUR Workshop Proceedings, http://semantics-powered.org/sepda2019.html#scope (2019) (to appear)
    Preview abstract Machine learning approaches for hierarchical partial-orders, such as taxonomies, are of increasing interest in the research community, though practical applications have not yet emgerged. The basic intuition of hierarchical embeddings is that some signal from taxonomic knowledge can be harnessed in broader machine learning problems; when we learn similarity of words using word embeddings, the similarity of *lion* and *tiger* are indistinguishable from the similarity of *lion* and *animal*. The ability to tease apart these two kinds of similarities in a machine learning setting yields improvements in quality as well as enabling the exploitation of the numerous human-curated taxonomies available across domains, while at the same time improving upon known taxonomic organization problems, such as partial or conditional membership. We explore some of the practical problems in learning taxonomies using bayesian networks, partial order embeddings, and box lattice embeddings, where box containment represents category containment. Using open data from pubmed articles with human assigned MeSH labels, we investigate the impact of taxonomic information, negative sampling, instance sampling, and objective functions to improve performance on the taxonomy learning problem. We discovered a particular problem for learning box embeddings for taxonomies we called the box crossing problem, and developed strategies to overcome it. Finally we make some initial contributions to using taxonomy embeddings to improve another learning problem: inferring disease (anatomical) locations from their use as subject labels in journal articles. In most experiments, after our improvements to box models, the box models outperformed the simpler Bayes Net approach as well as Order Embeddings. View details
    Corroborate and learn facts from the web
    Jonathan Betz
    Proceedings of the 13th ACM SIGKDD international conference on Knowledge discovery and data mining, ACM, San Jose (2007), pp. 995-1003
    Preview
    No Results Found