Jie Yang

Jie Yang is a senior software engineer at X, The Moonshot Factory (formerly Google X), working on computer vision, machine learning/deep learning, remote sensing, and large-scale distributed systems. He also worked on NLP, data mining, personalization, and user modeling projects at Google Research. Prior to joining Google, he worked at Yahoo! Labs, Delft University of Technology, and a couple of startups. He holds 10+ patents in the areas of data mining, machine learning, and remote sensing.
Authored Publications
    Hierarchical Label Propagation and Discovery for Machine Generated Email
    Lluis Garcia-Pueyo
    Vanja Josifovski
    Ivo Krka
    Amitabh Saikia
    Sujith Ravi
    Proceedings of the International Conference on Web Search and Data Mining (WSDM), ACM (2016), pp. 317-326
    Machine-generated documents such as email or dynamic web pages are single instantiations of a pre-defined structural template. As such, they can be viewed as a hierarchy of template and document specific content. This hierarchical template representation has several important advantages for document clustering and classification. First, templates capture common topics among the documents, while filtering out the potentially noisy variabilities such as personal information. Second, template representations scale far better than document representations since a single template captures numerous documents. Finally, since templates group together structurally similar documents, they can propagate properties between all the documents that match the template. In this paper, we use these advantages for document classification by formulating an efficient and effective hierarchical label propagation and discovery algorithm. The labels are propagated first over a template graph (constructed based on either term-based or topic-based similarities), and then to the matching documents. We evaluate the performance of the proposed algorithm using a large donated email corpus and show that the resulting template graph is significantly more compact than the corresponding document graph and the hierarchical label propagation is both efficient and effective in increasing the coverage of the baseline document classification algorithm. We demonstrate that the template label propagation achieves more than 91% precision and 93% recall, while increasing the label coverage by more than 11%.
    Business-to-consumer (B2C) emails are usually generated by filling structured user data (e.g. purchase, event) into templates. Extracting structured data from B2C emails allows users to track important information on various devices. However, it also poses several challenges, due to the requirement of short response time for massive data volume, the diversity and complexity of templates, and the privacy and legal constraints. Most notably, email data is legally protected content, which means no one except the receiver can review the messages or derived information. In this paper we first introduce a system which can extract structured information automatically without requiring human review of any personal content. Then we focus on how to annotate product names from the extracted texts, which is one of the most difficult problems in the system. Neither general learning methods, such as binary classifiers, nor more specific structure learning methods, such as Conditional Random Field (CRF), can solve this problem well. To accomplish this task, we propose a hybrid approach, which basically trains a CRF model using the labels predicted by binary classifiers (weak learners). However, the performance of weak learners can be low, therefore we use Expectation Maximization (EM) algorithm on CRF to remove the noise and improve the accuracy, without the need to label and inspect specific emails. In our experiments, the EM-CRF model can significantly improve the product name annotations over the weak learners and plain CRFs.
    Vote calibration in community question-answering systems
    Bee-Chung Chen
    Anirban Dasgupta
    SIGIR '12 Proceedings of the 35th international ACM SIGIR conference on Research and development in information retrieval (2012), pp. 781-790
    User votes are important signals in community question-answering (CQA) systems. Many features of typical CQA systems, e.g. the best answer to a question, status of a user, are dependent on ratings or votes cast by the community. In a popular CQA site, Yahoo! Answers, users vote for the best answers to their questions and can also thumb up or down each individual answer. Prior work has shown that these votes provide useful predictors for content quality and user expertise, where each vote is usually assumed to carry the same weight as others. In this paper, we analyze a set of possible factors that indicate bias in user voting behavior -- these factors encompass different gaming behavior, as well as other eccentricities, e.g., votes to show appreciation of answerers. These observations suggest that votes need to be calibrated before being used to identify good answers or experts. To address this problem, we propose a general machine learning framework to calibrate such votes. Through extensive experiments based on an editorially judged CQA dataset, we show that our supervised learning method of content-agnostic vote calibration can significantly improve the performance of answer ranking and expert ranking.