
Google's structured data and database management research focuses on creating tools that help users discover structured data on the Web, query and integrate multiple data sources, and create visualizations that are easily published. Many of our challenges arise from the vast heterogeneity of data on the Web and our desire to cater to users with relatively little technical expertise. The following are examples of our publications.
“Fuzzy Joins Using MapReduce”, Foto N. Afrati, Anish Das Sarma, David Menestrina, Aditya Parameswaran, Jeffrey Ullman, ICDE, 2012 (to appear).
“Interactive Regret Minimization”, Danupon Nanongkai, Ashwin Lall, Atish Das Sarma, Kazuhisa Makino, SIGMOD, 2012 (to appear).
“Computational Journalism: A Call to Arms to Database Researchers”, Sarah Cohen, Chengkai Li, Jun Yang, Cong Yu, CIDR, 2011.
“Data Integration with Dependent Sources”, Anish Das Sarma, Luna Dong, Alon Halevy, EDBT, 2011.
“Dremel: Interactive Analysis of Web-Scale Datasets”, Sergey Melnik, Andrey Gubarev, Jing Jing Long, Geoffrey Romer, Shiva Shivakumar, Matt Tolton, Communications of the ACM, vol. 54 (2011), pp. 114-123.
“Efficiently Encoding Term Co-occurrences in Inverted Indexes”, Marcus Fontoura, Maxim Gurevich, Vanja Josifovski, Sergei Vassilvitskii, 20th ACM Conference on Information and Knowledge Management (CIKM 2011) (to appear).
“Efficiently Evaluating Graph Constraints in Content-Based Publish/Subscribe”, Andrei Broder, Shirshanka Das, Marcus Fontoura, Bhaskar Ghosh, Vanja Josifovski, Jayavel Shanmugasundaram, Sergei Vassilvitskii, The 20th International World Wide Web Conference (WWW 2011).
“Entity-Relationship Queries over Wikipedia”, Xiaonan Li, Chengkai Li, Cong Yu, ACM Transactions on Intelligent Systems and Technology, 2011 (to appear).
“Evaluation Strategies for Top-k Queries over Memory-Resident Inverted Indexes”, Marcus Fontoura, Vanja Josifovski, Jinhui Liu, Srihari Venkatesan, Xiangfei Zhu, Jason Zien, The 37th International Conference on Very Large Databases (VLDB 2011) (to appear).
“Factorization-based Lossless Compression of Inverted Indices”, George Beskales, Marcus Fontoura, Maxim Gurevich, Vanja Josifovski, Sergei Vassilvitskii, 20th ACM Conference on Information and Knowledge Management (CIKM 2011) (to appear).
“Graph cube: on warehousing and OLAP multidimensional networks”, Peixiang Zhao, Xiaolei Li, Dong Xin, Jiawei Han, SIGMOD - Proceedings of the 2011 International Conference on Management of Data.
“Hyper-local, directions-based ranking of places”, Petros Venetis, Hector Gonzalez, Alon Y. Halevy, Christian S. Jensen, PVLDB, vol. 4(5) (2011), pp. 290-30.
“Maestro: Quality-of-Service in Large Disk Arrays”, Arif Merchant, Mustafa Uysal, Pradeep Padala, Xiaoyun Zhu, Sharad Singhal, Kang Shin, Proceedings of the 8th ACM international conference on Autonomic computing (ICAC), 2011, pp. 245-254.
“Representative Skylines using Threshold-based Preference Distributions”, Atish Das Sarma, Ashwin Lall, Danupon Nanongkai, Richard J. Lipton, Jim Xu, International Conference on Data Engineering (ICDE), 2011.
“Adaptive query processing in data stream management systems under limited memory resources”, Fatima Farag, Moustafa A. Hammad, Reda Alhajj, Proceedings of the 3rd Workshop on Ph.D. Students in Information and Knowledge Management (PIKM 2010), Toronto, Ontario, Canada, October 30, 2010, pp. 9-16.
“Automatically incorporating new sources in keyword search-based data integration”, Partha Pratim Talukdar, Zachary G. Ives, Fernando Pereira, SIGMOD Conference, 2010, pp. 387-398.
“Collaborative Environmental In Situ Data Collection: Experiences and Opportunities for Ambient Data Integration”, David Thau, On the Move to Meaningful Internet Systems: OTM 2010 Workshops, pp. 119.
“Google Fusion Tables: Data Management, Integration, and Collaboration in the Cloud”, Hector Gonzalez, Alon Halevy, Christian Jensen, Anno Langen, Jayant Madhavan, Rebecca Shapley, Warren Shen, Proceedings of the ACM Symposium on Cloud Computing (SOCC), 2010 (to appear).
“Google Fusion Tables: Web-Centered Data Management and Collaboration”, Hector Gonzalez, Alon Halevy, Christian Jensen, Anno Langen, Jayant Madhavan, Rebecca Shapley, Warren Shen, Jonathan Goldberg-Kidon, Proceedings of the ACM SIGMOD conference, 2010 (to appear).
“Pregel: a system for large-scale graph processing”, Grzegorz Malewicz, Matthew H. Austern, Aart J. C. Bik, James C. Dehnert, Ilan Horn, Naty Leiser, Grzegorz Czajkowski, Proceedings of the 2010 international conference on Management of data, pp. 135-146.
“The Case Against Data Lock-in”, Brian W. Fitzpatrick, JJ Lueck, Communications of the ACM, vol. 53, no. 11 (2010), pp. 42-46.
“Threshold query optimization for uncertain data”, Yinian Qi, Rohit Jain, Sarvjeet Singh, Sunil Prabhakar, Special Interest Group on Management of Data (SIGMOD), 2010.
“VoR-Tree: R-trees with Voronoi Diagrams for Efficient Processing of Spatial Nearest Neighbor Queries”, Mehdi Sharifzadeh, Cyrus Shahabi, Very Large Databases (VLDB) (2010).
“DRAM Errors in the Wild: A Large-Scale Field Study”, Bianca Schroeder, Eduardo Pinheiro, Wolf-Dietrich Weber, SIGMETRICS, 2009.
“Data Integration with Uncertainty”, Xin Luna Dong, Alon Halevy, Cong Yu, The VLDB Journal, vol. 18 (2009), pp. 469-500.
“Data Modeling in Dataspace Support Platforms”, Anish Das Sarma, Xin (Luna) Dong, Alon Y. Halevy, 2009, pp. 122-138.
“Engineering autonomic systems”, Joseph L. Hellerstein, ICAC '09: Proceedings of the 6th international conference on Autonomic computing, 2009, pp. 75-76.
“Exploring Schema Repositories with Schemr”, Kuang Chen, Jayant Madhavan, Alon Halevy, Proceedings of the ACM SIGMOD conference, 2009, pp. 1095-1098.
“Representing uncertain data: models, properties, and algorithms”, Anish Das Sarma, Omar Benjelloun, Alon Halevy, Shubha Nabar, Jennifer Widom, The VLDB Journal, vol. 18 (2009), pp. 989-1019.
“The Claremont report on database research”, Rakesh Agrawal, Anastasia Ailamaki, Philip A. Bernstein, Eric A. Brewer, Michael J. Carey, Surajit Chaudhuri, Anhai Doan, Daniela Florescu, Michael J. Franklin, Hector Garcia-Molina, Johannes Gehrke, Le Gruenwald, Laura M. Haas, Alon Y. Halevy, Joseph M. Hellerstein, Yannis E. Ioannidis, Hank F. Korth, Donald Kossmann, Samuel Madden, Roger Magoulas, Beng Chin Ooi, Tim O'Reilly, Raghu Ramakrishnan, Sunita Sarawagi, Michael Stonebraker, Alexander S. Szalay, Gerhard Weikum, Commun. ACM, vol. 52 (2009), pp. 56-65.
“Using Hoarding to Increase Availability in Shared File Systems”, Jochen Hollmann, Per Stenström, Eighth IEEE/ACIS International Conference on Computer and Information Science (ICIS 2009), pp. 422-429.
“Weighted Proximity Best-Joins for Information Retrieval”, Risi Thonangi, Hao He, Anhai Doan, Haixun Wang, Jun Yang, ICDE '09: Proceedings of the 2009 IEEE International Conference on Data Engineering, pp. 234-245.
“Bootstrapping Pay-as-you-go Data Integration Systems”, Anish Das Sarma, Xin Dong, Alon Halevy, Proc. ACM SIGMOD International Conference on Management of Data, 2008, pp. 861-874.
“Pay-as-you-go User Feedback for Dataspace Systems”, Shawn R. Jeffery, Michael J. Franklin, Alon Y. Halevy, Proc. ACM SIGMOD International Conference on Management of Data, 2008, pp. 847-860.
“The Space Complexity of Processing XML Twig Queries over Indexed Documents”, Mirit Shalem, Ziv Bar-Yossef, Proceedings of the 24th International Conference on Data Engineering (ICDE), 2008, pp. 824-832.
“Ad Hoc Distributed Simulations”, Richard Fujimoto, Michael Hunter, Jason Sirichoke, Mahesh Palekar, Hoe Kim, Wonhu Suh, 21st International Workshop on Principles of Advanced and Distributed Simulation (PADS'07), 2007, pp. 15-24.
“An Information Avalanche”, Vint Cerf, IEEE Computer, vol. 40, no. 1 (2007), pp. 104-105.
“Building MEMS-Based Storage Systems for Streaming Media”, Raju Rangaswami, Zoran Dimitrijević, Edward Chang, Klaus Schauser, ACM Transactions on Storage, vol. 9 (2007).
“Estimating Statistical Aggregates on Probabilistic Data Streams”, T. S. Jayram, Andrew McGregor, S. Muthukrishnan, Erik Vee, Principles of Database Systems (PODS) 2007, pp. 243-252.
“Failure Trends in a Large Disk Drive Population”, Eduardo Pinheiro, Wolf-Dietrich Weber, Luiz André Barroso, 5th USENIX Conference on File and Storage Technologies (FAST 2007), pp. 17-29.
“Indexing Dataspaces”, Xin Dong, Alon Halevy, Proc. ACM SIGMOD, 2007.
“Life on the Edge: Monitoring and Running a Very Large Perforce Installation”, Dan Bloch, Perforce User Conference 2007.
“Optimal Traversal Planning in Road Networks with Navigational Constraints”, Leyla Kazemi, Cyrus Shahabi, Mehdi Sharifzadeh, Luc Vincent, ACM GIS (2007).
“Query Suspend and Resume”, Badrish Chandramouli, Chris Bond, Shivnath Babu, Jun Yang, Proc. ACM SIGMOD, 2007.
“Web-scale Data Integration: You can only afford to Pay As You Go”, Jayant Madhavan, Shawn R. Jeffery, Shirley Cohen, Xin (Luna) Dong, David Ko, Cong Yu, Alon Halevy, CIDR, 2007.
“Achieving completion time guarantees in an opportunistic data migration scheme”, Jianyong Zhang, Prasenjit Sarkar, Anand Sivasubramaniam, ACM SIGMETRICS Performance Evaluation Review, vol. 33 (2006), pp. 11-16.
“Data integration: the teenage years”, Alon Halevy, Anand Rajaraman, Joann Ordille, Proc. 32nd International Conference on Very Large Databases, 2006, pp. 9-16.
“Data management projects at Google”, Wilson Hsieh, Jayant Madhavan, Rob Pike, SIGMOD Conference, 2006, pp. 725-726.
“On-the-fly Sharing for Streamed Aggregation”, Sailesh Krishnamurthy, Chung Wu, Michael J. Franklin, SIGMOD Conference, 2006, pp. 623-634.
“Principles of dataspace systems”, Alon Y. Halevy, Michael J. Franklin, David Maier, PODS, 2006, pp. 1-9.
“Semantically-smart disk systems: past, present, and future”, Andrea C. Arpaci-Dusseau, Remzi H. Arpaci-Dusseau, Lakshmi N. Bairavasundaram, Timothy E. Denehy, Florentina I. Popovici, Vijayan Prabhakaran, Muthian Sivathanu, ACM SIGMETRICS Performance Evaluation Review, vol. 33 (2006), pp. 29-35.
“Sender Reputation in a Large Webmail Service”, Bradley Taylor, Third Conference on Email and Anti-Spam (CEAS 2006).
“Structured Data Meets the Web: A Few Observations”, Jayant Madhavan, Alon Halevy, Shirley Cohen, Xin (Luna) Dong, Shawn R. Jeffery, David Ko, Cong Yu, Data Engineering Bulletin (2006).
“ULDBs: databases with uncertainty and lineage”, Omar Benjelloun, Anish Das Sarma, Alon Halevy, Jennifer Widom, Proc. 32nd International Conference on Very Large Databases, 2006, pp. 953-964.
“PADX: Querying large-scale ad hoc data with XQuery”, Mary Fernandez, Kathleen Fisher, Robert Gruber, Yitzhak Mandelbaum, Proceedings of PLAN-X 2006: Workshop on Programming Language technologies for XML (2006).
“Networking proposal for TR2”, Gerhard Wesp, 2005.