SIGMOD 2020: Keynote Talks

Systems and ML: When the Sum is Greater than Its Parts

Speaker: Ion Stoica (UC Berkeley)

Ion Stoica is a Professor in the EECS Department at the University of California at Berkeley, and the Director of RISELab (https://rise.cs.berkeley.edu/). His current research focuses on cloud computing and AI systems. His past work includes Apache Spark, Apache Mesos, Tachyon, Chord DHT, and Dynamic Packet State (DPS). He is an ACM Fellow and has received numerous awards, including the Mark Weiser Award (2019), the SIGOPS Hall of Fame Award (2015), the SIGCOMM Test of Time Award (2011), and the ACM Doctoral Dissertation Award (2001). He also co-founded three companies: Anyscale (2019), Databricks (2013), and Conviva (2006).

When the Web is your Data Lake: Creating a Search Engine for Datasets on the Web

Speaker: Natasha Noy (Google Research)

There are thousands of data repositories on the Web, providing access to millions of datasets. National and regional governments, scientific publishers and consortia, commercial data providers, and others publish data for fields ranging from social science to life science to high-energy physics to climate science and more. Access to this data is critical to facilitating reproducibility of research results, enabling scientists to build on others' work, and giving data journalists easier access to information and its provenance. In this talk, I will discuss our work on Dataset Search at Google. Dataset Search provides search capabilities over potentially all dataset repositories on the Web. I will talk about the open ecosystem for describing and citing datasets that we hope to encourage, and about the technical details of how we built Dataset Search. Finally, I will highlight research challenges in building a vibrant, heterogeneous, and open ecosystem where data becomes a first-class citizen.

Natasha Noy is a senior staff scientist at Google Research, where she works on making structured data accessible and useful. She leads the team building Dataset Search, a search engine for all the datasets on the Web. Prior to joining Google, she worked at the Stanford Center for Biomedical Informatics Research, where she made major contributions in the areas of ontology development and alignment, and collaborative ontology engineering. Dr. Noy is a Fellow of the Association for the Advancement of Artificial Intelligence (AAAI). She served as the President of the Semantic Web Science Association from 2011 to 2017.

The Challenge of Building Effective Data Lakes

Speaker: Awez Syed (Databricks)

There has been a rapid rise in the popularity of data lakes as the data infrastructure for modern analytics and data science. The combination of cloud storage and fast, elastic processing provides an inexpensive and scalable solution for building analytical applications. While data lakes make it easy to ingest and store vast amounts of data, the ability to effectively make use of that data is still limited. This data often lacks context, doesn't meet the quality required for applications, and is not easily understandable or discoverable by users. Problems of data consistency and accuracy make it hard to derive value from data lakes and to trust the analytics based on this data. The traditional methods of manually documenting, classifying and assessing the data don't scale to the volume of cloud-based data lakes. New automated, learning-based approaches are required to discover, curate and make the data usable for a wide variety of users. In this talk, we describe the real-world implementation patterns of data lakes and give an overview of the many open challenges in deploying successful, enterprise-scale data lakes.

Awez Syed is the VP of Products at Databricks, a Data and AI company. He has over 20 years of experience building commercially successful products for Data Integration, Data Quality, and Metadata Management. He recently led the development of a market-leading, ML-driven Data Catalog solution for data classification, data profiling, and relationship discovery.
