DATA Lab seminar: group meetings & guest speakers
We are currently meeting Wednesdays and/or Fridays at noon in WVH 462 (map) for brown bag lunch. Please subscribe to our DATA Lab talks email list or DATA Lab talks calendar if you think data is the future.
(5/10, 12:00pm): Predrag Radivojac (Khoury / Northeastern): A new class of metrics for learning on real-valued and structured data details)TITLE
A new class of metrics for learning on real-valued and structured data
We propose a new class of metrics on sets, vectors, and functions that can be used in various stages of data mining, including exploratory data analysis, learning, and result interpretation. These new distance functions unify and generalize some of the popular metrics, such as the Jaccard and bag distances on sets, Manhattan distance on vector spaces, and Marczewski-Steinhaus distance on integrable functions. We prove that the new metrics are complete and show useful relationships with $f$-divergences for probability distributions. To further extend our approach to structured objects such as ontologies, we introduce information-theoretic metrics on directed acyclic graphs drawn according to a fixed probability distribution. We conduct empirical investigation to demonstrate the effectiveness on real-valued, high-dimensional, and structured data. Overall, the new metrics compare favorably to multiple similarity and dissimilarity functions traditionally used in data mining, including the Minkowski ($L^p$) family, the fractional $L^p$ family, two $f$-divergences, cosine distance, and two correlation coefficients. We provide evidence that they are particularly appropriate for rapid processing of high-dimensional and structured data in distance-based learning.
(5/15, 12:00pm): Leilani Battle (University of Maryland): Behavior-Driven Optimizations for Big Data Exploration details)TITLE
Behavior-Driven Optimizations for Big Data Exploration
With massive amounts of data coming from field sensors, sequencers, mobile devices, etc., data-driven decision making has become increasingly important in industry, government, and the sciences. One of the key issues for analysts and scientists who work with large datasets is efficient visualization of their data to extract patterns, observe anomalies, and debug their analysis workflows. Though a variety of visualization tools exist to help people make sense of their data, these tools often rely on database management systems (or DBMSs) for data processing and storage; and unfortunately, DBMSs fail to process the data fast enough to support a fluid, interactive visualization experience.
My work blends optimization techniques from databases and methodology from HCI and visualization in order to support interactive exploration of large datasets. In this talk, I will first discuss Sculpin, a visual exploration system that learns user exploration patterns automatically, and exploits these patterns to pre-fetch data ahead of users as they explore. I will then discuss ongoing work to better understand how analysts explore data in more sophisticated analysis systems, such as Tableau Desktop. Finally, I will report on ongoing efforts to standardize the way we evaluate visual data analysis systems in general.
SIGMOD 2016: http://www.cs.umd.edu/~leilani/static/papers/forecache_cr_sigmod2016.pdf
HILDA 2018: http://www.cs.umd.edu/~leilani/static/papers/battle_visualization-centered-evaluation_hilda_2018.pdf
Leilani Battle is an Assistant Professor at the University of Maryland, College Park, with a joint appointment in the University of Maryland Institute for Advanced Computer Studies (UMIACS). She is also affiliated with the UMD Human-Computer Interaction Laboratory (HCIL). Her research interests focus on developing interactive data-intensive systems that can aid analysts in performing complex data exploration and analysis. Her current research is anchored in the field of databases, but utilizes research methodology and techniques from HCI and visualization to integrate data processing (databases) with interactive interfaces (HCI, visualization). She is an NSF Graduate Research Fellowship Recipient (2012), and her research is currently supported by an NSF CISE CRII Award (2019). In 2017, she completed a postdoc in the UW Interactive Data Lab. She holds an MS (2013) and PhD (2017) in Computer Science from MIT, where she was a member of the MIT Database Group, and a BS in Computer Engineering from UW (2011), where she was a member of the UW database group.
(5/31, 12:00pm): Raymond Chi-Wing Wong (The Hong Kong University of Science and Technology): FindYourFavorite: An Interactive System for Finding the User's Favorite Tuple in the Database details)TITLE
FindYourFavorite: An Interactive System for Finding the User's Favorite Tuple in the Database
When faced with a database containing millions of tuples, an end user might be only interested in finding his/her favorite tuple in the database. In this talk, we study how to help an end user to find such a favorite tuple with a few user interations. In each interaction, a user is presented with a small number of tuples (which can be artificial tuples outside the database or true tuples inside the database) and s/he is asked to indicate the tuple s/he favors the most among them. Different from the previous work which displays artificial tuples to users during the interaction and requires heavy user interactions, we achieve a stronger result. Specifically, we use a concept, called the utility hyperplane, to model the user preference and an effective pruning strategy to locate the favorite tuple for a user in the whole database. Based on these techniques, we developed an interactive system, called FindYourFavorite, and demonstrate that the system could identify the favorite tuple for a user with a few user interactions by always displaying true tuples in the database.
SIGMOD 2018: http://home.cse.ust.hk/~raywong/paper/sigmod18-regret.pdf
Raymond Chi-Wing Wong is an Associate Professor in Computer Science and Engineering (CSE) of The Hong Kong University of Science and Technology (HKUST). He is currently the director of the Risk Management and Business Intelligence (RMBI) program. He was the director of the Computer Engineering (CPEG) program from 2014 to 2016 and was the associate director of the Computer Engineering (CPEG) program from 2012 to 2014. He received the BSc, MPhil and PhD degrees in Computer Science and Engineering in the Chinese University of Hong Kong (CUHK) in 2002, 2004 and 2008, respectively. In 2004-2005, he worked as a research and development assistant under an R&D project funded by ITF and a local industrial company called Lifewood. From May 2006 to Aug 2006, he was a visiting student of Prof. Jian Pei and Prof. Ke Wang, at Simon Fraser University in Canada. From Aug 2007 to Sept 2007, he visited IBM T.J. Watson Research Center as a summer intern under the supervision of Prof. Philip S. Yu. From Jun 2008 to Jul 2008, he visited Prof. Tamer Ozsu at University of Waterloo as a visiting scholar. Some of his collaborators are Prof. Ada Fu (2003-2013), Prof. Ke Wang (2003-2013), Prof. Philip S. Yu (2009-2013), Prof. Jian Pei (2006-2011), Prof. Tamer Ozsu (2009-2011), Prof. Jiuyong Li (2006-2011), Prof. Yufei Tao (2007-2009), Prof. Ihab Ilyas (2009), Prof. Jeffrey Yu (2009),Prof. Lei Chen (2008) and Prof. Eamonn Keogh (2008). He received 28 awards. He published 69 conference papers (e.g., SIGMOD, SIGKDD, VLDB, ICDE and ICDM), 31 journal/chapter papers (e.g., TODS, DAMI, TKDE, VLDB journal and TKDD) and 1 book. He reviewed papers from conferences and journals related to data mining and database, including VLDB conference, SIGMOD, TODS, VLDB Journal, TKDE, TKDD, ICDE, SIGKDD, ICDM, DAMI, DaWaK, PAKDD, EDBT and IJDWM. He is a program committee member of conferences, including SIGMOD, VLDB, ICDE, KDD, ICDM and SDM, and a referee of journals, including TODS, VLDBJ, TKDE, TKDD, DAMI and KAIS.
(6/21, 12:00pm): Wolfgang Gatterbauer (Khoury): Anytime Approximation in Probabilistic Databases via Scaled Dissociations details)TITLE
Anytime Approximation in Probabilistic Databases via Scaled Dissociations
Speeding up probabilistic inference remains a key challenge in probabilistic databases (PDBs) and the related area of statistical relational learning (SRL). Since computing probabilities for query answers is #P-hard, even for fairly simple conjunctive queries, both the PDB and SRL communities have proposed a number of approximation techniques over the years. The two prevalent techniques are either (i) MCMCstyle sampling or (ii) branch-and-bound (B&B) algorithms that iteratively improve model-based bounds using a combination of variable substitution and elimination.
We propose a new anytime B&B approximation scheme that encompasses all prior model-based approximation schemes proposed in the PDB and SRL literature. Our approach relies on the novel idea of “scaled dissociation” which can improve both the upper and lower bounds of existing model-based algorithms. We apply our approach to the well-studied problem of evaluating self-join-free conjunctive queries over tuple-independent PDBs, and show a consistent reduction in approximation error in our experiments on TPC-H, Yago3, and a synthetic benchmark setting.
Based on joint work with Maarten Van den Heuvel, Peter Ivanov, Floris Geerts, and Martin Theobald.
SIGMOD 2019: http://gatterbauer.name/download/sigmod2019_anytime_approximations_scaled_dissociations.pdf
(9/20, 12:00pm): Oktie Hassanzadeh (IBM): Knowledge Graphs for Data Curation & Data Science details)TITLE
Knowledge Graphs for Data Curation & Data Science
In this informal talk, I will go over a number of projects at IBM Research that leverage knowledge graphs for large-scale data exploration, data curation, and data science. In particular, I will describe an application of knowledge graphs for large-scale data exploration and curation in the context of the IBM Research Helix project, where large-scale similarity analysis and automated linkage discovery enable the discovery and construction of data sets from a large number of data sets (or a data lake) for a given analytics task. I will then describe our ongoing project and recent progress on using knowledge graphs for automating data science. I will discuss a number of ways knowledge graphs can assist with feature engineering and the challenges we faced in implementing a prototype solution. If time permits, I will briefly go over some of our recent work on knowledge extraction from text.
* Automated Feature Enhancement for Predictive Modeling using External Knowledge, ICDM 2019 (Demo Track, To Appear)
* Exploring Big Data with Helix: Finding Needles in a Big Haystack. SIGMOD Record 43(4): 43-54 (2014) https://doi.org/10.1145/2737817.2737829
* Instance-Based Matching of Large Ontologies Using Locality-Sensitive Hashing. International Semantic Web Conference (1) 2012: 49-64 https://doi.org/10.1007/978-3-642-35176-1_4
* Inducing Implicit Relations from Text Using Distantly Supervised Deep Nets. International Semantic Web Conference (1) 2018: 38-55 https://doi.org/10.1007/978-3-030-00671-6_3 (ISWC 2017 Semantic Web Challenge Winning Solution, see blog post here: https://www.ibm.com/blogs/research/2017/11/knowledge-base-construction-iswc-2017/)
* Answering Binary Causal Questions Through Large-Scale Text Mining: An Evaluation Using Cause-Effect Pairs from Human Experts. IJCAI 2019: 5003-5009 https://doi.org/10.24963/ijcai.2019/695
Dr. Oktie Hassanzadeh is a Research Staff Member at IBM Research. He has received several academic and corporate awards for his work in data management and knowledge graphs. He is a recipient of the top prize at the prestigious Semantic Web Challenge at ISWC 2017, as well as two best-paper awards at ESWC 2018 and ESWC 2016 conferences. He has received his M.Sc. and Ph.D. degrees from the University of Toronto, where he received the IBM PhD fellowship and the Yahoo! Key Scientific Challenges awards. He is also a two-time recipient of the first prize at the Triplification Challenge at the SEMANTiCS Conference for his projects in the areas of Semantic Technologies and Linked Data. For more information, refer to his home page: http://researcher.watson.ibm.com/person/us-hassanzadeh
(9/27, 12:00pm): Boris Glavic (Illinois Institute of Technology): Relevance-based Data Management details)TITLE
Relevance-based Data Management
In this talk, I give an overview of my group's work on relevance-based data management, a novel data management paradigm centered around the idea of analyzing what data is needed to answer queries. For the most part, the talk will focus on one particular approach which we refer to as provenance-based data skipping. Since the inception of relational databases, index structures and other physical design techniques have been used to speed-up access to data. In spite of the wide variety of mature techniques that are available, there are important classes of selective queries, e.g., top-k queries, which cannot benefit from these techniques. This issue arises because database systems statically analyze incoming queries to determine which data is needed for answering which query. However, for many queries it is impossible to determine up-front what data is relevant. Provenance-based data skipping overcomes this limitation by generating provenance sketches which are concise encodings of what data is relevant for a query. Similar to query answering with views, a provenance sketch captured for one query is used to speed up the subsequent evaluation of other queries. In contrast to materialized views, provenance sketches are guaranteed to be small (100s of bytes) and can benefit from physical database design artifacts such as indexes, horizontal partitioning of tables, and zone maps. Furthermore, I will highlight open problems in provenance-based data skipping and discuss additional applications of relevance-based data management techniques in general.
IEEE Data Engineering Bulletin: GProM - A Swiss Army Knife for Your Provenance Needs IEEE Data Engineering Bulletin http://sites.computer.org/debull/A18mar/p51.pdf
Boris Glavic is an Associate Professor of Computer Science at the Illinois Institute of Technology where he leads the IIT DBGroup (http://www.cs.iit.edu/~dbgroup/ ). Before coming to IIT, Boris spend to two years as a PostDoc at the University of Toronto working with Renée J. Miller. He received his PhD from the University of Zurich in Switzerland being advised by Michael Böhlen and Gustavo Alonso. Boris is a database researcher building comprehensive systems based on solid theoretical foundations. His main research interests are provenance, data integration and cleaning, and query processing and optimization.
(10/11, 12:00pm): Owen O'Malley (Cloudera): Big Data's journey to ACID details)TITLE
Big Data's journey to ACID
Originally, the Hadoop ecosystem had no ability to handle updates and the user had to completely re-write the data to update it. Over the years, the community built the ability to have point updates in key-value stores (eg HBase) and adding new partitions of data via Hive. Five years ago, we added the first ability for Hive to use standard SQL operations to update your data. Recently, there have been multiple new systems to address similar needs: Apache Iceberg, Apache Hudi, and Databricks Delta. I'll talk about compare the new systems to Hive ACID, discussing their trade offs and approaches.
Owen O’Malley was a founder and technical fellow at Hortonworks. Hortonworks was founded in 2011 with 25 employees spinning out from Yahoo, went public in 2014, and grew to 1,400 employees before merging with Cloudera in 2019. Cloudera’s platform includes Hadoop and the large ecosystem of big data open source tools that enterprises need for their data analytics. Owen has been working on Hadoop since the beginning of 2006, was the first committer added to the project, and used Hadoop to set the Gray sort benchmark in 2008 and 2009. Over the years, he was the technical lead for MapReduce, Hadoop Security, ORC file, and adding ACID transactions to Hive.
(10/11, 3:00pm): Nan Tang (Qatar Computing Research Institute, QCRI): Data Preparation meets Data Cleaning details)TITLE
Data Preparation meets Data Cleaning
In this talk, I will first provide a brief overview of my research in data preparation, including data discovery, data cleaning and integration, and some systems (such as Data Civilizer, a project with MIT, and Columbus, a project with Wisconsin) to put different pieces together for data preparation. Afterwards, I will discuss why data visualization is important for data preparation, and what interesting research topics can happen when putting these two together, based on which I will present three works: (1) DeepEye: a visualization recommendation system, which was motivated by how to help users easily understand the discovered datasets in data preparation; (2) Interactive data cleaning for progressive visualization, which is to cheaply solve the problem (by 10-15 user interactions) that data visualization is bad if the data is very dirty; and (3) Cymphony: a generic system for collaborative task management, motivated by the practical scenarios that for many tasks (such as manual data cleaning, manually checking fake news), a team of users need to work collaboratively.
ICDE 2018: http://da.qcri.org/ntang/pubs/icde2018deepeye.pdf
EDBT 2018 vision paper: http://da.qcri.org/ntang/pubs/edbt2018.pdf
SIGMOD 2018 demo: http://da.qcri.org/ntang/pubs/deepeyesigmod2018demo.pdf
ICDE 2020 paper “Interactive cleaning for progressive visualization through composite questions”
Dr. Nan Tang is a senior scientist at Qatar Computing Research Institute (QCRI), HBKU, Doha, Qatar. Prior to joining QCRI, he has worked at Univeristy of Edinburgh (2010-2011) and CWI (2008-2010). He got his PhD. degree from The Chinese University of Hong Kong (2007). His research focuses on data preparation including data cleaning, entity matching, and building systems to support these tasks, and data visualization, especially on visualization recommendation and cleaning. He serves the community in many PCs, such as SIGMOD, PVLDB, ICDE, KDD. You can find more about me from my homepage: http://da.qcri.org/ntang/index.html
(12/9, 12:00pm): Oliver Kennedy (U of Buffalo): Safe, Reusable Heuristic Data Transformation through Caveats details)TITLE
Safe, Reusable Heuristic Data Transformation through Caveats
Through extensive surveys of data scientists in the field, we've identified numerous data transformation challenges, including time-series alignment, outlier management, entity resolution, parsing, and more. In each use case, a correct, fully general solution would have ben worthy of a PhD Thesis. By contrast, each of these use cases admitted a simple heuristic tailored to the specific dataset and analysis being performed, which could be scripted or applied manually in hours or even minutes. Although full generality is clearly not worth the price, dataset- and analysis- specific heuristics make it difficult to re-use the transformed dataset or transformation workflow.
In this talk, I introduce a form of lightweight data annotation called Caveats that make it easier to re-use both heuristically transformed datasets and heuristic transformation workflows. Caveats combine "where" provenance with techniques from incomplete data management to mark specific data values and records as dependent on a data transformation heuristic --- marked data elements are subject to change if the heuristic is altered. I will show how caveats can be seamlessly integrated into heuristic transformation workflows, introduce our lightweight scheme for propagating caveats through SQL queries, and demonstrate our prototype notebook, Vizier, built around Caveats.
https://odin.cse.buffalo.edu/papers/2019/submitted/CIDR-CrumbyNotebooks.pdf (to appear)
https://vizierdb.info/ (try it out!)
Oliver Kennedy is an associate professor at the University at Buffalo. He earned his PhD from Cornell University in 2011 and now leads the Online Data Interactions (ODIn) lab, which operates at the intersection of databases and programming languages. Oliver is the recipient of an NSF CAREER award, UB's Exceptional Scholar Award, and the UB SEAS Early Career Teacher of the Year Award. Several of his papers have been invited to "Best of" compilations from SIGMOD and VLDB. The ODIn lab is currently exploring uncertain data management, just-in-time data structure design, and "small data" management, as well as implementing VizierDB, a reproducibility- and reusability-focused notebook.
(1/9, 12:00pm): Kathleen Fisher (Tufts): Using a Declarative Description Language to Tame Ad Hoc Data: An Overview of the PADS project details)TITLE
Using a Declarative Description Language to Tame Ad Hoc Data: An Overview of the PADS project
Kathleen Fisher (Tufts University)
The goal of the PADS project (http://www.padsproj.org) is to make it easier for data analysts to extract useful information from ad hoc data files. This talk gives an overview of the project and how it helps bridge the gap between the unmanaged world of ad hoc data and the managed world of typed programming languages and databases. In particular, the paper reviews the design of PADS data description languages, describes the generated parsing tools and discusses the importance of meta-data. It also sketches the formal semantics, discusses useful tools and how can they can be generated automatically from PADS descriptions, and describes an inferencing system that can learn useful PADS descriptions from positive examples of the data format.
Kathleen Fisher is a Professor in and the Chair of the Computer Science Department at Tufts. Previously, she was a program manager at DARPA where she started and managed the HACMS and PPAML programs, a Consulting Faculty Member in the Computer Science Department at Stanford University, and a Principal Member of the Technical Staff at AT&T Labs Research. Kathleen is an ACM Fellow. She has served as Program Chair for PLDI, OOPSLA, ICFP, CUFP, and FOOL, and as General Chair for ICFP 2015. She is a former Associate Editor for TOPLAS and a former editor of the Journal of Functional Programming. Kathleen is a past Chair of the ACM Special Interest Group in Programming Languages (SIGPLAN) and past Co-Chair of CRA's Committee on the Status of Women (CRA-W). Kathleen is a recipient of the SIGPLAN Distinguished Service Award. She is Vice Chair of DARPA's ISAT Study Group and a member of the Board of Trustees of Harvey Mudd College.
Kathleen Fisher, David Walker: The PADS project: an overview. ICDT 2011: 11-17
(1/11, 12:00pm): Xiaoyu Liu (MIT): Kyrix: Interactive Visual Data Exploration at Scale details)TITLE
Kyrix: Interactive Visual Data Exploration at Scale
Scalable interactive visual data exploration is crucial in many domains due to increasingly large datasets generated at rapid rates. Details-on-demand provides a useful interaction paradigm for exploring large datasets, where the user starts at an overview, finds regions of interest, zooms in to see detailed views, zooms out and then repeats. This paradigm is the primary user interaction mode of widely used systems such as Google Maps, Aperture Tiles and ForeCache. These earlier systems, however, are highly customized with hardcoded visual representations and optimizations. A more general framework is needed to facilitate the development of visual data exploration systems at scale. We present Kyrix, an end-to-end system for developing scalable details-on-demand data exploration applications. Kyrix provides the developer with a declarative model for easy specification of general visualizations. Behind the scenes, Kyrix utilizes a suite of performance optimization techniques to achieve a response time within 500 ms for various user interactions. We also report results from a performance study which shows that a novel dynamic fetching scheme adopted by Kyrix outperforms tile-based fetching used in traditional systems.
Xiaoyu is a research intern at MIT data system group. Before that, she completed her master at Purdue, ECE department. She is interested in building scalable and reliable data-intensive systems.
(1/23, 12:00pm): Panayiotis Tsaparas (University of Ioannina): Finding Patterns in Temporal and Flow Networks details)TITLE
Finding Patterns in Temporal and Flow Networks
Networks are natural models for complex systems consisting of multiple interconnected entities. Many of these systems are dynamic, with connections appearing and disappearing over time. In the resulting networks, edges are annotated with time stamps, resulting in a graph history consisting of multiple network snapshots over time. We refer to such networks as temporal networks (graphs).
In this talk we consider two knowledge extraction problems on temporal networks. First, we consider the problem of finding lasting dense subgraphs. We provide formal definitions of graph density for a temporal graph, and we formulate the BFF problem that seeks a subset of nodes that are densely connected over time. Furthermore, we also consider the problem of finding a set of graph snapshots in which there are dense subsets of nodes. We study the complexity of the problems and propose exact, approximation and heuristic algorithms. Experiments indicate that our approach can find interesting patterns in collaboration and word co-occurrence networks.
In the second problem we assume that edges are also associated with a value, indicating a flow between the two nodes at a specific time stamp. Such networks appear naturally in practice in the case of money exchange networks (e.g., bitcoin), or traffic networks. We are interested in finding motifs in these networks, that is, small subgraphs that appear more often than random. Our motifs are restricted temporally and with respect to flow: all interactions must happen within a given time-window, and they must involve a minimum amount of flow. We propose efficient algorithms for finding such motifs and experiment with them on real networks.
Konstantinos Semertzidis, Evaggelia Pitoura, Evimaria Terzi, Panayiotis Tsaparas. Finding lasting dense subgraphs. ECML/PKDD Journal Track (DAMI), 2019 (to appear)
Chrysanthi Kosyfaki, Nikos Mamoulis, Evaggelia Pitoura, Panayiotis Tsaparas. Flow motifs in interaction networks. EDBT 2019 (to appear).
Panayiotis Tsaparas completed his undergraduate studies at the Department of Computer Science at University of Crete, Greece in 1995. He continued his graduate studies at University of Toronto, where he received his M.Sc., and Ph.D. degree, under the supervision of Allan Borodin. After graduation, he worked as a post-doctoral fellow at University of Rome, “La Sapienza”, and at University of Helsinki, and as a researcher at Microsoft Research. Since 2011 he joined the Department of Computer Science and Engineering at University of Ioannina, where he is now an Associate Professor. His research interests include Social Network Analysis, Algorithmic Data Mining, Web Mining and Information Retrieval.
(1/25, 1:15pm): Fatemeh Nargesian (University of Toronto): Table Union and Navigation details)TITLE
Table Union and Navigation
Preparing data for advanced analytics is prohibitively time-consuming and expensive for all but the best-trained and best-funded engineers. Nevertheless, the success of trained systems often depends on data and generated features more than powerful statistical algorithms. Among the data management challenges that need to be addressed is data enrichment which is the discovery and integration of meaningful data in data lakes for a data science task. However, there are many challenges to overcome: (1) the sheer size of data, (2) the unique distributions and characteristics of data lakes, and (3) the probabilistic and human-in-the-loop nature of data discovery. In this talk, I discuss two prevalent data discovery scenarios. In the first scenario, the query is a dataset and the data scientist is interested in interactively finding datasets that can be integrated (e.g unioned) with the query. I will introduce a probabilistic framework for finding and aligning unionable tables with a query table and discuss the need for distribution-aware techniques for data discovery. In the second scenario, search does not start with a query, instead, it is data-driven. I will talk about data lake organization problem where the goal is to find an organization (a directed acyclic graph) that allows a user to most efficiently navigate data lakes. I will present a probabilistic navigation model of how users interact with an organization and introduce a scalable structure learning algorithm for optimizing data lake organizations.
Nargesian, Zhu, Pu, Miller: Table Union Search on Open Data, PVLDB 2018.
Fatemeh Nargesian is a PhD student in the Data Curation Group of the Department of Computer at University of Toronto. Her research focuses on optimizing and automating data preparation for end-to-end data science, encompassing dataset discovery in data lakes and enriching datasets with new features. While at University of Toronto, Fatemeh was a joint Research intern at IBM Research-NY. Prior to University of Toronto, Fatemeh worked at the Clinical Informatics research group at McGill University on clinical data management, and received M.Sc. degrees in computer science and artificial intelligence at University of Toronto and Sharif University of Technology.
(1/28, 12:00pm): Stefano Ceri (Politecnico di Milano): WVH 366: Data-Driven Genomic Computing: Making Sense of the Signals from the Genome details)TITLE
Data-Driven Genomic Computing: Making Sense of the Signals from the Genome
Genomic computing is a new science focused on understanding the functioning of the genome, as a premise to fundamental discoveries in biology and medicine. Next Generation Sequencing (NGS) allows the production of the entire human genome sequence at a cost of about 1000 US $; many algorithms exist for the extraction of genome features, or "signals", including peaks (enriched regions), variants, or gene expression (intensity of transcription activity). The missing gap is a system supporting data integration and exploration, giving a “biological meaning” to all the available information; such a system can be used, e.g., for better understanding cancer or how environment influences cancer development.
The GeCo Project (Data-Driven Genomic Computing, ERC Advanced Grant, 2016-2021) has the objective or revisiting genomic computing through the lens of basic data management, through models, languages, and instruments, focusing on genomic data integration. Starting from an abstract model, we developed a system that can be used to query processed data produced by several large Genomic Consortia, including Encode and TCGA; the system employs internally the Spark engine, and prototypes can already be accessed from Polimi, from Cineca (Italian supercomputing center) and from the Broad Institute in Cambridge. During the five-years of the ERC project, the system will be enriched with data analysis tools and environments and will be made increasingly efficient. Among the objectives of the project, the creation of an “open source” repository of public data, available to biological and clinical research through queries, web services and search interfaces.
Stefano Ceri is professor of Database Systems at the Dipartimento di Elettronica, Informazione e Bioingegneria (DEIB) of Politecnico di Milano. His research work covers four decades (1978-2018) and has been generally concerned with extending database technologies in order to incorporate new features: distribution, object-orientation, rules, streaming data; with the advent of the Web, his research has been targeted towards the engineering of Web-based applications and to search systems. More recently he turned to genomic computing. He authored over 350 publications (H-index 75) and authored or edited 15 books in English. He is the recipient of two ERC Advanced Grants: "Search Computing (SeCo)" (2008-2013), focused upon the rank-aware integration of search engines in order to support multi-domain queries and “Data-Centered Genomic Computing (GeCo)” (2016-2021), focused upon new abstractions for querying and integrating genomic datasets. He is the recipient of the ACM-SIGMOD "Edward T. Codd Innovation Award" (New York, June 26, 2013), an ACM Fellow and a member of Academia Europaea.
(1/28, 12:30pm): Marco Brambilla (Politecnico di Milano): Extraction of Evolving Knowledge from Social Media details)TITLE
Extraction of Evolving Knowledge from Social Media
Knowledge in the world continuously evolves. Ontologies that aim at formalizing this knowledge are largely incomplete, especially regarding data belonging to the so-called long tail. On the other side, informal sources such has social media are typically very up to date with respect to facts, events and relations between real-world entities. We propose a method for discovering emerging knowledge by extracting it from social content. Once initialized by domain experts, the method is capable of finding relevant entities by means of a mixed syntactic-semantic method. The method uses seeds, i.e. prototypes of emerging entities provided by experts, for generating candidates; then, it associates candidates to feature vectors built by using terms occurring in their social content and ranks the candidates by using their distance from the centroid of seeds, returning the top candidates. Our method can run iteratively, using the results as new seeds. The talk will describe the different extraction techniques, the advantages obtained by combining them, and the results of the experiments performed with the different methods.
Extracting Emerging Knowledge from Social Media. WWW 2017 https://dl.acm.org/citation.cfm?id=3052697
Iterative Knowledge Extraction from Social Networks. WWW Comp. 2018 https://dl.acm.org/citation.cfm?id=3191578
Marco Brambilla is associate professor at Politecnico di Milano. His research interests include data science, domain specific modeling languages and design patterns, crowdsourcing, social media monitoring, and big data analysis. He has been visiting researcher at CISCO, San Josè, and University of California, San Diego. He has been visiting professor at Dauphine University, Paris. He is co-founder of the startups Fluxedo, focusing on social media analysis and Social engagement, and WebRatio, devoted to software modeling tools for Web, Mobile and Business Process based software applications. He is author of various international books and research articles in journals and conferences, with over 200 papers. He was awarded various best paper prizes and gave keynotes and speeches at many conferences and organisations. He runs research projects on data science and industrial projects on data-driven innovation and big data. He is the main author of the OMG standard IFML.
(1/30, 12:00pm): Evimaria Terzi (BU): Active Matrix Completion Problems details)TITLE
Active Matrix Completion Problems
In many applications, e.g., recommender systems and traffic monitoring, the data comes in the form of a matrix that is only partially observed. A fundamental data-analysis task for these datasets is matrix completion, where the goal is to accurately infer the entries missing from the matrix. In this talk we will consider matrices that satisfy one of the two — very common — assumptions: low rank and positive definite. In both cases we will consider the active version of the matrix-completion problem that says: given access to an oracle that can obtain some (small) number of entries of the original matrix which entries shall we query so that we achieve the least completion error. We will demonstrate how this question can be transformed into elegant combinatorial problems and we will discuss algorithms for solving them. We will also argue that our combinatorial formulations are very different from existing work in the matrix-completion literature.
TALK BASED ON TWO PAPERS:
Charalampos Mavroforakis, Dóra Erdös, Mark Crovella, Evimaria Terzi: Active Positive-Definite Matrix Completion. SDM 2017.
Natali Ruchansky, Mark Crovella, Evimaria Terzi: Matrix Completion with Queries. KDD 2015.
Evimaria Terzi is an associate professor at the Computer Science Department at Boston University, where she also serves as an Associate Chair. Before joining BU in 2009, she was a research scientist at IBM Almaden Research Center. Evimaria has received her Ph.D. from University of Helsinki, Finland and her MSc from Purdue University. Evimaria is a recipient of the Microsoft Faculty Fellowship (2010) and NSF CAREER award and multiple NSF awards. Her research interests span a wide range of data-mining topics including algorithmic problems arising in recommendation systems, online social networks and social media.
(2/1, 12:00pm): Huy Ngyuen (Northeastern): Submodular Maximization with Nearly-optimal Approximation and Adaptivity details)TITLE
Submodular Maximization with Nearly-optimal Approximation and Adaptivity
In this talk, we present recent progress on understanding the tradeoff between the approximation guarantee and adaptivity for submodular maximization. The adaptivity of an algorithm is the number of sequential rounds of queries it makes to the evaluation oracle of the function, where in every round the algorithm is allowed to make polynomially-many parallel queries. Adaptivity is an important consideration in settings where the objective function is estimated using samples and in applications where adaptivity is the main running time bottleneck. We present nearly-optimal algorithms for submodular maximization subject to a variety of constraints.
Huy Lê Nguyen is an Assistant Professor of Computer Science in the College of Computer and Information Science at Northeastern University. Prior to joining Northeastern, he was a Research Assistant Professor at the Toyota Technological Institute in Chicago and before that, a Google Research Fellow at the Simons Institute at University of California, Berkeley. He received his PhD in Computer Science from Princeton University. Professor Nguyen is broadly interested in the design and analysis of algorithms, with an emphasis on algorithmic techniques for massive data sets and machine learning.
(2/13, 12:00pm): Sarah Ostadabbas (Northeastern): Physics-based Simulation to Bootstrap Learning in Small Data Domains details)TITLE
Physics-based Simulation to Bootstrap Learning in Small Data Domains
Deep learning (DL) regularly obliterate records for regression and classification tasks that have previously seen only incremental accuracy improvements. Furthermore, the training process is much more automated than classical techniques, no longer requiring a huge investment in feature selection and dataset pruning. However, this performance comes at a large data cost, frequently requiring upwards of 10^9 data/label pairs. Our digital economy has provided many problems for which such data exists or can be obtained cheaply relative to the benefits. There are many other fields that would get significant benefit from DL, but where data collection or labelling is expensive. This is very common for medical and military applications where data collection and/or labeling is expensive, individualized, and protected by very strong privacy or classification laws. Many applications will benefit from a learning framework with deep structure that works with limited labeled training samples, integrates domain-knowledge into the model, and maximizes the generalization of learning across domains. In this talk, I introduce our proposed 3-step framework that enables training of accurate and robust learning models under data limitation constraints based the use of simulation as its generative models. This framework includes: (i) employment of physics-based computational models referred to as simulation; (ii) design and analysis of unsupervised domain adaption techniques to close the gap between the simulated and real-world data distributions through a low-dimensional subspace transformation; (iii) development of learning techniques in the projected subspace to train an initial weak labeler; (iv) combined use of the weak labeler and a generative-adversarial framework to refine the simulated datasets by employing on a set of unlabeled real-world dataset in order to train a strong labeler; and (v) development and analysis of active learning techniques to select the most informative datasets to refine and adapt the strong labeler into a novel case with small data in the target application.
Inner Space Preserving Generative Pose Machine
Background Subtraction via Fast Robust Matrix Completion
Professor Ostadabbas is an assistant professor in the Electrical and Computer Engineering Department of Northeastern University (NEU), Boston, Massachusetts, USA. Professor Ostadabbas joined NEU in 2016 from Georgia Tech, where she was a post-doctoral researcher following completion of her PhD at the University of Texas at Dallas in 2014. At NEU, Professor Ostadabbas is the director of the Augmented Cognition Laboratory (ACLab) with the goal of enhancing human information-processing capabilities through the design of adaptive interfaces via physical, physiological, and cognitive state estimation. These interfaces are based on rigorous models adaptively parameterized using machine learning and computer vision algorithms. In particular, she has been integrating domain knowledge with machine learning by using physics-based simulation as generative models for bootstrapping deep learning recognizers. Professor Ostadabbas is the co-author of more than 50 peer-reviewed journal and conference articles and her research has been awarded by the National Science Foundation (NSF), Mathworks, Amazon AWS, and NVIDIA. She is the co-organizer of the Multimodal Data Fusion (MMDF2018) workshop, an NSF PI mini-workshop on Deep Learning in Small Data, and will be the program chair of the upcoming Machine Learning in Signal Processing (MLSP2019) conference. Prof. Ostadabbas is an associate editor of the IEEE Transactions on Biomedical Circuits and Systems, on the Editorial Board of the IEEE Sensors Letters and Digital Biomarkers Journal, and has been serving in several signal processing and machine learning conferences as a technical chair or session chair.
(2/15, 12:00pm): Olga Papaemmanouil (Brandeis University): Deep Learning meets Query Optimization details)TITLE
Deep Learning meets Query Optimization
Query optimization remains one of the most important and well studied problems in database systems. However, traditional query optimizers are complex, heuristically-driven systems that do not to learn from past experiences: they plan the execution of a query, but are ignorant of the actual performance of the picked plan. Because of the lack of feedback, a query optimizer may select the same bad query plan repeatedly, never learning from its previous good or bad choices.
In this talk, I will argue that a new type of query optimizer, one that integrates deep learning with query optimization, can drastically improve on the state-of-the-art. Towards this direction, I will discuss ReJOIN, a proof-of-concept join enumerator that relies on deep reinforcement learning. ReJOIN leverages prior experience, and learns how to optimize future queries more effectively (i.e., discovers better query plans) and efficiently (i.e., spending less time on optimization) compared with traditional optimizers. I will discuss potential challenges for future research and describe deep learning approaches that can lead the way to end-to-end learning-based query optimizers.
Olga Papaemmanouil is an Associate Professor in the Department of Computer Science at Brandeis University. She received her undergraduate degree in Computer Science and Informatics at the University of Patras, Greece in 1999. In 2001, she received her Sc.M. in Information Systems at the University of Economics and Business, Athens, Greece. She then joined the Computer Science Department at Brown University, where she completed her Ph.D in Computer Science at Brown University in 2008. Her research interests are in databases and distributed data management. She is the recipient of an NSF Career Award (2013) and a Paris Kanellakis Fellowship from Brown University (2002).
(3/5, 3:30pm): Laura Di Rocco (Northeastern): The role of geographical knowledge in subcity level geolocation algorithms details)ABSTRACT
Geolocation of microblog messages has been largely investigated in the literature. Many solutions exist, most of whichare data-driven (i.e., they rely on a training phase) and achieve good results at city-level. The development of algorithmsfor geolocation at sub-city level is still an open problem. In this paper, we investigate the role that external geographic knowledge can play in achieving reasonably accuratesub-city level geolocation. We propose a knowledge-base method, called Sherloc, to accurately geolocate messages at sub-city level, by exploiting the presence of toponyms in the message. Sherloc extracts the semantics of toponyms from geographic gazetteers and embeds them into a metric space that captures the semantic distance among them. This way we can identify the semantically closest toponyms to a microblog message and, at the same time, cluster them withrespect to their spatial locations. In contrast to the state of the art methods, Sherloc requires no prior training, it is not limited to geolocating on a fixed spatial grid, and is able to infer the location at sub-city level with higher accuracy.
(3/29, 11:00am): Yongjoo Park (University of Michigan): Bringing Statistical Tradeoffs to Data Systems details)TITLE
Bringing Statistical Tradeoffs to Data Systems
Despite advances in computing power, the cost of large-scale analytics and machine learning remains daunting to small and large enterprises alike. This has created a pressing demand for reducing infrastructure costs and query latencies. To meet these goals, data analysts and applications are in many cases willing to tolerate a slight—but controlled—degradation of accuracy in exchange for substantial gains in cost and performance, which we refer to as statistical tradeoffs. This is particularly true in the early stages of data exploration and is in stark contrast to traditional tradeoffs where the infrastructure costs must increase for higher performance.
My research builds large-scale data systems that can make these statistical tradeoffs in a principled manner. In this talk, I will focus on two specific directions. First, I will present VerdictDB, a system that enables quality-guaranteed, statistical tradeoffs without any changes to backend infrastructure; thus, it offers a universal solution for off-the-shelf query engines. Second, I will introduce Database Learning, a new query execution paradigm that allows existing query engines to constantly learn from their past executions and become “smarter” over time without any user intervention. I will conclude by briefly discussing other promising directions with emerging workloads beyond SQL, including visualization and machine learning.
(4/2, 10:00am): WVH 155/168: Volker Markl & members of his research group details)Four students from Prof Volker Markl's group in TU Berlin will present their work within 15min each (after general introductions by Renee and Volker).
Bio: Volker Markl is a Full Professor and Chair of the Database Systems and Information Management (DIMA) Group at the Technische Universität Berlin (TU Berlin). At the German Research Center for Artificial Intelligence (DFKI), he is both a Chief Scientist and Head of the Intelligent Analytics for Massive Data Research Group. In addition, he is Director of the Berlin Big Data Center (BBDC) and Co- Director of the Berlin Machine Learning Center (BzMl). Earlier in his career, he was a Research Staff Member and Project Leader at the IBM Almaden Research Center in San Jose, California, USA and a Research Group Leader at FORWISS, the Bavarian Research Center for Knowledge-based Systems located in Munich, Germany. Dr. Markl has published numerous research papers on indexing, query optimization, lightweight information integration, and scalable data processing. He holds 18 patents, has transferred technology into several commercial products, and advises several companies and startups. He has been both the Speaker and Principal Investigator for the Stratosphere Project, which resulted in a Humboldt Innovation Award as well as Apache Flink, the open-source big data analytics system. He serves as the President-Elect of the VLDB Endowment and was elected as one of Germany's leading Digital Minds (Digitale Köpfe) by the German Informatics (GI) Society. Most recently, Volker and his team earned an ACM SIGMOD Research Highlight Award 2016 for their work on “Implicit Parallelism Through Deep Language Embedding.” Volker Markl and his team earned an ACM SIGMOD Research Highlight Award 2016 for their work on implicit parallelism through deep language embedding.
LINKS: Prof Markl's group website: http://www.dima.tu-berlin.de
Speaker: Sebastian Breß
Title: Efficient Data Processing on Modern Hardware
Abstract: Processor manufacturers build increasingly specialized processors to mitigate the effects of the power wall in order to deliver improved performance. Currently, database engines have to be manually optimized for each processor which is a costly and error prone process. In this talk, we provide an overview of our research on automatic performance tuning in Hawk, a hardware-tailored code generator. Our key idea is to create processor-specific code variants and to learn a well- performing code variant for each processor. These code variants leverage various parallelization strategies and apply both generic and processor-specific code transformations. We observe that performance of different code variants may diverge up to two orders of magnitude. Thus, we need to generate custom code for each processor for peak performance. To this end, Hawk automatically finds efficient code variants for CPUs, GPUs, and MICs.
Bio: Sebastian Breß received his PhD (Dr.-Ing.) from University of Magdeburg, Germany in 2015, under the supervision of Gunter Saake (University of Magdeburg) and Jens Teubner (TU Dortmund). He is the the initiator and system architect of the research database system CoGaDB and the Hawk Code Generator. Currently, Sebastian is a Senior Researcher at German Research Center for Artificial Intelligence (DFKI) and a PostDoc at Technische Universität Berlin, working with Prof. Dr. Volker Markl and Prof. Dr. Tilmann Rabl. Sebastian‘s research interests include data management on modern hardware, query compilation, stream processing, and optimizing data management systems for heterogeneous processors. Sebastian has been selected as a Distinguished Program Committee Member at SIGMOD 2018.
Speaker: Jonas Traub
Title: On-Demand Data Stream Gathering in the Internet of Things
Abstract: In the Internet of Things (IoT), billions of sensor nodes form a sensor cloud and offer data streams to analysis systems. However, it is impossible to transfer all available data with maximal frequencies to all applications. In this talk, we show how we gather streaming data efficiently from a huge number of sensor nodes and how we make it available to a huge number of applications. Our key-idea is to gather data streams based on the data demand of streaming queries. The data demand of a query is the minimum number of data points which allows for answering the query with the desired precision. We present a solution which shares sensor nodes among many concurrent streaming queries by multiplexing their data demands. Our technique shares sensor reads and corresponding network traffic among all queries to save costs. Our experiments with real-world sensor data show that our approach saves up to 87% in data transmissions.
Bio: Jonas is a Research Associate at Technische Universität Berlin and the German Research Center for Artificial Intelligence (DFKI). His research interests include data stream processing, sensor data analysis, and data acquisition from sensor nodes. Jonas authored several publications related to data stream gathering, processing and transmission in the Internet of Things and will complete his PhD in March 2019 under the supervision of Volker Markl. Before he started his PhD, Jonas wrote his master thesis at the Royal Institute of Technology (KTH) and the Swedish Institute of Computer Science (SICS) / RISE in Stockholm under supervision of Seif Haridi and Volker Markl and advised by Paris Carbone and Asterios Katsifodimos. Prior to that, he received his B.Sc. degree at Baden-Württemberg Cooperative State University (DHBW Stuttgart) and worked several years at IBM in Germany and the USA. Jonas is an alumnus of "Software Campus", "Studienstiftung des deutschen Volkes" and "Deutschlandstipendium". All publication and supplementary material are available on http://www.user.tu-berlin.de/powibol/.
Speaker: Andreas Kunft
Title: Efficient Matrix Partitioning Through Joins
Abstract: End-to-end machine learning pipelines often consist of (i) relational operators to join the input data, (ii) user defined functions for feature extraction and vectorization, and (iii) linear algebra operators for model training and cross-validation. Often, these pipelines need to scale out to large datasets. In this case, these pipelines are usually implemented on top of dataflow engines like Hadoop, Spark, or Flink. Dataflow engines implement relational operators on row-partitioned datasets. However, efficient linear algebra operators use block-partitioned matrices. As a result, pipelines combining both kinds of operators require rather expensive changes to the physical representation, in particular re-partitioning steps. In this talk, I investigate the potential of reducing shuffling costs by fusing relational and linear algebra operations into specialized physical operators. I present BlockJoin, a distributed join algorithm which directly produces block-partitioned results. To minimize shuffling costs, BlockJoin applies database techniques known from columnar processing, such as index-joins and late materialization, in the context of parallel dataflow engines.
Bio: Andreas Kunft is a PhD student/research associate at Technische Universität Berlin in the Database Systems and Information Management Group (DIMA) led by Volker Markl. His research interests include massive parallel processing frameworks, databases, and compilers, with focus on the integration of database and compiler techniques for holistic optimization of analytics pipelines.
Speaker: Martin Kiefer
Title: Approximate Data Analysis using Modern Hardware
Abstract: Data summaries are an effective tool to balance efficiency and accuracy in data analysis tasks: Most relational query optimizers rely heavily on selectivity estimates computed from histograms or samples. Sketches have become an increasingly popular technique to perform high-bandwidth stream analysis tasks such as change or heavy-hitter detection. Using modern hardware such as GPUs and FPGAs reduces the cost of approximate
(4/5, 12:00pm): Tingjian Ge (UMass Lowell): Dense regions: a different type of event in graph streams details)TITLE
Dense regions: a different type of event in graph streams
A graph stream, as a special data stream, consists of a sequence of edge insertions and deletions over a dynamic graph. Analogous to general data streams, complex events have been studied on graph streams which incorporate both time order and structural relationship over basic edge events. In this talk, I will propose a new type of complex event – the occurrence of a dense region that also lasts for a long time. This turns out to be very useful in applications of various domains including telecommunication hotspot detection, road traffic control, spam network filtering, and dynamic community detection.
To efficiently discover dense regions, we propose an algorithmic framework called the Expectation-Maximization with a Utility function (EMU), a novel stochastic approach that nontrivially extends the conventional EM. We validate our EMU approach by showing that it converges to the optimum—by proving that it is a specification of the general Minorization-Maximization (MM) framework with convergence guarantees. We then devise EMU algorithms for the densest lasting subgraph problem. Using real-world graph streams, we experimentally verify the effectiveness and efficiency of our techniques.
Tingjian Ge is an associate professor in Computer Science at the University of Massachusetts, Lowell. He received a Ph.D. from Brown University in 2009. Prior to that, he got his Bachelor’s and Master’s degrees in Computer Science from Tsinghua University and UC Davis, respectively, and worked at Informix and IBM for six years. His research areas are in data management and analytics, with a recent focus on applying machine learning, AI, and algorithmic techniques in data management and mining. He is a recipient of the NSF CAREER Award in 2012, and a Teaching Excellence Award at UMass Lowell in 2014.
(9/13, 10:40am): WVH 366: Anshumali Shrivastava (Rice University): Hashing Algorithms for Extreme Scale Machine Learning details)TITLE
Hashing Algorithms for Extreme Scale Machine Learning.
In this talk, I will discuss some of my recent and surprising findings on the use of hashing algorithms for large-scale estimations. Locality Sensitive Hashing (LSH) is a hugely popular algorithm for sub-linear near neighbor search. However, it turns out that fundamentally LSH is a constant time (amortized) adaptive sampler from which efficient near-neighbor search is one of the many possibilities. Our observation adds another feather in the cap for LSH. LSH offers a unique capability to do smart sampling and statistical estimations at the cost of few hash lookups. Our observation bridges data structures (probabilistic hash tables) with efficient unbiased statistical estimations. I will demonstrate how this dynamic and efficient sampling beak the computational barriers in adaptive estimations where, for the first time, it is possible that we pay roughly the cost of uniform sampling but get the benefits of adaptive sampling. We will demonstrate the power of one simple idea for three favorite problems 1) Partition function estimation for large NLP models such as word2vec, 2) Adaptive Gradient Estimations for efficient SGD and 3) Sub-Linear Deep Learning with Huge Parameter Space.
In the end, if time permits, we will switch to memory cost show a simple hashing algorithm that can shrink memory requirements associated with classification problems exponentially! Using our algorithms, we can train 100,000 classes with 400,000 features, on a single Titan X while only needing 5% or less memory required to store all the weights. Running a simple logistic regression on this data, the model size of 320GB is unavoidable.
Anshumali Shrivastava is an assistant professor in the computer science department at Rice University. His broad research interests include randomized algorithms for large-scale machine learning. He is a recipient of National Science Foundation (NSF) CAREER Award, a Young Investigator Award from Air Force Office of Scientific Research (AFOSR), and machine learning research award from Amazon. His research on hashing inner products has won Best Paper Award at NIPS 2014 while his work on representing graphs got the Best Paper Award at IEEE/ACM ASONAM 2014. Anshumali got his PhD in 2015 from Cornell University.
(9/26, 12:00pm): Spyros Blanas (Ohio State University): Scaling database systems to high-performance computers details)TITLE
Scaling database systems to high-performance computers
We are witnessing the increasing use of warehouse-scale computers to analyze massive datasets quickly. This poses two challenges for database systems. The first challenge is interoperability with established analytics libraries and tools. Massive datasets often consist of images (arrays) in file formats like FITS and HDF5. To analyze such datasets users turn to domain-specific libraries and deep learning frameworks, and thus write code that directly manipulates files. We will first present ArrayBridge, an open-source I/O library that allows SciDB, TensorFlow and HDF5-based programs to co-exist in a pipeline without converting between file formats. With ArrayBridge, users benefit from the optimizations of a database system without sacrificing the ability to directly manipulate data through the existing HDF5 API when they want to.
The second challenge is scalability, as warehouse-scale computers expose communication bottlenecks in foundational data processing operations. This talk will focus on data shuffling and parallel aggregation. We will first present an RDMA-aware data shuffling algorithm that transmits data up to 4X faster than MPI. This is achieved by switching to a connectionless, datagram-based network transport layer that scales better but requires flow control in software. We will then present a parallel aggregation algorithm for high-cardinality aggregation that carefully schedules data transmissions to avoid unscaleable all-to-all communication. The algorithm leverages similarity to transmit less data over congested network links. We will conclude by highlighting additional challenges that need to be overcome to scale database systems to massive computers.
Spyros Blanas is an assistant professor in the Department of Computer Science and Engineering at The Ohio State University. His research interest is high performance database systems, and his current goal is to build a database system for high-end computing facilities. He has received the IEEE TCDE Rising Star award and a Google Research Faculty award. He completed his Ph.D. at the University of Wisconsin–Madison where part of his Ph.D. dissertation was commercialized in Microsoft SQL Server as the Hekaton in-memory transaction processing engine.
(9/28, 12:00pm): Stratos Idreos (Harvard): The periodic table of data structures details)TITLE
The periodic table of data structures.
Data structures are critical in any data-driven scenario, and they define the behavior of modern data systems and data-driven algorithms. However, they are notoriously hard to design due to a massive design space and the dependence of performance on workload and hardware which evolve continuously.
What if we knew how many and which data structures are possible to design? What if we could compute the expected performance of a data structure design on a given workload and hardware without having to implement it and without even having access to the target machine? We will discuss our quest for 1) the first principles of data structures, 2) design continuums that make it possible to automate design, and 3) self-designing systems that can morph between what we now consider fundamentally different structures. We will draw examples from the NoSQL key-value store design space and discuss how to accelerate them and balance space-time tradeoffs.
Stratos Idreos is an assistant professor of Computer Science at Harvard University where he leads DASlab, the Data Systems Laboratory. Stratos was awarded the 2011 ACM SIGMOD Jim Gray Doctoral Dissertation award and the 2011 ERCIM Cor Baayen award. He is also a recipient of an IBM zEnterpise System Recognition Award, a VLDB Challenges and Visions best paper award and an NSF Career award. In 2015 he was awarded the IEEE TCDE Rising Star Award from the IEEE Technical Committee on Data Engineering for his work on adaptive data systems.
Prior recorded talk at UW:
(10/4, 4:30pm): Angelika Kimmig (Cardiff University): A Collective, Probabilistic Approach to Schema Mapping details)TITLE
A Collective, Probabilistic Approach to Schema Mapping
We propose a probabilistic approach to the problem of schema mapping. Our approach is declarative, scalable, and extensible. It builds upon recent results in both schema mapping and probabilistic reasoning and contributes novel techniques in both ﬁelds. We introduce the problem of mapping selection, that is, choosing the best mapping from a space of potential mappings, given both metadata constraints and a data example. As selection has to reason holistically about the inputs and the dependencies between the chosen mappings, we deﬁne a new schema mapping optimization problem which captures interactions between mappings. We then introduce Collective Mapping Discovery (CMD), our solution to this problem using state-of-the-art probabilistic reasoning techniques, which allows for inconsistencies and incompleteness. Using hundreds of realistic integration scenarios, we demonstrate that the accuracy of CMD is more than 33% above that of metadata-only approaches already for small data examples, and that CMD routinely ﬁnds perfect mappings even if a quarter of the data is inconsistent.
Angelika Kimmig is a Lecturer at Cardiff University, UK. She obtained her Ph.D. from KU Leuven, Belgium, and was a postdoctoral fellow at KU Leuven and the University of Maryland, College Park. Her research interests include symbolic AI, reasoning under uncertainty, machine learning, logic programming, and especially combinations thereof such as probabilistic programming and statistical relational learning. She is a key contributor to the probabilistic logic programming language ProbLog.
PAPER from ICDE 2017:
(10/26, 12:00pm): Mania Abdi (Northeastern): D3N: A multi-layer cache for improving big-data applications’ performance in data centers with imbalanced networks details)TITLE
D3N: A multi-layer cache for improving big-data applications’ performance in data centers with imbalanced networks
Caching methods for improving the performance of datalakes for throughput-bound big-data jobs assume unlimited bandwidth across the datacenter. However, most enterprise and academic datacenters grow organically and have heavily oversubscribed networks between different computer clusters. This paper describes D3N, an architecture that caches data on the access side of potential network bottlenecks and uses a multilayer approach. In the case of a two layer cache, reuse distances are dynamically computed along with cache miss costs to determine the partitioning of the cache into L1 (used for local accesses) and L2 (used for remote access, but within same access side) layers. We have implemented a prototype of D3N by modifying Ceph’s RADOS gateway. Micro and macro evaluations of the prototype demonstrate that the implementation is performant enough to saturate the (40,Gbit/s) NICs and (5 GB/s read) SSD of our caching server and deliver substantial storage bandwidth improvements for real workloads. Numerical models show that a multi-level, dynamic cache can have substantial advantages over today’s single-level caches when bandwidth is constrained.
Mania Abdi is a PhD student at the Northeastern University Solid State Storage research group. Prior to that, she was a software engineer. She received her MSc. in Computer Engineering at the Sharif University of Technology and B.Eng. in Computer Engineering at the Amirkabir University of Technology. Mania is a computer systems researcher with a storage focus and has worked on a broad set of topics, including distribted storage, caching, data center debuging, and end-to-end tracing.
(11/7, 12:00pm): Laurel Orr (University of Washington): Probabilistic Database Summarization for Interactive Data Exploration details)TITLE:
Probabilistic Database Summarization for Interactive Data Exploration
A fundamental assumption of traditional DBMSs is that the database contains all information necessary to answer a query; i.e., the database contains the entire universe of data. Many data scientists, however, do not have access to the universe of data and instead rely on samples to answer queries. These data scientists need to use tools outside of the database or alter their queries to correctly reweight and debias their samples to get a more accurate query answer. This talk presents preliminary research into building the first open world database system (OWDB) that inherently assumes relations are samples from some universe, even if the sampling mechanism is unknown. We will discuss two different approaches for building an OWDB: probabilistic modeling of the universe and sample reweighting. We then present EntropyDB, a system that takes the former approach to summarize a database, and we lastly discuss ongoing research into the later approach using Bayesian Networks. We conclude with discussing the open challenges in developing an OWDB.
Laurel Orr is a 6th year PhD student in the Database Group at the Paul G Allen School for Computer Science and Engineering. Her research interests are data summarization, approximate query processing, data exploration, and the general development of tools and techniques to aid data scientists in their data investigation and analysis pipeline. Her current research goal is to develop a prototype open world database system that inherently treats relations a samples drawn from some larger universe of data, even when the sampling mechanism is unknown. She was a 2018 NCWIT Collegiate Award Honorable Mention and was awarded a NSF Graduate Research Fellowship in 2015.
(11/9, 12:00pm): Raul Castor Fernandez (MIT): Aurum: A Data Discovery System details)TITLE
Aurum: A Data Discovery System
Organizations store data in hundreds of different data sources, including relational databases, files, and large data lake repositories. These data sources contain valuable information and insights that can be beneficial to multiple aspects of modern data-driven organizations. However, as more data is produced, our ability to use it reduces dramatically, as no single person knows about all the existent data sources. One big challenge is to discover the data sources that are relevant to answer a particular question. Aurum is a data discovery system to answer "discovery queries" on large volumes of data. In this talk, I'll motivate the data discovery problem with use cases from different industries. I will describe Aurum's design and I will talk a bit about a new research project that aims to discover data beyond tables.
In my research, I build high-performance and scalable systems to discover, prepare and process data. I'm a postdoctoral researcher at MIT, working with Sam Madden and Mike Stonebraker. Before, I completed my PhD at Imperial College London with Peter Pietzuch.
(11/14, 12:00pm): Lawson Wong (Northeastern): Abstraction in robotics details)TITLE
Abstraction in robotics
Robotics is a big data problem. To make sense of the physical world, perform tasks well, and generalize across environments, robots need to represent and understand the world at the "correct" level of abstraction. What "correct" should mean remains to be seen.
In this talk, I will describe two lines of work that attempt to answer this question from very different perspectives. I will first discuss work on grounding natural language instructions to robot behavior, where we demonstrate that having the right representations can enable human-robot communication. This is an important problem for robotics, since we envision users using natural language to instruct robots to perform a wide variety of tasks. In the second half, I will discuss recent preliminary work on the theoretical foundations of state abstraction in reinforcement learning, a common framework used in robot learning problems. In particular, we view state abstraction as data compression, and apply results in information theory (rate-distortion theory) to the reinforcement learning setting.
Time permitting, I will describe some extensions to the above work, as well as other abstraction-related problems, that I envision my group will pursue at Northeastern.
Lawson L.S. Wong is an assistant professor in the College of Computer and Information Science at Northeastern University. His research focuses on learning, representing, and estimating knowledge about the world that an autonomous robot may find useful. Prior to Northeastern, Lawson was a postdoctoral fellow at Brown University. He completed his PhD at the Massachusetts Institute of Technology. He has received a Siebel Scholarship, AAAI Robotics Student Fellowship, and Croucher Foundation Fellowship for Postdoctoral Research.
Sequence-to-Sequence Language Grounding of Non-Markovian Task Specifications
Nakul Gopalan, Dilip Arumugam, Lawson L.S. Wong, Stefanie Tellex
Robotics: Science and Systems (2018)
Grounding Natural Language Instructions to Semantic Goal Representations for Abstraction and Generalization
Dilip Arumugam, Siddharth Karamcheti, Nakul Gopalan, Edward C. Williams, Mina Rhee, Lawson L.S. Wong, Stefanie Tellex
Autonomous Robots (in press; 2018)
Extended journal version of Robotics: Science and Systems (2017) paper
State Abstraction as Compression in Apprenticeship Learning
David Abel, Dilip Arumugam, Kavosh Asadi, Yuu Jinnai, Michael L. Littman, Lawson L.S. Wong
Preprint available on request (by e-mail)
To appear in AAAI Conference on Artificial Intelligence (2019)
(11/16, 12:00pm): Alexandra Meliou (UMass Amherst): Creating a Higher-Quality Data World details)TITLE
Diagnoses and Explanations: Creating a Higher-Quality Data World
The correctness and proper function of data-driven systems and applications relies heavily on the correctness of their data. Low quality data can be costly and disruptive, leading to revenue loss, incorrect conclusions, and misguided policy decisions. Improving data quality is far more than purging datasets of errors; it is critical to improve the processes that produce the data, to collect good data sources for generating the data, and to address the root causes of problems.
Our work is grounded on an important insight: While existing data cleaning techniques can be effective at purging datasets of errors, they disregard the fact that a lot of errors are systemic, inherent to the process that produces the data, and thus will keep occurring unless the problem is corrected at its source. In contrast to traditional data cleaning, we focus on data diagnosis: explaining where and how the errors happen in a data generative process. I will describe our work on Data X-Ray and QFix, two diagnostic frameworks for large-scale extraction systems and relational data systems. I will also provide a brief overview of new results on knowledge augmentation and explanations for dataset differences, building towards a vision for toolsets that assist the exploration of information in a varied, diverse, and highly non-integrated data world.
Alexandra Meliou is an Assistant Professor in the College of Information and Computer Sciences, at the University of Massachusetts, Amherst. Prior to that, she was a Post-Doctoral Research Associate at the University of Washington. Alexandra received her PhD degree from the Electrical Engineering and Computer Sciences Department at the University of California, Berkeley. She has received recognitions for research and teaching, including a CACM Research Highlight, an ACM SIGMOD Research Highlight Award, an ACM SIGSOFT Distinguished Paper Award, an NSF CAREER Award, a Google Faculty Research Award, and a Lilly Fellowship for Teaching Excellence. Her research focuses on data provenance, causality, explanations, data quality, and algorithmic fairness.
(12/7, 11:30am): Fei Chiang (McMaster University): Contextual and Spatio-temporal Data Cleaning details)TITLE
Contextual and Spatio-temporal Data Cleaning
It is becoming increasingly difficult for organizations to reap value from their data due to poor data quality. This is motivated by the observation that real data is rarely error free, containing incomplete, inconsistent, and stale values. This leads to inaccurate, and out-of-date data analysis downstream. Addressing data inconsistency requires not only reconciling differing syntactic references to an entity, but it is often necessary to include domain expertise to correctly interpret the data. For example, understanding that a reference to ‘jaguar’ may be interpreted as an animal or as a vehicle. Secondly, having up-to-date (or current) data is important for timely data analysis. Cleaning stale values goes beyond just relying on timestamps, especially when timestamps may be missing, inaccurate or incomplete.
In this talk, I will present our work towards achieving consistent and up-to-date data. First, I will discuss contextual data cleaning that uses a new class of data integrity constraints that tightly integrate domain semantics from an ontology. Second, we argue that data currency is a relative notion based on individual spatio-temporal update patterns, and these patterns can be learned and predicted. I will present our framework to achieve these two objectives, and provide a brief overview of recent extensions with applications to knowledge fusion.
Fei Chiang is an Assistant Professor in the Department of Computing and Software at McMaster University. She is a Faculty Fellow at the IBM Centre for Advanced Studies, and served as an inaugural Associate Director of the McMaster MacData Institute. She received her M. Math from the University of Waterloo, and B.Sc and PhD degrees from the University of Toronto, all in Computer Science. Her research interests are in data quality, data cleaning, data privacy and text mining. She holds four patents for her work in self-managing database systems. Her work has been featured in the Southern Ontario Smart Computing Impact Report. She is a recipient of the Dean’s Teaching Honour Roll, and a 2018 Ontario Early Researcher Award.
(12/12, 12:00pm): Erkang Zhu (University of Toronto): Get Your Data Together! Algorithms for Managing Data Lakes details)TITLE
Get Your Data Together! Algorithms for Managing Data Lakes
Data lakes (e.g., enterprise data catalogs and Open Data portals) are data dumps if users cannot find and utilize the data in them. In this talk, I present two problems in massive, dynamic data lakes: 1) searching for joinable tables without a precomputed join graph, and 2) joining tables from different sources through auto-generated syntactic transformation on join values. I will also present two algorithmic solutions that can be used for data lakes that are large both in the number of tables (millions) and table sizes. The presented work has been published in SIGMOD and VLDB.
Erkang (Eric) Zhu is a 5th year computer science PhD candidate at University of Toronto. His supervisor is Prof. Renée J. Miller. His research focuses on data discovery, large-scale similarity search, and randomized algorithms (data sketches).
Erkang Zhu, Fatemeh Nargesian, Ken Q. Pu, Renée J. Miller: LSH Ensemble: Internet-Scale Domain Search. PVLDB 9(12): 1185-1196 (2016)
Erkang Zhu, Ken Q. Pu, Fatemeh Nargesian, Renée J. Miller: Interactive Navigation of Open Data Linkages. PVLDB 10(12): 1837-1840 (2017)
- (7/25, noon): Wolfgang Gatterbauer: Oblivious Bounds for the probability of Non-monotone Boolean functions (pdf)
- (7/24, noon): Renée Miller: Open Data Integration (pdf), (pdf)
- (7/13, noon): Arjit Khan: Data management for emerging problems in large networks
- (6/22, noon): Magy Seif El-Nasr: Modeling Player Behaviors through Game Data
- (6/1, 10:30am, 366 WVH): Raymond Wong: Big data analytics on big spatial databases
- (4/27, 10am): Clemens Heitzinger: Computational Bayesian Estimation with Applications in Sensors and Tomography
- (3/28, noon): Ravi Sundaram : no free lunch, succinct data structures, information theory lower bounds
- (3/21, noon): Xiaofeng and Rundong: SIGMOD and WWW practice talks
- (3/14, 10am): Ravi Sundaram : A case for learned index structures (pdf)
- (2/28, noon): Ehsan Elhamifar : Subset Selection and Summarization in Sequential Data (pdf)
- (2/14, noon): Ruiyang Xu: Evaluating Player Skill and Position Difficulty in Sequential Two-Person Games with Game Outcome Prediction demonstrated with the Gamification of an Optimization Problem
- (1/31, noon): Casper Harteveld : Studycrafter / Wolfgang Gatterbauer : Bootstrapping Virtuous Learning Cycles (Youtube)
- (1/19, all day): Northeast Database day 2018 : Come and see our talk and posters!
- (1/18, 3pm, Forsyth #97): Dan Suciu : Rethinking Query Execution on Big Data
- (1/12, noon, 366 WVH): Guoliang Li : Human-in-the-Loop Data Integration
This semester the data lab seminar is combined with in our special topics class (CS 7290: Special topics: Foundations in scalable data management) and takes place every Tuesday 11:45am-1:25pm and Thursday 2:50-4:30pm in Ryder Hall 126 (map).
- (11/16): Stratis Ioannidis : Distributing Frank-Wolfe via Map-Reduce (pdf)
- (11/9): Niccolo Meneghetti : Beta Probabilistic Databases: A Scalable Approach to Belief Updating and Parameter Learning (pdf)
- (10/12): Georgia Koutrika : User analytics for recommender systems
- (9/28): Jon Ullman : Differential privacy and data exploration
- (9/26): Cibele Freire : The complexity of resilience and responsibility for self-join-free conjunctive queries (pdf)