DATA Lab seminar: group meetings & guest speakers
We are currently meeting Wednesdays and/or Fridays at noon in WVH 462 (map) for brown bag lunch. Please subscribe to our DATA Lab talks email list or DATA Lab talks calendar if you think data is the future.
(1/9, 12:00pm): Kathleen Fisher (Tufts): Using a Declarative Description Language to Tame Ad Hoc Data: An Overview of the PADS project
Using a Declarative Description Language to Tame Ad Hoc Data: An Overview of the PADS project
Kathleen Fisher (Tufts University)
The goal of the PADS project (http://www.padsproj.org) is to make it easier for data analysts to extract useful information from ad hoc data files. This talk gives an overview of the project and how it helps bridge the gap between the unmanaged world of ad hoc data and the managed world of typed programming languages and databases. In particular, the talk reviews the design of PADS data description languages, describes the generated parsing tools and discusses the importance of meta-data. It also sketches the formal semantics, discusses useful tools and how they can be generated automatically from PADS descriptions, and describes an inferencing system that can learn useful PADS descriptions from positive examples of the data format.
Kathleen Fisher is a Professor in and the Chair of the Computer Science Department at Tufts. Previously, she was a program manager at DARPA where she started and managed the HACMS and PPAML programs, a Consulting Faculty Member in the Computer Science Department at Stanford University, and a Principal Member of the Technical Staff at AT&T Labs Research. Kathleen is an ACM Fellow. She has served as Program Chair for PLDI, OOPSLA, ICFP, CUFP, and FOOL, and as General Chair for ICFP 2015. She is a former Associate Editor for TOPLAS and a former editor of the Journal of Functional Programming. Kathleen is a past Chair of the ACM Special Interest Group in Programming Languages (SIGPLAN) and past Co-Chair of CRA's Committee on the Status of Women (CRA-W). Kathleen is a recipient of the SIGPLAN Distinguished Service Award. She is Vice Chair of DARPA's ISAT Study Group and a member of the Board of Trustees of Harvey Mudd College.
Kathleen Fisher, David Walker: The PADS project: an overview. ICDT 2011: 11-17
(1/11, 12:00pm): Xiaoyu Liu (MIT): Kyrix: Interactive Visual Data Exploration at Scale
Kyrix: Interactive Visual Data Exploration at Scale
Scalable interactive visual data exploration is crucial in many domains due to increasingly large datasets generated at rapid rates. Details-on-demand provides a useful interaction paradigm for exploring large datasets, where the user starts at an overview, finds regions of interest, zooms in to see detailed views, zooms out and then repeats. This paradigm is the primary user interaction mode of widely used systems such as Google Maps, Aperture Tiles and ForeCache. These earlier systems, however, are highly customized with hardcoded visual representations and optimizations. A more general framework is needed to facilitate the development of visual data exploration systems at scale. We present Kyrix, an end-to-end system for developing scalable details-on-demand data exploration applications. Kyrix provides the developer with a declarative model for easy specification of general visualizations. Behind the scenes, Kyrix utilizes a suite of performance optimization techniques to achieve a response time within 500 ms for various user interactions. We also report results from a performance study which shows that a novel dynamic fetching scheme adopted by Kyrix outperforms tile-based fetching used in traditional systems.
Xiaoyu is a research intern in the MIT data systems group. Before that, she completed her master's degree in the ECE department at Purdue. She is interested in building scalable and reliable data-intensive systems.
(1/23, 12:00pm): Panayiotis Tsaparas (University of Ioannina): Finding Patterns in Temporal and Flow Networks
Finding Patterns in Temporal and Flow Networks
Networks are natural models for complex systems consisting of multiple interconnected entities. Many of these systems are dynamic, with connections appearing and disappearing over time. In the resulting networks, edges are annotated with time stamps, resulting in a graph history consisting of multiple network snapshots over time. We refer to such networks as temporal networks (graphs).
In this talk we consider two knowledge extraction problems on temporal networks. First, we consider the problem of finding lasting dense subgraphs. We provide formal definitions of graph density for a temporal graph, and we formulate the BFF problem that seeks a subset of nodes that are densely connected over time. Furthermore, we also consider the problem of finding a set of graph snapshots in which there are dense subsets of nodes. We study the complexity of the problems and propose exact, approximation and heuristic algorithms. Experiments indicate that our approach can find interesting patterns in collaboration and word co-occurrence networks.
In the second problem we assume that edges are also associated with a value, indicating a flow between the two nodes at a specific time stamp. Such networks appear naturally in practice in the case of money exchange networks (e.g., bitcoin), or traffic networks. We are interested in finding motifs in these networks, that is, small subgraphs that appear more often than random. Our motifs are restricted temporally and with respect to flow: all interactions must happen within a given time-window, and they must involve a minimum amount of flow. We propose efficient algorithms for finding such motifs and experiment with them on real networks.
Konstantinos Semertzidis, Evaggelia Pitoura, Evimaria Terzi, Panayiotis Tsaparas. Finding lasting dense subgraphs. ECML/PKDD Journal Track (DAMI), 2019 (to appear)
Chrysanthi Kosyfaki, Nikos Mamoulis, Evaggelia Pitoura, Panayiotis Tsaparas. Flow motifs in interaction networks. EDBT 2019 (to appear).
Panayiotis Tsaparas completed his undergraduate studies at the Department of Computer Science at University of Crete, Greece in 1995. He continued his graduate studies at University of Toronto, where he received his M.Sc. and Ph.D. degrees under the supervision of Allan Borodin. After graduation, he worked as a post-doctoral fellow at University of Rome, “La Sapienza”, and at University of Helsinki, and as a researcher at Microsoft Research. In 2011 he joined the Department of Computer Science and Engineering at University of Ioannina, where he is now an Associate Professor. His research interests include Social Network Analysis, Algorithmic Data Mining, Web Mining and Information Retrieval.
(1/25, 1:15pm): Fatemeh Nargesian (University of Toronto): Table Union and Navigation
Table Union and Navigation
Preparing data for advanced analytics is prohibitively time-consuming and expensive for all but the best-trained and best-funded engineers. Nevertheless, the success of trained systems often depends on data and generated features more than powerful statistical algorithms. Among the data management challenges that need to be addressed is data enrichment, which is the discovery and integration of meaningful data in data lakes for a data science task. However, there are many challenges to overcome: (1) the sheer size of data, (2) the unique distributions and characteristics of data lakes, and (3) the probabilistic and human-in-the-loop nature of data discovery. In this talk, I discuss two prevalent data discovery scenarios. In the first scenario, the query is a dataset and the data scientist is interested in interactively finding datasets that can be integrated (e.g., unioned) with the query. I will introduce a probabilistic framework for finding and aligning unionable tables with a query table and discuss the need for distribution-aware techniques for data discovery. In the second scenario, search does not start with a query; instead, it is data-driven. I will talk about the data lake organization problem, where the goal is to find an organization (a directed acyclic graph) that allows a user to most efficiently navigate data lakes. I will present a probabilistic navigation model of how users interact with an organization and introduce a scalable structure learning algorithm for optimizing data lake organizations.
Nargesian, Zhu, Pu, Miller: Table Union Search on Open Data, PVLDB 2018.
Fatemeh Nargesian is a PhD student in the Data Curation Group of the Department of Computer Science at University of Toronto. Her research focuses on optimizing and automating data preparation for end-to-end data science, encompassing dataset discovery in data lakes and enriching datasets with new features. While at University of Toronto, Fatemeh was a joint research intern at IBM Research-NY. Prior to University of Toronto, Fatemeh worked at the Clinical Informatics research group at McGill University on clinical data management, and received M.Sc. degrees in computer science and artificial intelligence at University of Toronto and Sharif University of Technology.
(1/28, 12:00pm): Stefano Ceri (Politecnico di Milano): WVH 366: Data-Driven Genomic Computing: Making Sense of the Signals from the Genome
Data-Driven Genomic Computing: Making Sense of the Signals from the Genome
Genomic computing is a new science focused on understanding the functioning of the genome, as a premise to fundamental discoveries in biology and medicine. Next Generation Sequencing (NGS) allows the production of the entire human genome sequence at a cost of about US $1,000; many algorithms exist for the extraction of genome features, or "signals", including peaks (enriched regions), variants, or gene expression (intensity of transcription activity). What is missing is a system supporting data integration and exploration, giving a “biological meaning” to all the available information; such a system can be used, e.g., for better understanding cancer or how the environment influences cancer development.
The GeCo Project (Data-Driven Genomic Computing, ERC Advanced Grant, 2016-2021) has the objective of revisiting genomic computing through the lens of basic data management, through models, languages, and instruments, focusing on genomic data integration. Starting from an abstract model, we developed a system that can be used to query processed data produced by several large Genomic Consortia, including Encode and TCGA; the system employs internally the Spark engine, and prototypes can already be accessed from Polimi, from Cineca (Italian supercomputing center) and from the Broad Institute in Cambridge. During the five years of the ERC project, the system will be enriched with data analysis tools and environments and will be made increasingly efficient. Among the objectives of the project is the creation of an “open source” repository of public data, available to biological and clinical research through queries, web services and search interfaces.
Stefano Ceri is professor of Database Systems at the Dipartimento di Elettronica, Informazione e Bioingegneria (DEIB) of Politecnico di Milano. His research work covers four decades (1978-2018) and has been generally concerned with extending database technologies in order to incorporate new features: distribution, object-orientation, rules, streaming data; with the advent of the Web, his research has been targeted towards the engineering of Web-based applications and to search systems. More recently he turned to genomic computing. He authored over 350 publications (H-index 75) and authored or edited 15 books in English. He is the recipient of two ERC Advanced Grants: "Search Computing (SeCo)" (2008-2013), focused upon the rank-aware integration of search engines in order to support multi-domain queries and “Data-Centered Genomic Computing (GeCo)” (2016-2021), focused upon new abstractions for querying and integrating genomic datasets. He is the recipient of the ACM-SIGMOD "Edward T. Codd Innovation Award" (New York, June 26, 2013), an ACM Fellow and a member of Academia Europaea.
(1/28, 12:30pm): Marco Brambilla (Politecnico di Milano): Extraction of Evolving Knowledge from Social Media
Extraction of Evolving Knowledge from Social Media
Knowledge in the world continuously evolves. Ontologies that aim at formalizing this knowledge are largely incomplete, especially regarding data belonging to the so-called long tail. On the other side, informal sources such as social media are typically very up to date with respect to facts, events and relations between real-world entities. We propose a method for discovering emerging knowledge by extracting it from social content. Once initialized by domain experts, the method is capable of finding relevant entities by means of a mixed syntactic-semantic method. The method uses seeds, i.e. prototypes of emerging entities provided by experts, for generating candidates; then, it associates candidates to feature vectors built by using terms occurring in their social content and ranks the candidates by using their distance from the centroid of seeds, returning the top candidates. Our method can run iteratively, using the results as new seeds. The talk will describe the different extraction techniques, the advantages obtained by combining them, and the results of the experiments performed with the different methods.
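The ranking step described in the abstract (feature vectors built from terms, candidates ordered by distance from the centroid of the seeds) can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration: the bag-of-words features, the Euclidean distance, the function names and the toy data are not taken from the papers, which may use different features and metrics.

```python
import math
from collections import Counter


def vectorize(text, vocabulary):
    """Bag-of-words count vector over a fixed vocabulary (assumed feature choice)."""
    counts = Counter(text.lower().split())
    return [float(counts[term]) for term in vocabulary]


def centroid(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def euclidean(a, b):
    """Euclidean distance (assumed metric; the abstract only says 'distance')."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def rank_candidates(seed_texts, candidate_texts, top_k=2):
    """Return the names of the top_k candidates closest to the seed centroid."""
    all_texts = list(seed_texts.values()) + list(candidate_texts.values())
    vocabulary = sorted({term for text in all_texts for term in text.lower().split()})
    seed_centroid = centroid([vectorize(t, vocabulary) for t in seed_texts.values()])
    scored = sorted(
        (euclidean(vectorize(text, vocabulary), seed_centroid), name)
        for name, text in candidate_texts.items()
    )
    return [name for _, name in scored[:top_k]]


# Hypothetical toy data: two expert-provided seeds and three candidates.
seeds = {
    "seed_a": "fashion runway collection milan",
    "seed_b": "fashion collection couture milan",
}
candidates = {
    "cand_x": "fashion collection milan show",
    "cand_y": "football match goal stadium",
    "cand_z": "runway couture fashion",
}
top = rank_candidates(seeds, candidates, top_k=2)
print(top)
```

The iterative variant mentioned in the abstract would, under the same assumptions, add the returned candidates to the seed set and call `rank_candidates` again.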
Extracting Emerging Knowledge from Social Media. WWW 2017 https://dl.acm.org/citation.cfm?id=3052697
Iterative Knowledge Extraction from Social Networks. WWW Comp. 2018 https://dl.acm.org/citation.cfm?id=3191578
Marco Brambilla is an associate professor at Politecnico di Milano. His research interests include data science, domain specific modeling languages and design patterns, crowdsourcing, social media monitoring, and big data analysis. He has been a visiting researcher at CISCO, San José, and University of California, San Diego. He has been a visiting professor at Dauphine University, Paris. He is co-founder of the startups Fluxedo, focusing on social media analysis and social engagement, and WebRatio, devoted to software modeling tools for Web, Mobile and Business Process based software applications. He is the author of various international books and research articles in journals and conferences, with over 200 papers. He was awarded various best paper prizes and gave keynotes and speeches at many conferences and organisations. He runs research projects on data science and industrial projects on data-driven innovation and big data. He is the main author of the OMG standard IFML.
(1/30, 12:00pm): Evimaria Terzi (BU): Active Matrix Completion Problems
Active Matrix Completion Problems
In many applications, e.g., recommender systems and traffic monitoring, the data comes in the form of a matrix that is only partially observed. A fundamental data-analysis task for these datasets is matrix completion, where the goal is to accurately infer the entries missing from the matrix. In this talk we will consider matrices that satisfy one of two very common assumptions: low rank and positive definite. In both cases we will consider the active version of the matrix-completion problem, which asks: given access to an oracle that can obtain some (small) number of entries of the original matrix, which entries should we query so that we achieve the least completion error? We will demonstrate how this question can be transformed into elegant combinatorial problems and we will discuss algorithms for solving them. We will also argue that our combinatorial formulations are very different from existing work in the matrix-completion literature.
TALK BASED ON TWO PAPERS:
Charalampos Mavroforakis, Dóra Erdös, Mark Crovella, Evimaria Terzi: Active Positive-Definite Matrix Completion. SDM 2017.
Natali Ruchansky, Mark Crovella, Evimaria Terzi: Matrix Completion with Queries. KDD 2015.
Evimaria Terzi is an associate professor at the Computer Science Department at Boston University, where she also serves as an Associate Chair. Before joining BU in 2009, she was a research scientist at IBM Almaden Research Center. Evimaria received her Ph.D. from University of Helsinki, Finland and her MSc from Purdue University. Evimaria is a recipient of the Microsoft Faculty Fellowship (2010), an NSF CAREER award, and multiple NSF awards. Her research interests span a wide range of data-mining topics including algorithmic problems arising in recommendation systems, online social networks and social media.
(2/1, 12:00pm): Huy Nguyen (Northeastern): Submodular Maximization with Nearly-optimal Approximation and Adaptivity
Submodular Maximization with Nearly-optimal Approximation and Adaptivity
In this talk, we present recent progress on understanding the tradeoff between the approximation guarantee and adaptivity for submodular maximization. The adaptivity of an algorithm is the number of sequential rounds of queries it makes to the evaluation oracle of the function, where in every round the algorithm is allowed to make polynomially-many parallel queries. Adaptivity is an important consideration in settings where the objective function is estimated using samples and in applications where adaptivity is the main running time bottleneck. We present nearly-optimal algorithms for submodular maximization subject to a variety of constraints.
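To make the notion of adaptivity concrete, here is a minimal sketch of the classic sequential greedy algorithm on a coverage function (a standard submodular function) under a cardinality constraint. This is the textbook baseline, not one of the algorithms from the talk; the function name and the toy sets are hypothetical. Each loop iteration is one adaptive round: all marginal-gain queries in a round depend only on earlier rounds and could be issued in parallel, so greedy needs up to k rounds to pick k elements.

```python
def greedy_max_coverage(sets, k):
    """Pick up to k named sets greedily to maximize the size of their union.

    Returns (chosen names, covered elements, number of adaptive rounds).
    """
    chosen = []
    covered = set()
    rounds = 0
    for _ in range(k):
        rounds += 1
        # One adaptive round: every query below uses only `covered` from
        # previous rounds, so the queries are independent of each other.
        gains = {
            name: len(items - covered)
            for name, items in sets.items()
            if name not in chosen
        }
        if not gains:
            break
        # Deterministic tie-break: alphabetically first among the best.
        best = max(sorted(gains), key=gains.get)
        if gains[best] == 0:
            break
        chosen.append(best)
        covered |= sets[best]
    return chosen, covered, rounds


# Hypothetical toy instance.
toy_sets = {"a": {1, 2, 3}, "b": {3, 4}, "c": {4, 5, 6, 7}, "d": {1, 7}}
chosen, covered, rounds = greedy_max_coverage(toy_sets, k=2)
print(chosen, sorted(covered), rounds)
```

The low-adaptivity setting described in the abstract asks for comparable approximation guarantees with far fewer sequential rounds than this baseline uses.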
Huy Lê Nguyen is an Assistant Professor of Computer Science in the College of Computer and Information Science at Northeastern University. Prior to joining Northeastern, he was a Research Assistant Professor at the Toyota Technological Institute in Chicago and before that, a Google Research Fellow at the Simons Institute at University of California, Berkeley. He received his PhD in Computer Science from Princeton University. Professor Nguyen is broadly interested in the design and analysis of algorithms, with an emphasis on algorithmic techniques for massive data sets and machine learning.
(2/13, 12:00pm): Sarah Ostadabbas (Northeastern): Physics-based Simulation to Bootstrap Learning in Small Data Domains
Physics-based Simulation to Bootstrap Learning in Small Data Domains
Deep learning (DL) regularly obliterates records for regression and classification tasks that have previously seen only incremental accuracy improvements. Furthermore, the training process is much more automated than classical techniques, no longer requiring a huge investment in feature selection and dataset pruning. However, this performance comes at a large data cost, frequently requiring upwards of 10^9 data/label pairs. Our digital economy has provided many problems for which such data exists or can be obtained cheaply relative to the benefits. There are many other fields that would get significant benefit from DL, but where data collection or labelling is expensive. This is very common for medical and military applications where data collection and/or labeling is expensive, individualized, and protected by very strong privacy or classification laws. Many applications will benefit from a learning framework with deep structure that works with limited labeled training samples, integrates domain-knowledge into the model, and maximizes the generalization of learning across domains. In this talk, I introduce our proposed 3-step framework that enables training of accurate and robust learning models under data limitation constraints based on the use of simulation as its generative models.
This framework includes: (i) employment of physics-based computational models referred to as simulation; (ii) design and analysis of unsupervised domain adaptation techniques to close the gap between the simulated and real-world data distributions through a low-dimensional subspace transformation; (iii) development of learning techniques in the projected subspace to train an initial weak labeler; (iv) combined use of the weak labeler and a generative-adversarial framework to refine the simulated datasets by employing a set of unlabeled real-world data in order to train a strong labeler; and (v) development and analysis of active learning techniques to select the most informative datasets to refine and adapt the strong labeler into a novel case with small data in the target application.
Inner Space Preserving Generative Pose Machine
Background Subtraction via Fast Robust Matrix Completion
Professor Ostadabbas is an assistant professor in the Electrical and Computer Engineering Department of Northeastern University (NEU), Boston, Massachusetts, USA. Professor Ostadabbas joined NEU in 2016 from Georgia Tech, where she was a post-doctoral researcher following completion of her PhD at the University of Texas at Dallas in 2014. At NEU, Professor Ostadabbas is the director of the Augmented Cognition Laboratory (ACLab) with the goal of enhancing human information-processing capabilities through the design of adaptive interfaces via physical, physiological, and cognitive state estimation. These interfaces are based on rigorous models adaptively parameterized using machine learning and computer vision algorithms. In particular, she has been integrating domain knowledge with machine learning by using physics-based simulation as generative models for bootstrapping deep learning recognizers. Professor Ostadabbas is the co-author of more than 50 peer-reviewed journal and conference articles and her research has been awarded by the National Science Foundation (NSF), Mathworks, Amazon AWS, and NVIDIA. She is the co-organizer of the Multimodal Data Fusion (MMDF2018) workshop, an NSF PI mini-workshop on Deep Learning in Small Data, and will be the program chair of the upcoming Machine Learning in Signal Processing (MLSP2019) conference. Prof. Ostadabbas is an associate editor of the IEEE Transactions on Biomedical Circuits and Systems, on the Editorial Board of the IEEE Sensors Letters and Digital Biomarkers Journal, and has been serving in several signal processing and machine learning conferences as a technical chair or session chair.
(2/15, 12:00pm): Olga Papaemmanouil (Brandeis University): Deep Learning meets Query Optimization
Deep Learning meets Query Optimization
Query optimization remains one of the most important and well studied problems in database systems. However, traditional query optimizers are complex, heuristically-driven systems that do not learn from past experiences: they plan the execution of a query, but are ignorant of the actual performance of the picked plan. Because of the lack of feedback, a query optimizer may select the same bad query plan repeatedly, never learning from its previous good or bad choices.
In this talk, I will argue that a new type of query optimizer, one that integrates deep learning with query optimization, can drastically improve on the state-of-the-art. Towards this direction, I will discuss ReJOIN, a proof-of-concept join enumerator that relies on deep reinforcement learning. ReJOIN leverages prior experience, and learns how to optimize future queries more effectively (i.e., discovers better query plans) and efficiently (i.e., spending less time on optimization) compared with traditional optimizers. I will discuss potential challenges for future research and describe deep learning approaches that can lead the way to end-to-end learning-based query optimizers.
Olga Papaemmanouil is an Associate Professor in the Department of Computer Science at Brandeis University. She received her undergraduate degree in Computer Science and Informatics at the University of Patras, Greece in 1999. In 2001, she received her Sc.M. in Information Systems at the University of Economics and Business, Athens, Greece. She then joined the Computer Science Department at Brown University, where she completed her Ph.D. in Computer Science in 2008. Her research interests are in databases and distributed data management. She is the recipient of an NSF Career Award (2013) and a Paris Kanellakis Fellowship from Brown University (2002).
(3/1, 12:00pm): Laura Di Rocco (Northeastern)
(3/13, 12:00pm): TENTATIVE: Luis Carvalho (BU)
(3/29, 12:00pm): Tingjian Ge (UMass Lowell)
(4/12, 12:00pm): Nga Tran (Vertica)
(9/13, 10:40am): WVH 366: Anshumali Shrivastava (Rice University): Hashing Algorithms for Extreme Scale Machine Learning
Hashing Algorithms for Extreme Scale Machine Learning.
In this talk, I will discuss some of my recent and surprising findings on the use of hashing algorithms for large-scale estimations. Locality Sensitive Hashing (LSH) is a hugely popular algorithm for sub-linear near neighbor search. However, it turns out that fundamentally LSH is a constant time (amortized) adaptive sampler from which efficient near-neighbor search is one of the many possibilities. Our observation adds another feather in the cap for LSH. LSH offers a unique capability to do smart sampling and statistical estimations at the cost of a few hash lookups. Our observation bridges data structures (probabilistic hash tables) with efficient unbiased statistical estimations. I will demonstrate how this dynamic and efficient sampling breaks the computational barriers in adaptive estimations where, for the first time, it is possible that we pay roughly the cost of uniform sampling but get the benefits of adaptive sampling. We will demonstrate the power of one simple idea for three favorite problems: 1) Partition function estimation for large NLP models such as word2vec, 2) Adaptive Gradient Estimations for efficient SGD and 3) Sub-Linear Deep Learning with Huge Parameter Space.
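As background for readers new to LSH, the following is a minimal sketch of one standard LSH family, sign random projections (SimHash). The abstract does not say which hash family the talk uses, so this choice, along with all names and the toy vector, is an assumption for illustration only. Each bit records the side of a random hyperplane on which the vector falls; for two vectors at angle θ, a single bit agrees with probability 1 − θ/π, so similar vectors collide more often.

```python
import random


def make_hyperplanes(dim, num_bits, seed=0):
    """Draw num_bits random Gaussian hyperplane normals in `dim` dimensions."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_bits)]


def signature(vector, hyperplanes):
    """One bit per hyperplane: 1 if the dot product is non-negative, else 0."""
    return tuple(
        1 if sum(w * x for w, x in zip(plane, vector)) >= 0 else 0
        for plane in hyperplanes
    )


def hamming(a, b):
    """Number of positions where two signatures differ."""
    return sum(x != y for x, y in zip(a, b))


planes = make_hyperplanes(dim=4, num_bits=16, seed=0)
x = [1.0, 2.0, 0.5, -1.0]
sig_x = signature(x, planes)
sig_scaled = signature([2.0 * v for v in x], planes)   # same direction as x
sig_opposite = signature([-v for v in x], planes)      # opposite direction

print(hamming(sig_x, sig_scaled), hamming(sig_x, sig_opposite))
```

Vectors pointing the same way share every bit, while opposite vectors differ in every bit; using signature bits as hash-table keys is what turns this into a bucketed index from which near neighbors, or samples biased toward similar items, can be retrieved.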
In the end, if time permits, we will switch to memory cost and show a simple hashing algorithm that can shrink memory requirements associated with classification problems exponentially! Using our algorithms, we can train 100,000 classes with 400,000 features on a single Titan X while needing 5% or less of the memory required to store all the weights. When running a simple logistic regression on this data, a model size of 320GB is unavoidable.
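The 320GB figure can be checked with a short calculation. The sketch below assumes one weight per (class, feature) pair stored as an 8-byte float and decimal gigabytes; the abstract does not state the precision, so the 8-byte size is an assumption that happens to reproduce the quoted number.

```python
num_classes = 100_000
num_features = 400_000
bytes_per_weight = 8  # assumption: 64-bit floats; not stated in the abstract

# Dense logistic regression keeps one weight per (class, feature) pair.
dense_bytes = num_classes * num_features * bytes_per_weight
dense_gb = dense_bytes / 10**9  # decimal gigabytes

# "5% or less" of the dense model, as claimed in the abstract.
compressed_gb_upper = 0.05 * dense_gb

print(f"dense model: {dense_gb:.0f} GB")
print(f"5% of dense model: {compressed_gb_upper:.0f} GB")
```

Under these assumptions the dense model needs 320 GB, and 5% of that is 16 GB.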
Anshumali Shrivastava is an assistant professor in the computer science department at Rice University. His broad research interests include randomized algorithms for large-scale machine learning. He is a recipient of National Science Foundation (NSF) CAREER Award, a Young Investigator Award from Air Force Office of Scientific Research (AFOSR), and machine learning research award from Amazon. His research on hashing inner products has won Best Paper Award at NIPS 2014 while his work on representing graphs got the Best Paper Award at IEEE/ACM ASONAM 2014. Anshumali got his PhD in 2015 from Cornell University.
(9/26, 12:00pm): Spyros Blanas (Ohio State University): Scaling database systems to high-performance computers
Scaling database systems to high-performance computers
We are witnessing the increasing use of warehouse-scale computers to analyze massive datasets quickly. This poses two challenges for database systems. The first challenge is interoperability with established analytics libraries and tools. Massive datasets often consist of images (arrays) in file formats like FITS and HDF5. To analyze such datasets users turn to domain-specific libraries and deep learning frameworks, and thus write code that directly manipulates files. We will first present ArrayBridge, an open-source I/O library that allows SciDB, TensorFlow and HDF5-based programs to co-exist in a pipeline without converting between file formats. With ArrayBridge, users benefit from the optimizations of a database system without sacrificing the ability to directly manipulate data through the existing HDF5 API when they want to.
The second challenge is scalability, as warehouse-scale computers expose communication bottlenecks in foundational data processing operations. This talk will focus on data shuffling and parallel aggregation. We will first present an RDMA-aware data shuffling algorithm that transmits data up to 4X faster than MPI. This is achieved by switching to a connectionless, datagram-based network transport layer that scales better but requires flow control in software. We will then present a parallel aggregation algorithm for high-cardinality aggregation that carefully schedules data transmissions to avoid unscaleable all-to-all communication. The algorithm leverages similarity to transmit less data over congested network links. We will conclude by highlighting additional challenges that need to be overcome to scale database systems to massive computers.
Spyros Blanas is an assistant professor in the Department of Computer Science and Engineering at The Ohio State University. His research interest is high performance database systems, and his current goal is to build a database system for high-end computing facilities. He has received the IEEE TCDE Rising Star award and a Google Research Faculty award. He completed his Ph.D. at the University of Wisconsin–Madison where part of his Ph.D. dissertation was commercialized in Microsoft SQL Server as the Hekaton in-memory transaction processing engine.
(9/28, 12:00pm): Stratos Idreos (Harvard): The periodic table of data structures
The periodic table of data structures.
Data structures are critical in any data-driven scenario, and they define the behavior of modern data systems and data-driven algorithms. However, they are notoriously hard to design due to a massive design space and the dependence of performance on workload and hardware which evolve continuously.
What if we knew how many and which data structures are possible to design? What if we could compute the expected performance of a data structure design on a given workload and hardware without having to implement it and without even having access to the target machine? We will discuss our quest for 1) the first principles of data structures, 2) design continuums that make it possible to automate design, and 3) self-designing systems that can morph between what we now consider fundamentally different structures. We will draw examples from the NoSQL key-value store design space and discuss how to accelerate them and balance space-time tradeoffs.
Stratos Idreos is an assistant professor of Computer Science at Harvard University where he leads DASlab, the Data Systems Laboratory. Stratos was awarded the 2011 ACM SIGMOD Jim Gray Doctoral Dissertation award and the 2011 ERCIM Cor Baayen award. He is also a recipient of an IBM zEnterprise System Recognition Award, a VLDB Challenges and Visions best paper award and an NSF Career award. In 2015 he was awarded the IEEE TCDE Rising Star Award from the IEEE Technical Committee on Data Engineering for his work on adaptive data systems.
Prior recorded talk at UW:
(10/4, 4:30pm): Angelika Kimmig (Cardiff University): A Collective, Probabilistic Approach to Schema Mapping
A Collective, Probabilistic Approach to Schema Mapping
We propose a probabilistic approach to the problem of schema mapping. Our approach is declarative, scalable, and extensible. It builds upon recent results in both schema mapping and probabilistic reasoning and contributes novel techniques in both fields. We introduce the problem of mapping selection, that is, choosing the best mapping from a space of potential mappings, given both metadata constraints and a data example. As selection has to reason holistically about the inputs and the dependencies between the chosen mappings, we define a new schema mapping optimization problem which captures interactions between mappings. We then introduce Collective Mapping Discovery (CMD), our solution to this problem using state-of-the-art probabilistic reasoning techniques, which allows for inconsistencies and incompleteness. Using hundreds of realistic integration scenarios, we demonstrate that the accuracy of CMD is more than 33% above that of metadata-only approaches already for small data examples, and that CMD routinely finds perfect mappings even if a quarter of the data is inconsistent.
Angelika Kimmig is a Lecturer at Cardiff University, UK. She obtained her Ph.D. from KU Leuven, Belgium, and was a postdoctoral fellow at KU Leuven and the University of Maryland, College Park. Her research interests include symbolic AI, reasoning under uncertainty, machine learning, logic programming, and especially combinations thereof such as probabilistic programming and statistical relational learning. She is a key contributor to the probabilistic logic programming language ProbLog.
PAPER from ICDE 2017:
(10/26, 12:00pm): Mania Abdi (Northeastern): D3N: A multi-layer cache for improving big-data applications’ performance in data centers with imbalanced networks
D3N: A multi-layer cache for improving big-data applications’ performance in data centers with imbalanced networks
Caching methods for improving the performance of datalakes for throughput-bound big-data jobs assume unlimited bandwidth across the datacenter. However, most enterprise and academic datacenters grow organically and have heavily oversubscribed networks between different computer clusters. This paper describes D3N, an architecture that caches data on the access side of potential network bottlenecks and uses a multilayer approach. In the case of a two-layer cache, reuse distances are dynamically computed along with cache miss costs to determine the partitioning of the cache into L1 (used for local accesses) and L2 (used for remote accesses, but within the same access side) layers. We have implemented a prototype of D3N by modifying Ceph’s RADOS gateway. Micro and macro evaluations of the prototype demonstrate that the implementation is performant enough to saturate the (40 Gbit/s) NICs and (5 GB/s read) SSD of our caching server and deliver substantial storage bandwidth improvements for real workloads. Numerical models show that a multi-level, dynamic cache can have substantial advantages over today’s single-level caches when bandwidth is constrained.
Mania Abdi is a PhD student at the Northeastern University Solid State Storage research group. Prior to that, she was a software engineer. She received her MSc. in Computer Engineering at the Sharif University of Technology and B.Eng. in Computer Engineering at the Amirkabir University of Technology. Mania is a computer systems researcher with a storage focus and has worked on a broad set of topics, including distributed storage, caching, data center debugging, and end-to-end tracing.
(11/7, 12:00pm): Laurel Orr (University of Washington): Probabilistic Database Summarization for Interactive Data Exploration
Probabilistic Database Summarization for Interactive Data Exploration
A fundamental assumption of traditional DBMSs is that the database contains all information necessary to answer a query; i.e., the database contains the entire universe of data. Many data scientists, however, do not have access to the universe of data and instead rely on samples to answer queries. These data scientists need to use tools outside of the database or alter their queries to correctly reweight and debias their samples to get a more accurate query answer. This talk presents preliminary research into building the first open world database system (OWDB) that inherently assumes relations are samples from some universe, even if the sampling mechanism is unknown. We will discuss two different approaches for building an OWDB: probabilistic modeling of the universe and sample reweighting. We then present EntropyDB, a system that takes the former approach to summarize a database, and lastly we discuss ongoing research into the latter approach using Bayesian Networks. We conclude by discussing the open challenges in developing an OWDB.
Laurel Orr is a 6th year PhD student in the Database Group at the Paul G Allen School for Computer Science and Engineering. Her research interests are data summarization, approximate query processing, data exploration, and the general development of tools and techniques to aid data scientists in their data investigation and analysis pipeline. Her current research goal is to develop a prototype open world database system that inherently treats relations as samples drawn from some larger universe of data, even when the sampling mechanism is unknown. She was a 2018 NCWIT Collegiate Award Honorable Mention and was awarded an NSF Graduate Research Fellowship in 2015.
(11/9, 12:00pm): Raul Castro Fernandez (MIT): Aurum: A Data Discovery System
Aurum: A Data Discovery System
Organizations store data in hundreds of different data sources, including relational databases, files, and large data lake repositories. These data sources contain valuable information and insights that can be beneficial to multiple aspects of modern data-driven organizations. However, as more data is produced, our ability to use it drops dramatically, as no single person knows about all the existing data sources. One big challenge is to discover the data sources that are relevant to answer a particular question. Aurum is a data discovery system to answer "discovery queries" on large volumes of data. In this talk, I'll motivate the data discovery problem with use cases from different industries. I will describe Aurum's design and I will talk a bit about a new research project that aims to discover data beyond tables.
In my research, I build high-performance and scalable systems to discover, prepare and process data. I'm a postdoctoral researcher at MIT, working with Sam Madden and Mike Stonebraker. Before, I completed my PhD at Imperial College London with Peter Pietzuch.
(11/14, 12:00pm): Lawson Wong (Northeastern): Abstraction in robotics
Abstraction in robotics
Robotics is a big data problem. To make sense of the physical world, perform tasks well, and generalize across environments, robots need to represent and understand the world at the "correct" level of abstraction. What "correct" should mean remains to be seen.
In this talk, I will describe two lines of work that attempt to answer this question from very different perspectives. I will first discuss work on grounding natural language instructions to robot behavior, where we demonstrate that having the right representations can enable human-robot communication. This is an important problem for robotics, since we envision users using natural language to instruct robots to perform a wide variety of tasks. In the second half, I will discuss recent preliminary work on the theoretical foundations of state abstraction in reinforcement learning, a common framework used in robot learning problems. In particular, we view state abstraction as data compression, and apply results in information theory (rate-distortion theory) to the reinforcement learning setting.
Time permitting, I will describe some extensions to the above work, as well as other abstraction-related problems, that I envision my group will pursue at Northeastern.
Lawson L.S. Wong is an assistant professor in the College of Computer and Information Science at Northeastern University. His research focuses on learning, representing, and estimating knowledge about the world that an autonomous robot may find useful. Prior to Northeastern, Lawson was a postdoctoral fellow at Brown University. He completed his PhD at the Massachusetts Institute of Technology. He has received a Siebel Scholarship, AAAI Robotics Student Fellowship, and Croucher Foundation Fellowship for Postdoctoral Research.
Sequence-to-Sequence Language Grounding of Non-Markovian Task Specifications
Nakul Gopalan, Dilip Arumugam, Lawson L.S. Wong, Stefanie Tellex
Robotics: Science and Systems (2018)
Grounding Natural Language Instructions to Semantic Goal Representations for Abstraction and Generalization
Dilip Arumugam, Siddharth Karamcheti, Nakul Gopalan, Edward C. Williams, Mina Rhee, Lawson L.S. Wong, Stefanie Tellex
Autonomous Robots (in press; 2018)
Extended journal version of Robotics: Science and Systems (2017) paper
State Abstraction as Compression in Apprenticeship Learning
David Abel, Dilip Arumugam, Kavosh Asadi, Yuu Jinnai, Michael L. Littman, Lawson L.S. Wong
Preprint available on request (by e-mail)
To appear in AAAI Conference on Artificial Intelligence (2019)
(11/16, 12:00pm): Alexandra Meliou (UMass Amherst): Creating a Higher-Quality Data World
Diagnoses and Explanations: Creating a Higher-Quality Data World
The correctness and proper function of data-driven systems and applications rely heavily on the correctness of their data. Low quality data can be costly and disruptive, leading to revenue loss, incorrect conclusions, and misguided policy decisions. Improving data quality is far more than purging datasets of errors; it is critical to improve the processes that produce the data, to collect good data sources for generating the data, and to address the root causes of problems.
Our work is grounded on an important insight: While existing data cleaning techniques can be effective at purging datasets of errors, they disregard the fact that a lot of errors are systemic, inherent to the process that produces the data, and thus will keep occurring unless the problem is corrected at its source. In contrast to traditional data cleaning, we focus on data diagnosis: explaining where and how the errors happen in a data generative process. I will describe our work on Data X-Ray and QFix, two diagnostic frameworks for large-scale extraction systems and relational data systems. I will also provide a brief overview of new results on knowledge augmentation and explanations for dataset differences, building towards a vision for toolsets that assist the exploration of information in a varied, diverse, and highly non-integrated data world.
Alexandra Meliou is an Assistant Professor in the College of Information and Computer Sciences, at the University of Massachusetts, Amherst. Prior to that, she was a Post-Doctoral Research Associate at the University of Washington. Alexandra received her PhD degree from the Electrical Engineering and Computer Sciences Department at the University of California, Berkeley. She has received recognitions for research and teaching, including a CACM Research Highlight, an ACM SIGMOD Research Highlight Award, an ACM SIGSOFT Distinguished Paper Award, an NSF CAREER Award, a Google Faculty Research Award, and a Lilly Fellowship for Teaching Excellence. Her research focuses on data provenance, causality, explanations, data quality, and algorithmic fairness.
(12/7, 11:30am): Fei Chiang (McMaster University): Contextual and Spatio-temporal Data Cleaning
Contextual and Spatio-temporal Data Cleaning
It is becoming increasingly difficult for organizations to reap value from their data due to poor data quality. This is motivated by the observation that real data is rarely error free, containing incomplete, inconsistent, and stale values. This leads to inaccurate and out-of-date data analysis downstream. Addressing data inconsistency requires not only reconciling differing syntactic references to an entity, but often also domain expertise to correctly interpret the data. For example, a reference to ‘jaguar’ may be interpreted as an animal or as a vehicle. Secondly, having up-to-date (or current) data is important for timely data analysis. Cleaning stale values goes beyond just relying on timestamps, especially when timestamps may be missing, inaccurate or incomplete.
In this talk, I will present our work towards achieving consistent and up-to-date data. First, I will discuss contextual data cleaning that uses a new class of data integrity constraints that tightly integrate domain semantics from an ontology. Second, we argue that data currency is a relative notion based on individual spatio-temporal update patterns, and these patterns can be learned and predicted. I will present our framework to achieve these two objectives, and provide a brief overview of recent extensions with applications to knowledge fusion.
Fei Chiang is an Assistant Professor in the Department of Computing and Software at McMaster University. She is a Faculty Fellow at the IBM Centre for Advanced Studies, and served as an inaugural Associate Director of the McMaster MacData Institute. She received her M. Math from the University of Waterloo, and B.Sc and PhD degrees from the University of Toronto, all in Computer Science. Her research interests are in data quality, data cleaning, data privacy and text mining. She holds four patents for her work in self-managing database systems. Her work has been featured in the Southern Ontario Smart Computing Impact Report. She is a recipient of the Dean’s Teaching Honour Roll, and a 2018 Ontario Early Researcher Award.
(12/12, 12:00pm): Erkang Zhu (University of Toronto): Get Your Data Together! Algorithms for Managing Data Lakes
Get Your Data Together! Algorithms for Managing Data Lakes
Data lakes (e.g., enterprise data catalogs and Open Data portals) are data dumps if users cannot find and utilize the data in them. In this talk, I present two problems in massive, dynamic data lakes: 1) searching for joinable tables without a precomputed join graph, and 2) joining tables from different sources through auto-generated syntactic transformation on join values. I will also present two algorithmic solutions that can be used for data lakes that are large both in the number of tables (millions) and table sizes. The presented work has been published in SIGMOD and VLDB.
Erkang (Eric) Zhu is a 5th year computer science PhD candidate at the University of Toronto. His supervisor is Prof. Renée J. Miller. His research focuses on data discovery, large-scale similarity search, and randomized algorithms (data sketches).
Erkang Zhu, Fatemeh Nargesian, Ken Q. Pu, Renée J. Miller: LSH Ensemble: Internet-Scale Domain Search. PVLDB 9(12): 1185-1196 (2016)
Erkang Zhu, Ken Q. Pu, Fatemeh Nargesian, Renée J. Miller: Interactive Navigation of Open Data Linkages. PVLDB 10(12): 1837-1840 (2017)
- (7/25, noon): Wolfgang Gatterbauer: Oblivious Bounds for the probability of Non-monotone Boolean functions (pdf)
- (7/24, noon): Renée Miller: Open Data Integration (pdf), (pdf)
- (7/13, noon): Arijit Khan: Data management for emerging problems in large networks
- (6/22, noon): Magy Seif El-Nasr: Modeling Player Behaviors through Game Data
- (6/1, 10:30am, 366 WVH): Raymond Wong: Big data analytics on big spatial databases
- (4/27, 10am): Clemens Heitzinger: Computational Bayesian Estimation with Applications in Sensors and Tomography
- (3/28, noon): Ravi Sundaram : no free lunch, succinct data structures, information theory lower bounds
- (3/21, noon): Xiaofeng and Rundong: SIGMOD and WWW practice talks
- (3/14, 10am): Ravi Sundaram : A case for learned index structures (pdf)
- (2/28, noon): Ehsan Elhamifar : Subset Selection and Summarization in Sequential Data (pdf)
- (2/14, noon): Ruiyang Xu: Evaluating Player Skill and Position Difficulty in Sequential Two-Person Games with Game Outcome Prediction demonstrated with the Gamification of an Optimization Problem
- (1/31, noon): Casper Harteveld : Studycrafter / Wolfgang Gatterbauer : Bootstrapping Virtuous Learning Cycles (Youtube)
- (1/19, all day): Northeast Database day 2018 : Come and see our talk and posters!
- (1/18, 3pm, Forsyth #97): Dan Suciu : Rethinking Query Execution on Big Data
- (1/12, noon, 366 WVH): Guoliang Li : Human-in-the-Loop Data Integration
This semester the data lab seminar is combined with our special topics class (CS 7290: Special topics: Foundations in scalable data management) and takes place every Tuesday 11:45am-1:25pm and Thursday 2:50-4:30pm in Ryder Hall 126 (map).
- (11/16): Stratis Ioannidis : Distributing Frank-Wolfe via Map-Reduce (pdf)
- (11/9): Niccolo Meneghetti : Beta Probabilistic Databases: A Scalable Approach to Belief Updating and Parameter Learning (pdf)
- (10/12): Georgia Koutrika : User analytics for recommender systems
- (9/28): Jon Ullman : Differential privacy and data exploration
- (9/26): Cibele Freire : The complexity of resilience and responsibility for self-join-free conjunctive queries (pdf)