DATA Lab seminar: group meetings & guest speakers
In fall 2020 we are most meeting Wednesdays at noon
in WVH 462 (map) via Zoom for virtual brown bag lunch. Please subscribe to our DATA Lab talks email list or DATA Lab talks calendar if you think data is the future. When you sign up to the email list, please use a name we can recognize and let us know who you are.
(1/28, 12:00pm): Paolo Bold: Axioms for centrality details)ABSTRACT
Given a social network, which of its nodes are more central? This question has been asked many times in sociology, psychology and computer science, and a whole plethora of centrality measures (a.k.a. centrality indices, or rankings) were proposed to account for the importance of the nodes of a network. In this paper, we try to provide a mathematically sound survey of the most important classic centrality measures known from the literature and propose an axiomatic approach to establish whether they are actually doing what they have been designed for. Our axioms suggest some simple, basic properties that a centrality measure should exhibit. Surprisingly, only a new simple measure based on distances, harmonic centrality, turns out to satisfy all axioms; essentially, harmonic centrality is a correction to Bavelas's classic closeness centrality designed to take unreachable nodes into account in a natural way. As a sanity check, we examine in turn each measure under the lens of information retrieval, leveraging state-of-the-art knowledge in the discipline to measure the effectiveness of the various indices in locating web pages that are relevant to a query. While there are some examples of this comparisons in the literature, here for the first time we take into consideration centrality measures based on distances, such as closeness, in an information-retrieval setting. The results match closely the data we gathered using our axiomatic approach. Our results suggest that centrality measures based on distances, which have been neglected in information retrieval in favour of spectral centrality measures in the last years, are actually of very high quality; moreover, harmonic centrality pops up as an excellent general-purpose centrality index for arbitrary directed graphs.
(9/10, 12:00pm): Bert Huang (Tufts University): Speeding up Markov Random Field Inference and Learning details)TITLE
Speeding up Markov Random Field Inference and Learning
Markov random fields (MRFs) are powerful statistical models that can encode global distributions with local relationships over variables. Unfortunately, algorithms for learning and inference have historically suffered poor scalability and difficult software engineering. In this talk, I’ll cover two strategies we’ve developed to improve the scalability of MRF algorithms on modern hardware, both that build on aspects that have made deep learning successful. First, I’ll discuss an abstraction that enables multiple approximate inference algorithms to be computed via sparse matrix operations. Like the dense matrix backends that have made deep learning software highly portable, sparse matrix operations are low-level, general computations, they are often efficiently optimized on a variety of computing environments. This abstraction allows seamless implementation on many hardware backends, making MRF algorithms fast on current and future computers. Second, I’ll discuss provably convergent stochastic learning algorithms that exploit repeated structure in MRF models. Finally, I will discuss the big picture surrounding these modeling approaches and their role in the modern era.
* Sparse-Matrix Belief Propagation Reid Bixler, Bert Huang. Conference on Uncertainty in Artificial Intelligence, 2018.
* Block Belief Propagation for Parameter Learning in Markov Random Fields You Lu, Zhiyuan Liu, Bert Huang. AAAI Conference on Artificial Intelligence, 2019.
Bert Huang is an assistant professor in the Department of Computer Science and the Data Intensive Studies Center at Tufts University. He earned his Ph.D. from Columbia University in 2011, was a postdoctoral research associate at the University of Maryland, and previously was an assistant professor at Virginia Tech. His research addresses topics surrounding machine learning, including structured prediction, weakly supervised learning, and algorithmic fairness. His papers have been published at conferences including NeurIPS, ICML, UAI, and AISTATS, and he is an action editor for the Journal of Machine Learning Research.
(9/16, 12:00pm): Andreas Loukas (EPFL Lausanne): Erdoes Goes Neural: an Unsupervised Learning Framework for Combinatorial Optimization on Graphs details)TITLE
Erdoes Goes Neural: an Unsupervised Learning Framework for Combinatorial Optimization on Graphs
Combinatorial optimization (CO) problems are notoriously challenging for neural networks, especially in the absence of labeled instances. This work proposes an unsupervised learning framework for CO problems on graphs that can provide integral solutions of certiﬁed quality. Inspired by Erd˝os’ probabilistic method, we use a neural network to parametrize a probability distribution over sets. Crucially, we show that when the network is optimized w.r.t. a suitably chosen loss, the learned distribution contains, with controlled probability, a low-cost integral solution that obeys the constraints of the combinatorial problem. The probabilistic proof of existence is then derandomized to decode the desired solutions. We demonstrate the efﬁcacy of this approach to obtain valid solutions to the maximum clique problem and to perform local graph clustering. Our method achieves competitive results on both real datasets and synthetic hard instances.
I am a research scientist (Ambizione fellow) at the LTS2 lab in EPFL, Switzerland. Previously, I spent time at TU Berlin, Ben Gurion, and TU Delft. My research focuses on the foundations and applications of graph methods in machine learning and data science. I aim to find elegant explanations for phenomena associated with learning and to exploit them in order to design specialized learning machines. I am also interested in graph problems in signal processing and theoretical computer science, as well as in the analysis of neural networks.
(9/24, 12:00pm): Laurent Lessard (Northeastern): Automating the analysis and design of large-scale optimization algorithms details)TITLE
Automating the analysis and design of large-scale optimization algorithms
Most complicated optimization problems, in particular those involving a large number of variables, are solved in practice using iterative algorithms. The problem of selecting a suitable algorithm is currently more of an art than a science; a great deal of expertise is required to know which algorithms to try and how to properly tune them. Moreover, there are seldom performance guarantees. In this talk, I will show how the problem of algorithm selection can be approached using tools from robust control theory. By solving simple semidefinite programs (that do not scale with problem size), we can derive robust bounds on convergence rates for popular algorithms such as the gradient method, proximal methods, fast/accelerated methods, and operator-splitting methods such as ADMM. The bounds derived in this manner either match or improve upon the best known bounds from the literature. The bounds also lead to a natural energy dissipation interpretation and an associated Lyapunov function. Finally, our framework can be used to search for algorithms that meet desired performance specifications, thus establishing a principled methodology for designing new algorithms. The talk will be broadly accessible; no prior control theory knowledge needed!
Laurent Lessard is a new Associate Professor of Mechanical and Industrial Engineering at Northeastern University (joined in Fall 2020), with courtesy appointments in ECE and Khoury. Before coming to Northeastern, Laurent was a Charles Ringrose Assistant Professor of Electrical and Computer Engineering at the UW–Madison and Faculty Member at the Wisconsin Institute for Discovery. He received the B.A.Sc. in Engineering Science from the University of Toronto, and received the M.S. and Ph.D. in Aeronautics and Astronautics at Stanford University. After completing his doctoral work, he was an LCCC Postdoc at Lund University in Sweden, and a postdoctoral researcher at the University of California, Berkeley. Dr. Lessard is a recipient of the Hugo Schuck best paper award and the NSF CAREER award. His research interests include: decentralized control, robust control, optimization, and machine learning. For more information, see https://laurentlessard.com/
(9/30, 12:00pm): Xin Luna Dong (Amazon): Self-driving product understanding for thousands of categories details)TITLE
Self-driving product understanding for thousands of categories
Knowledge graphs have been used to support a wide range of applications and enhance search results for multiple major search engines, such as Google and Bing. At Amazon we are building a Product Graph, an authoritative knowledge graph for all products in the world. The thousands of product verticals we need to model, the vast number of data sources we need to extract knowledge from, the huge volume of new products we need to handle every day, and the various applications in Search, Discovery, Personalization, Voice, that we wish to support, all present big challenges in constructing such a graph.
In this talk we describe our efforts for self-driving knowledge collection for products of thousands of types. The system includes a suite of novel techniques for taxonomy construction, product property identification, knowledge extraction, anomaly detection, and synonym discovery. Our system is a) automatic, requiring little human intervention, b) multi-scalable, scalable in multiple dimensions including many domains, products, and attributes, and c) integrative, exploiting rich customer behavior logs. We describe what we learned in building this product graph and applying it to support customer-facing applications.
- Xin Luna Dong, Xiang He, Andrey Kan, Xian Li, Yan Liang, Jun Ma, Yifan Ethan Xu, Chenwei Zhang, Tong Zhao, Gabriel Blanco Saldana, Saurabh Deshpande, Alexandre Michetti Manduca, Jay Ren, Surender Pal Singh, Fan Xiao, Haw-Shiuan Chang, Giannis Karamanolakis, Yuning Mao, Yaqing Wang, Christos Faloutsos, Andrew McCallum, Jiawei Han. AutoKnow: Self-Driving Knowledge Collection for Products of Thousands of Types. In KDD, 2020.
- Giannis Karamanolakis, Jun Ma, Xin Luna Dong. TXtract: Taxonomy-aware knowledge extraction for thousands of product categories. In ACL, 2020.
- Yuning Mao, Tong Zhao, Andrey Kan, Chenwei Zhang, Xin Luna Dong, Christos Faloutsos, Jiawei Han. Octet: Online catalog taxonomy enrichment with self-supervision. In SigKDD, 2020.
Xin Luna Dong is a Senior Principal Scientist at Amazon, leading the efforts of constructing Amazon Product Knowledge Graph. She was one of the major contributors to the Google Knowledge Vault project, and has led the Knowledge-based Trust project, which is called the “Google Truth Machine” by Washington’s Post. She has co-authored book “Big Data Integration”, was awarded ACM Distinguished Member, VLDB Early Career Research Contribution Award for “advancing the state of the art of knowledge fusion”, and Best Demo award in Sigmod 2005. She serves in VLDB endowment and PVLDB advisory committee, and is a PC co-chair for VLDB 2021, KDD'2020 ADS Invited Talk Series, ICDE Industry 2019, VLDB Tutorial 2019, and Sigmod 2018.
(12/2, 12:00pm): Yinjun Wu (UPenn): Rapid retraining of machine learning models details)
(12/11, 12:00pm): Chris Jermaine (Rice): Tensor relational algebra for machine learning system design details)
(5/8, 12:00pm): Panagiotis Mandros (Max Planck Institute): Functional dependencies and how to find them details)TITLE
Functional Dependencies and how to find them
Functional dependencies come in many forms and shapes since they answer the fundamental question "what are the relevant features for a given target" that naturally occurs in many applications. These applications can range from database normalization, all the way to answering questions such as "does this bad habit cause this bad disease". In my talk, I will give a brief overview of these different FDs, and demonstrate how one can go from one end of the spectrum, i.e., database keys, to the other, i.e., causes and effects. By doing so, I hope to clarify a few things regarding which FD approach is appropriate for which application. At some point during the talk, I will present our solution to the FD discovery problem for exploratory data analysis. Our solution, DORA, finds the top FDs for a given target, is interpretable, i.e., facilitates the analysis of the results, is robust, i.e., the results can be trusted, and is very efficient in practice. Finally, I will end my talk by showing that DORA tackles the FD problem in a principled manner, contrary to a certain method that claims to have solved the FD problem once and for all.
* ICDM 2018: Discovering Reliable Dependencies from Data: Hardness and Improved Algorithms (best paper award)
Mandros, P, Boley, M, & Vreeken, J
* KDD 2017: Discovering Reliable Approximate Functional Dependencies
Mandros, P, Boley, M, & Vreeken, J
Panagiotis Mandros is a PhD candidate at the Max Planck Institute for Informatics and the University of Saarland, Germany. He works on the intersection of Information Theory and Optimization, developing theory and algorithms for discovering associations in high-dimensional data. He is particularly interested in Exploratory Data Analysis, i.e., the principle of exploring data without assumptions on the underlying data generating process, and has contributed methods for finding "patterns" in both unsupervised and supervised scenarios (e.g., functional dependencies, Markov blankets). His work is published at KDD, ICDM, and is the recipient of the 2018 IEEE ICDM best paper award.
(5/29, 12:30pm): Hung Ngo (Relational AI): Answering (Functional Aggregate) Queries via Tensor Decomposition details)TITLE
Answering (Functional Aggregate) Queries via Tensor Decomposition
In this talk I will present a framework to think about and to design efficient algorithms for answering a generic class of queries called Functional Aggregate Queries (or FAQs). FAQs can be used to capture queries from a while range of application areas: databases, machine learning, CSPs, for instance. The tensor decomposition framework can be used to explain recent advances in CSP and database query processing such as the notions of fractional hypertree widths, submodular widths, worst-case optimal join algorithms, or the junction tree algorithm in graphical models.
The talk is based on several joint works with Mahmoud Abo Khamis, Ryan Curtin, Ben Moseley, Dan Olteanu, Atri Rudra, Maximilian Schleich, and Dan Suciu.
* FANDA: https://cse.buffalo.edu/~hungngo/papers/long-panda-slides.pdf
* FAQ: https://cse.buffalo.edu/~hungngo/papers/faq-insideout.pdf
Hung Ngo is currently VP of Algorithms at relationalAI. Before that, he was a professor at SUNY Buffalo, and a Computer Scientist with LogicBlox. His current research interests are in designing and implementing algorithms for solving computational problems defined declaratively.
(6/5, 12:30pm): Nikos Tizavelis (Khoury): Optimal join algorithms meet top-k details)TITLE
Optimal Join Algorithms Meet Top-k
Top-k queries have been studied intensively in the database community and they are an important means to reduce query cost when only the “best” or “most interesting” results are needed instead of the full output. While some optimality results exist, e.g., the famous Threshold Algorithm, they hold only in a fairly limited model of computation that does not account for the cost incurred by large intermediate results and hence is not aligned with typical database-optimizer cost models. On the other hand, the idea of avoiding large intermediate results is arguably the main goal of recent work on optimal join algorithms, which uses the standard RAM model of computation to determine algorithm complexity. This research has created a lot of excitement due to its promise of reducing the time complexity of join queries with cycles, but it has mostly focused on full-output computation. We argue that the two areas can and should be studied from a unified point of view in order to achieve optimality in the common model of computation for a very general class of top-k-style join queries. This tutorial has two main objectives. First, we will explore and contrast the main assumptions, concepts, and algorithmic achievements of the two research areas. Second, we will cover recent, as well as some older, approaches that emerged at the intersection to support efficient ranked enumeration of join-query results. These are related to classic work on k-shortest path algorithms and more general optimization problems, some of which dates back to the 1950s. We demonstrate that this line of research warrants renewed attention in the challenging context of ranked enumeration for general join queries.
Based on an upcoming SIGMOD 2020 tutorial and VLDB 2020 paper:
(6/10, 12:00pm): Deepak Ajwani (UC Dublin): Average-case behavior of k-shortest path algorithms details)TITLE
Average-case behavior of k-shortest path algorithms
The k-shortest path problem is a generalization of the fundamental shortest path problem, where the goal is to compute k simple paths from a given source to a target node, in non-decreasing order of their weight. With numerous applications modeling various optimization problems and as a feature in some learning systems, there is a need for efﬁcient algorithms for this problem. Unfortunately, despite many decades of research, the best directed graph algorithm still has a worst-case asymptotic complexity of ˜O(kn(n+m)). In contrast to the worst-case complexity, many algorithms have been shown to perform well on small diameter directed graphs in practice. In this paper, we prove that the average-case complexity of the popular Yen’s algorithm on directed random graphs with edge probability p = Ω(logn)/n in the unweighted and uniformly distributed weight setting is O(kmlogn), thus explaining the gap between the worst-case complexity and observed empirical performance. While we also provide a weaker bound of O(kmlog 4 n) for sparser graphs with p ≥ 4/n, we show empirical evidence that the stronger bound should also hold in the sparser setting. We then prove that Feng’s directed k-shortest path algorithm computes the second shortest path in expected O(m) time on random graphs with edge probability p = Ω(logn)/n. Empirical evidence suggests that the average-case result for the Feng’s algorithm holds even for k > 2 and sparser graphs.
Complex networks 2018: https://researchrepository.ucd.ie/bitstream/10197/9887/2/ajwani_complex_networks18.pdf
Dr. Deepak Ajwani is an Assistant Professor at the School of Computer Science, University College Dublin. His research is at the confluence of diverse areas such as algorithms and data structures (with a focus on scalable graph algorithms), algorithm engineering, combinatorial optimization and machine learning.
He received his Ph.D. from Max Planck Institute for Informatics in 2008, for his work on I/O-efficient graph traversal algorithms. From 2008-2010, he worked as a Postdoctoral Researcher at MADALGO - Centre for Massive Data Algorithms - where he developed shared memory multicore algorithms for discrete optimization problems. From 2010-2012, he worked on an IRC funded project at University College Cork, in collaboration with IBM Research, for a project on designing graph partitioning and repartitioning techniques in the context of Exascale stream computing systems. From 2012-2018, he worked as a research scientist at Nokia Bell Labs, on the design and development of a cognitive computing tool for interpreting, organizing and navigating unstructured content.
(6/26, 12:00pm): Pablo Barcelo (PUC Chile): The Logical Expressiveness of Graph Neural Networks details)TITLE
The Logical Expressiveness of Graph Neural Networks
The ability of graph neural networks (GNNs) for distinguishing nodes in graphs has been recently characterized in terms of the Weisfeiler-Lehman (WL) test for checking graph isomorphism. This characterization, however, does not settle the issue of which Boolean node classifiers (i.e., functions classifying nodes in graphs as true or false) can be expressed by GNNs. We tackle this problem by focusing on Boolean classifiers expressible as formulas in the logic FOC2, a well-studied fragment of first order logic. FOC2 is tightly related to the WL test, and hence to GNNs. We start by studying a popular class of GNNs, which we call AC-GNNs, in which the features of each node in the graph are updated, in successive layers, only in terms of the features of its neighbors. We show that this class of GNNs is too weak to capture all FOC2 classifiers, and provide a syntactic characterization of the largest subclass of FOC2 classifiers that can be captured by AC-GNNs. This subclass coincides with a logic heavily used by the knowledge representation community. We then look at what needs to be added to AC-GNNs for capturing all FOC2 classifiers. We show that it suffices to add readout functions, which allow to update the features of a node not only in terms of its neighbors, but also in terms of a global attribute vector. We call GNNs of this kind ACR-GNNs.
ICLR 2020 paper: https://openreview.net/forum?id=r1lZ7AEKvB
Previous work on the relationship between WL and GNNs:
a) Weisfeiler and Leman Go Neural: Higher-Order Graph Neural Networks https://aaai.org/ojs/index.php/AAAI/article/view/4384
b) How Powerful are Graph Neural Networks? https://openreview.net/forum?id=ryGs6iA5Km
Pablo Barcelo is Full Professor at Pontificia Universidad Católica de Chile, where he also acts as Director of the Institute for Mathematical and Computational Engineering, and Deputy Director of Millennium Institute for Foundational Research on Data. He is the author of more than 70 technical papers, has chaired ICDT 2019, will be chairing ACM PODS 2022, and is currently a member of the editorial committee of Logical Methods in Computer Science. From 2011 to 2014 he was the editor of the database theory column of SIGMOD Record. His areas of interest are database theory, logic in computer science, and the emerging relationship between these areas and machine learning.
(7/1, 12:00pm): Jan Tönshoff (RWTH Aachen): Unsupervised Machine Learning for Constraint Satisfaction Problems details)TITLE
Unsupervised Machine Learning for Constraint Satisfaction Problems
Constraint Satisfaction Problems (CSPs) are a powerful theoretical framework for modeling a wide range of combinatorial problems. I will present a novel neural network architecture for solving CSPs, which was developed by our research group. The architecture is generic and can be applied to any binary CSP. Training is unsupervised and it suffices to train on small random instances. The trained networks generalize well to much larger and structurally complex instances. Experiments on Max-2-Sat, Max-Cut and Max-IS demonstrate that our approach matches or surpasses conventional algorithms on several NP-hard problems, while being simple, generic and efficient.
Graph Neural Networks for Maximum Constraint Satisfaction: https://arxiv.org/abs/1909.08387
I am a PhD student at RWTH Aachen University, where I work for Prof. Martin Grohe at the Chair of Logic and Theory of Discrete Systems. My research mainly focuses on applying machine learning to relational data structures, such as graphs, CSPs and databases.
(7/3, 12:00pm): Leopoldo Bertossi (Adolfo Ibanez University, Santiago, Chile): Score-Based Explanations in Data and Machine Learning details)TITLE
Score-Based Explanations in Data and Machine Learning
Explanations have become important in data management and machine learning. In this presentation we review some recent work on the use of counterfactual interventions, as they appear in actual causality, to provide scores that measure the contribution of database tuples to query answering, and that of feature values to results of classification models.
* Livshits, Bertossi, Kimelfeld, Sebag. The Shapley Value of Tuples in Query Answering. ICDT 2020.
* Bertossi, Li, Schleich, Suciu, Vagena. Causality-based Explanation of Classification Outcomes. DEEM@SIGMOD 2020.
* Bertossi. An ASP-Based Approach to Counterfactual Explanations for Classification. RuleML-RR 2020.
Leopoldo Bertossi is a Full Professor at the Faculty of Engineering and Sciences, "Universidad Adolfo Ibáñez" (Santiago, Chile), where he is the Director of the Graduate Program in Data Science. He is a Bachelor, Master and PhD in Mathematics by the "Pontificia Universidad Catolica de Chile" (PUC). His research has been concentrated in the areas of computational logic, knowledge representation, and data management. His research interests are related to Data Science and Artificial Intelligence in general. His is an Emeritus Professor of the School of Computer Science at Carleton University (Ottawa, Canada), Senior Computer Scientist at RelationalAI Inc., and Senior Researcher at the "Millenium Institute on Foundations of Data" (IMFD, Chile).
(7/8, 12:00pm): Ji-Yong Shin (Yale / Northeastern): Isolation in cloud storage systems details)TITLE
Isolation in cloud storage systems
Guaranteeing isolation in cloud storage systems is challenging due to heterogeneous user workloads and concurrent I/O requests. This talk focuses on improving two types of isolation support -- transactional isolation and fine-grained client-centric consistency control -- in cloud storage systems.
In the first part of the talk, the design and implementation of a block-level ACID transactional storage system called Isotope will be discussed. While transactional isolation has traditionally been implemented toward high levels of the storage stack, Isotope implements ACID transactions at the lowest common layer of the storage stack to fulfill the general need for concurrency control. Isotope demonstrates that the block-level transaction support trivializes the construction of transactional systems and even cross-application transactions.
The second part of the talk presents a new class of storage systems called StaleStore, which trades off consistency and performance within a storage server. StaleStore is designed based on the observation that modern servers are as powerful and parallel as distributed systems in the past and employs client-centric consistency (also known as distributed weak consistency) within a server. StaleStore accelerates server and distributed applications by returning an older version of data within a server if accessing the latest version is expected to be slower.
- Isotope conference version: https://www.usenix.org/system/files/conference/fast16/fast16-papers-shin.pdf
- Isotope journal version: https://dl.acm.org/doi/pdf/10.1145/3032967
- StaleStore conference version: https://dl.acm.org/doi/pdf/10.1145/2987550.2987579
Ji-Yong Shin is an associate research scientist in the Department of Computer Science at Yale University and will be joining Khoury College of Computer Science in the coming fall. His research interests are in designing novel distributed systems and exploring practical formal verification methods that can be applied to system designs. He received his Ph.D. from Cornell University, where he designed cloud storage systems with enhanced isolation support and a completely wireless datacenter.
(7/10, 12:00pm): Hongyang Zhang (Stanford / Wharton/ Northeastern): On the Generalization Effects of Linear Transformations in Data Augmentation details)TITLE
On the Generalization Effects of Linear Transformations in Data Augmentation
Data augmentation is a powerful technique to improve performance in applications such as image and text classification tasks. Yet, there is little rigorous understanding of why and how various augmentations work. In this work, we consider a family of linear transformations and study their effects on the ridge estimator in an over-parametrized linear regression setting. First, we show that transformations which preserve the labels of the data can improve estimation by enlarging the span of the training data. Second, we show that transformations which mix data can improve estimation by playing a regularization effect. Finally, we validate our theoretical insights on MNIST. Based on the insights, we propose an augmentation scheme that searches over the space of transformations by how uncertain the model is about the transformed data. We validate our proposed scheme on image and text datasets. For example, our method outperforms RandAugment by 1.24% on CIFAR-100 using Wide-ResNet-28-10. Furthermore, we achieve comparable accuracy to the SoTA Adversarial AutoAugment on CIFAR datasets.
Based on joint work with Sen Wu, Greg Valiant, and Chris Re that appeared in ICML’20: https://arxiv.org/abs/2005.00695.
Hongyang Zhang is a postdoctoral researcher at the Statistics Department at Wharton School, University of Pennsylvania. He received a Ph.D. from Stanford University and a B.E. from Shanghai Jiao Tong University, both in computer science. His research interests lie in the intersection of machine learning, algorithms and statistics, including topics such as non-convex optimization, multi-task and transfer learning, weakly supervised learning, neural networks and its theory. He is a recipient of the best paper award at COLT 2018.
(7/17, 12:00pm): Zach Ives (UPenn): Juneau: Managing and Guiding Data Science details)TITLE
Juneau: Managing and Guiding Data Science
Many modern data science applications build on data lakes, schema-agnostic repositories of data files and data products that offer limited organization and management capabilities. There is a need to build data lake search capabilities into data science environments, so scientists and analysts can find tables, schemas, workflows, and datasets useful to their task at hand. We develop search and management solutions for the Jupyter Notebook data science platform, to enable scientists to augment training data, find potential features to extract, clean data, and find joinable or linkable tables. Our core methods also generalize to other settings where computational tasks involve execution of programs or scripts.
CIDR 2019: http://cidrdb.org/cidr2019/papers/p55-ives-cidr19.pdf
SIGMOD 2020: https://dl.acm.org/doi/pdf/10.1145/3318464.3389726
Zachary Ives is the Department Chair and Adani President's Distinguished Professor of Computer and Information Science at the University of Pennsylvania. He is a co-founder of Blackfynn, Inc., a company focused on enabling life sciences research and discovery through data integration. Zack's research interests include data integration and sharing, managing "big data," sensor networks, and data provenance and authoritativeness. He is a recipient of the NSF CAREER award, and an alumnus of the DARPA Computer Science Study Panel and Information Science and Technology advisory panel. He has also been awarded the Christian R. and Mary F. Lindback Foundation Award for Distinguished Teaching. He is a co-author of the textbook Principles of Data Integration, and has received an ICDE 2013 ten-year Most Influential Paper award as well as the 2017 SWSA Ten-Year Award at the International Semantic Web Conference. He has been an Associate Editor for Proceedings of the VLDB Endowment (2014) and a Program Co-Chair (2015) and Group Leader (2018) for the ACM SIGMOD conference.
(7/22, 9:00am): Nofar Carmeli (Technion): Answering (Unions of) Conjunctive Queries using Random Access and Random-Order Enumeration details)TITLE
Answering (Unions of) Conjunctive Queries using Random Access and Random-Order Enumeration
As data analytics becomes more crucial to digital systems, so grows the importance of characterizing the database queries that admit a more efficient evaluation. We consider the tractability yardstick of answer enumeration with a logarithmic delay after a lineartime preprocessing phase. Such an evaluation is obtained by constructing, in the preprocessing phase, a data structure that supports logarithmic-delay enumeration. In this talk, we seek a structure that supports the more demanding task of a “random permutation”: logarithmic-delay enumeration in truly random order. Enumeration of this kind is required if downstream applications assume that the intermediate results are representative of the whole result set in a statistically valuable manner. An even more demanding task is that of a “random access”: logarithmic-time retrieval of an answer whose position is given. We establish that the free-connex acyclic CQs are tractable in all three senses: enumeration, random-order enumeration, and random access; and in the absence of self-joins, it follows from past results that every other CQ is intractable by each of the three (under some fine-grained complexity assumptions). However, the three yardsticks are separated in the case of a union of CQs (UCQ): while a union of free-connex acyclic CQs has a tractable enumeration, it may (provably) admit no random access. For such UCQs we devise a random-order enumeration whose delay is logarithmic in expectation.
PODS 2020: https://dl.acm.org/doi/pdf/10.1145/3375395.3387662
Nofar Carmeli is a Ph.D. student in the Data and Knowledge group at Technion, Israel Institute of Technology, advised by Prof. Benny Kimelfeld. Her research focuses on query optimization with guarantees using enumeration techniques. Nofar completed her BSc in 2015 in the Lapidim excellence program of the Computer Science Department of Technion.
(7/24, 12:00pm): Jorge Perez (Universidad de Chile): On the Turing Completeness of Modern Neural Network Architectures details)TITLE
On the Turing Completeness of Modern Neural Network Architectures
Alternatives to recurrent neural networks, in particular, architectures based on attention or convolutions, have been gaining momentum for processing input sequences. In spite of their relevance, the computational properties of these alternatives have not yet been fully explored. We study the computational power of two of the most paradigmatic architectures exemplifying these mechanisms: the Transformer (Vaswani et al., 2017) and the Neural GPU (Kaiser & Sutskever, 2016). We show both models to be Turing complete exclusively based on their capacity to compute and access internal dense representations of the data. In particular, neither the Transformer nor the Neural GPU requires access to an external memory to become Turing complete. Our study also reveals some minimal sets of elements needed to obtain these completeness results.
ICLR 2019: https://openreview.net/forum?id=HyGBdo0qFm
Jorge Pérez is Associate Professor in the Department of Computer Science at Universidad de Chile, and Associate Researcher at the Millennium Institute for Foundational Research on Data. His research interests include data exchange and integration, Web data, and the theory of modern neural network architectures. He has received several awards for his research, including the best paper award in 5 international conferences, the Microsoft Research PhD Fellowship, and the SWSA Ten-Years Award for his work on query languages for Web data. His interests also include the analysis of social, medical and political text data, particularly in Spanish.
(7/29, 12:00pm): Nima Dehmamy (Kellog @ Northwestern University): Understanding the representation power of graph neural networks in learning graph topology details)TITLE
Understanding the representation power of graph neural networks in learning graph topology
To deepen our understanding of graph neural networks, we investigate the representation power of Graph Convolutional Networks (GCN) through the looking glass of graph moments, a key property of graph topology encoding path of various lengths. We find that GCNs are rather restrictive in learning graph moments. Without careful design, GCNs can fail miserably even with multiple layers and nonlinear activation functions. We analyze theoretically the expressiveness of GCNs, concluding a modular GCN design, using different propagation rules with residual connections could significantly improve the performance of GCN. We demonstrate that such modular designs are capable of distinguishing graphs from different graph generation models for surprisingly small graphs, a notoriously difficult problem in network science. Our investigation suggests that, depth is much more influential than width, with deeper GCNs being more capable of learning higher order graph moments. Additionally, combining GCN modules with different propagation rules is critical to the representation power of GCNs.
NeurIPS paper 2019: https://arxiv.org/pdf/1907.05008
Nima Dehmamy is a research assistant professor at Northwestern Institute for Complex Systems in Kellogg School of Management at Northwestern University, Evanston IL, USA. He is a physicist working on complex systems and theory of machine learning. His research involves AI in graph learning, using physics to understand optimization landscapes and neuroscience. He earned his PhD in physics with Eugene Stanley at Boston University in 2016 and he was a postdoctoral fellow at the Barabasi Lab at Northeastern University, Boston, MA.
(7/31, 12:00pm): Donghui Zhang (Google): Spanner's SQL evolution details)TITLE
Spanner's SQL Evolution
This talk introduces Google's distributed RDBMS called Spanner. It will cover Spanner's SQL interface, distributed query processing, and some lessons learned, as well as blockwise columnar storage.
Donghui Zhang has been working in the database field since he got a Ph.D. from University of California, Riverside in 2002. His first job was an Assistant Professor at Northeastern University. In 2009 he left academia and joined the Microsoft Jim Gray Systems Lab where he researched on the best usage of Flash SSDs inside Microsoft SQL Server, and on lock-free hash table. Between 2012 and 2017 he led the engineering in Paradigm4, a Stonebraker startup building SciDB, a distributed computational database system. After that, he worked in Facebook's data warehouse for two years before joining Google's Spanner team.
(8/5, 12:00pm): Juan Sequeda (data.world): The Socio-Technical Phenomena of Data Integration details)TITLE
The Socio-Technical Phenomena of Data Integration
Data Integration has been an active area of computer science research for over two decades. A modern manifestation is as Knowledge Graphs which integrates not just data but also knowledge at scale. Tasks such as Domain modeling and Schema/Ontology Matching are fundamental in the data integration process. Research focus has been on studying the data integration phenomena from a technical point of view (algorithms and systems) with the ultimate goal of automating this task.
In the process of applying scientific results to real world enterprise data integration scenarios to design and build Knowledge Graphs, we have experienced numerous obstacles. In this talk, I will share insights about these obstacles. I will argue that we need to think outside of a technical box and further study the phenomena of data integration with a human-centric lens: from a socio-technical point of view.
A Pay-as-you-go Methodology to Design and Build Enterprise Knowledge Graphs from Relational Databases, ISWC 2019: https://github.com/juansequeda/papers/raw/master/iswc2019.pdf
Juan F. Sequeda is the Principal Scientist at data.world. He joined through the acquisition of Capsenta, a company he founded as a spin-off from his research. He holds a PhD in Computer Science from The University of Texas at Austin. Juan is the recipient of the NSF Graduate Research Fellowship, received 2nd Place in the 2013 Semantic Web Challenge for his work on ConstituteProject.org, Best Student Research Paper at the 2014 International Semantic Web Conference and the 2015 Best Transfer and Innovation Project awarded by the Institute for Applied Informatics. Juan is on the Editorial Board of the Journal of Web Semantics, member of multiple program committees (ISWC, ESWC, WWW, AAAI, IJCAI). He was the General Chair of AMW2018, PC chair of ISWC 2017 In-Use track, co-creator of COLD workshop (7 years co-located at ISWC). He has served as a bridge between academia and industry as the current chair of the Property Graph Schema Working Group, member of the Graph Query Languages task force of the Linked Data Benchmark Council (LDBC) and past invited expert member and standards editor at the World Wide Web Consortium (W3C). Wearing his scientific hat, Juan's goal is to reliably create knowledge from inscrutable data. His research interests are on the intersection of Logic and Data for (ontology-based) data integration and semantic/graph data management, and what now is called Knowledge Graphs. Wearing his business hat, Juan is a product manager, does business development and strategy, technical sales and works with customers to understand their problems to translated back to R&D.
(8/7, 12:00pm): Vikas Garg (MIT): Representation Limits and Generalization of Graph Neural Networks details)TITLE
Representation Limits and Generalization of Graph Neural Networks
Graphs provide a natural abstraction to model relational and strategic data in domains as diverse as biology (e.g., molecules), multiagent settings (e.g., online vendors on ecommerce platforms), and distributed systems (e.g., Internet). Graphs also find much use as theoretical objects (e.g., probabilistic graphical models), and several important algorithms (e.g., max-flow for image segmentation) can be invoked when tasks are formulated in terms of graphs. Graph neural networks are naturally suited for making predictions based on graphs but they remain poorly understood in terms of what they can and cannot do. We analyze whether GNNs can distinguish graphs that differ in properties such as cycles, but have similar local structure. We also provide the first data dependent generalization bounds for message passing GNNs.
The talk will be based on recent work with Stefanie Jegelka and Tommi Jaakkola.
ICML 2020: https://people.csail.mit.edu/tommi/papers/GJJ_ICML2020.pdf
Vikas Garg is a PhD student in Computer Science at MIT. His research interests in machine learning include generative models, graphical models, theory of deep learning, and learning under uncertainty or resource constraints along with their intersections with optimization and game theory.
(8/14, 12:00pm): Wang-Chiew Tan (Megagon Labs): Unleashing the Power of Subjective Data: Managing Experiences as First-Class Citizens details)TITLE
Unleashing the Power of Subjective Data: Managing Experiences as First-Class Citizens
Subjective data refers to data that contains opinions and experiences. Such data is ubiquitous in product reviews, tweets, and discussion forums in social media. Consumers today spend considerable time sifting through subjective data to make informed decisions about purchases. At Megagon Labs, we are building technologies to synthesize knowledge from subjective data and to facilitate searching over them.
In this talk, I will overview some of the technologies we have developed in the directions of subjective data preparation and integration, experience discovery and search, subjective knowledge representation, and data analytics and exploration.
* OpinionDigest: A Simple Framework for Opinion Summarization. https://arxiv.org/pdf/2005.01901.pdf. Yoshihiko Suhara, Xiaolan Wang, Stefanos Angelidis, Wang-Chiew Tan – ACL 2020
* ExtremeReader: An Interactive Explorer For Customizable And Explainable Review Summarization. https://megagon.ai/wp-content/uploads/2020/05/ExtremeReader.pdf. Xiaolan Wang, Yoshihiko Suhara, Natalie Nuno, Yuliang Li, Jinfeng Li, Nofar Carmeli, Stefanos Angelidis, Eser Kindogan, Wang-Chiew Tan – WWW 2020
* Deep Entity Matching with Pre-Trained Language Models. https://arxiv.org/pdf/2004.00584 Yuliang Li, Jinfeng Li, Yoshihiko Suhara, AnHai Doan, Wang-Chiew Tan
Wang-Chiew leads the research efforts at Megagon Labs with the goal of building advanced technologies to enhance search by experience. Her team conducts research on data integration, information extraction, text mining and summarization, knowledge base construction and commonsense reasoning, and data visualization. Prior to that, she was a Professor of Computer Science at University of California, Santa Cruz. She also spent two years at IBM Research - Almaden. She received her B.Sc. (First Class) in Computer Science from the National University of Singapore and her Ph.D. in Computer Science from the University of Pennsylvania.
(8/21, 12:00pm): Predrag Radivojac (Khoury): Graph kernels details)TITLE
We will discuss basics of learning on graphs via graph kernels. We will first cover basics of kernel methods, as well as graph classification and node classification with graph kernels. We will then extend graph kernel methods to edge classification and link prediction problems in an attempt to unify these problems as vertex classification on hypergraphs, a problem we will solve by developing edit-distance hypergraphlet kernels. We will show results on both directed and undirected graphs from social sciences and computational biology, with primary focus on undirected graphs and biological applications.
Predrag Radivojac is a Professor of Computer Science at Northeastern University, where he recently moved from Indiana University. Prof. Radivojac received his Bachelor's and Master's degrees in Electrical Engineering from the University of Novi Sad and University of Belgrade, Serbia. His Ph.D. degree is in Computer Science from Temple University (2003) under the direction of Prof. Zoran Obradovic and co-direction of Prof. Keith Dunker. In 2004 he held a post-doctoral position in Keith Dunker's lab at Indiana University School of Medicine, after which he joined Indiana University Bloomington. Prof. Radivojac's research is in the areas of computational biology and machine learning with specific interests in protein function, MS/MS proteomics, genome interpretation, and precision health. He received the National Science Foundation (NSF) CAREER Award in 2007 and is an August-Wilhelm Scheer Visiting Professor at Technical University of Munich (TUM) as well as an honorary member of the Institute for Advanced Study at TUM. At Indiana University, he was Associate Chair of the Department of Computer Science and a co-Director of all of Informatics and Data Science for the multi-campus Precision Health Initiative. Prof. Radivojac's projects have been regularly supported by NSF and National Institutes of Health (NIH). He is currently an Editorial Board member for the journal Bioinformatics, Associate Editor for PLoS Computational Biology, and serves his third term (elected) on the Board of Directors of the International Society for Computational Biology (ISCB).
(8/24, 12:00pm): Shantanu Jain (Khoury): Approaches and Algorithms for Positive Unlabeled Learning details)TITLE
Approaches and Algorithms for Positive Unlabeled Learning
Positive Unlabeled (PU) learning is an extreme case of class imbalance in binary classification, where labeled data from the negative class is not available. Typically, unlabeled examples are used as a substitute for the negatives to train a non-traditional classifier that separates them from the positives. Surprisingly, the score function of a non-traditional classifier provides an optimal ranking of the inputs. However, for many other tasks such as 1) thresholding the score function to predict the hard class labels, 2) estimating the classifiers performance, and 3) obtaining well calibrated scores, treating the unlabeled examples as negatives, leads to significant bias. The bias can be corrected by the knowledge of the class proportion, making class proportion estimation the central task in PU learning. I will describe the theoretical issue of non-uniqueness (unidentifiability) in the estimation of the class proportion and identify key assumptions that make the estimation feasible. I will then present a nonparametric algorithm AlphaMax for practical estimation. Next, I’ll present an empirical risk minimization based criteria for PU classification. Finally, I will have an open ended discussion on deriving label propagation type algorithms for PU learning and extensions of PU learning to DAG structured multilabel classification.
- Nonparametric semi-supervised learning of class proportions: https://arxiv.org/abs/1601.01944
- Estimating the class prior and posterior from noisy positives and unlabeled data, NIPS 2016: https://papers.nips.cc/paper/6168-estimating-the-class-prior-and-posterior-from-noisy-positives-and-unlabeled-data.pdf
- Recovering True Classifier Performance in Positive-Unlabeled Learning, AAAI 2017: https://aaai.org/ocs/index.php/AAAI/AAAI17/paper/view/14990
Shantanu Jain is an associate research scientist in the Khoury College of Computer Sciences at Northeastern University. He is interested in the field of statistical modeling and machine learning. His research focuses on developing semi-supervised methods under data constraints for which standard approaches lead to biased estimates. His recent work has addressed issues in binary classification and its evaluation that arise due to the absence of labeled examples from one of the classes (positive-unlabeled learning), incorrectly labeled examples and bias in the labeled examples. His research has been applied to many bioinformatics problems and mass spectrometry data.
(1/24, 12:00pm): Fatemeh Nargesian (University of Rochester): Organizing Data Lakes for Navigation details)TITLE
Organizing Data Lakes for Navigation
In this talk, I discuss data-driven discovery in data lakes and introduce the data lake organization problem, where the goal is to find a navigation structure that allows a user to most effectively navigate a data lake. An organization is defined as a graph that contains nodes representing sets of attributes within a data lake and edges indicating subset relationships between nodes. I present a new probabilistic model of how users interact with an organization and an approximate algorithm for efficiently constructing a data lake organization. I discuss quantitative results on both real data lakes containing data from open data portals and on benchmarks that emulate the observed characteristics of real data lakes. I also present the results of a user study showing that 42% of users preferred the use of navigation over keyword search, suggesting these are complementary and both useful modalities for data discovery in data lakes.
Fatemeh Nargesian is an assistant professor of computer science at the University of Rochester. Her primary research interests are in data discovery and data integration. She received her PhD from the University of Toronto.
(2/6, 1:35pm): BHS 030: Stratis Ioannidis (Northeastern): Learning from Comparisons details)TITLE
Learning from Comparisons
We consider learning from comparisons, i.e., learning from a dataset containing pairs of samples and labels indicating their relative order. For example, a medical expert presented with a pair of patient files can order the pair w.r.t. the relative severity of a disease. Comparisons are often less noisy than class labels: human labelers disagreeing when generating class judgments often exhibit reduced variability when asked to compare pairs of samples instead. Comparisons are also more informative, as they capture both inter and intra-class relationships; the latter are not revealed via class labels alone. We discuss inference algorithms in this setting, as well as means to identify which comparison labels to collect.
Stratis Ioannidis is an assistant professor in the Electrical and Computer Engineering Department of Northeastern University, in Boston, MA, where he also holds a courtesy appointment with the College of Computer and Information Science. He received his B.Sc. (2002) in Electrical and Computer Engineering from the National Technical University of Athens, Greece, and his M.Sc. (2004) and Ph.D. (2009) in Computer Science from the University of Toronto, Canada. Prior to joining Northeastern, he was a research scientist at the Technicolor research centers in Paris, France, and Palo Alto, CA, as well as at Yahoo Labs in Sunnyvale, CA. He is the recipient of an NSF CAREER Award, a Google Faculty Research Award, and a best paper award at ACM ICN 2017. His research interests span machine learning, distributed systems, networking, optimization, and privacy.
- A Severity Score for Retinopathy of Prematurity Peng Tian, Yuan Guo, Jayashree Kalpathy-Cramer, Susan Ostmo, J. Peter Campbell, Michael F. Chiang, Jennifer Dy, Deniz Erdogmus, and Stratis Ioannidis. *Knowledge Discovery and Data Mining (KDD) *, Anchorage, AK, 2019.
- Classification and Comparison via Neural Networks Peng Tian, Jennifer Dy, Deniz Erdogmus, James Brown, Jayashree Kalpathy-Cramer, Susan Ostmo, J. Peter Campbell, Michael F. Chiang, and Stratis Ioannidis. *Elsevier Journal of Neural Networks *, 2019.
- Experimental Design Under the Bradley-Terry Model Yuan Guo, Peng Tian, Jayashree Kalpathy-Cramer, Susan Ostmo, J. Peter Campbell, Michael F. Chiang, Deniz Erdogmus, Jennifer Dy, and Stratis Ioannidis. *International Joint Conference on Artificial Intelligence (IJCAI) *, Stockholm, Sweden, 2018.
- Accelerated Experimental Design for Pairwise Comparisons Yuan Guo, Jennifer Dy, Deniz Erdogmus, Jayashree Kalpathy-Cramer, Susan Ostmo, J. Peter Campbell, Michael F. Chiang, and Stratis Ioannidis. *SIAM International Conference on Data Mining (SDM) *, Calgary, Alberta, Canada, 2019.
(2/12, 1:35pm): BHS 030: Alina Oprea (Northeastern): Security analytics: the application of machine learning and artificial intelligence in cyber security details)
(2/19, 1:35pm): BHS 030: Cristina Nita-Rotaru (Northeastern): Designing (byzantine-)resilient distributed systems details)Title: Designing (byzantine-)resilient distributed systems
Speaker: Cristina Nita-Rotaru
Abstract: What makes bitcoin work? How do you build a highly-resilient infrastructure for the power-grid? How could a 43 seconds network partition have cascaded into an over a day of service unavailability for github?
In this talk, I will describe several fundamental concepts from distributed systems and show their applications to blockchains, resilient infrastructure, and cloud services. I will highlight the main challenges when building such systems and describe several ongoing research projects in my group.
Bio: Cristina Nita-Rotaru is a Professor of Computer Science in the Khoury College of Computer Sciences at Northeastern University. Prior to joining Northeastern she was a faculty in the Department of Computer Science at Purdue University from 2003 to 2015. Her research lies at the intersection of security, distributed systems, and computer networks. The overarching goal of her work is designing and building secure and resilient distributed systems and network protocols, with assurance that their deployed implementations provide their security, resilience, and performance goals. Her work received several best paper awards in IEEE SafeThings 2019, NDSS 2018, ISSRE 2017, DSN 2015 as well as two IETF/IRTF Applied Networking Research Prize in 2018 and 2016 for her work on QUIC. Cristina Nita-Rotaru is a recipient of the NSF Career Award in 2006.
Network and Distributed Systems Security Laboratory: https://nds2.ccs.neu.edu/
(2/26, 1:35pm): BHS 030: Predrag Radivojac (Northeastern): Deciphering molecular mechanisms of disease upon mutation via semi-suprvised learning details)Title: Deciphering molecular mechanisms of disease upon mutation via semi-suprvised learning
Speaker: Predrag Radivojac
Abstract: A major goal in computational biology is the development of algorithms, analysis techniques, and tools towards deep mechanistic understanding of life at a molecular level. In the process, computational biology must take advantage of the new developments in artificial intelligence and machine learning, and then extend beyond pattern analysis to provide testable hypotheses for experimental scientists. This talk will focus on our contributions to this process and relevant related work. We will first discuss the development of machine learning techniques for partially observable domains such as molecular biology; in particular, methods for accurate estimation of frequency of occurrence of hard-to-measure and rare events. We will show some identifiability results in parametric and nonparametric situations as well as how such frequencies can be used to correct estimated model accuracies. We will then show how these methods play key roles in inferring protein cellular roles and phenotypic effects of genomic mutations, with an emphasis on understanding the molecular mechanisms of human genetic disease. We further assessed the value of these methods in the wet lab where we tested the molecular mechanisms behind selected de novo mutations in a cohort of individuals with neurodevelopmental disorders. Finally, we will discuss implications on future research in machine learning, genome interpretation, and precision health.
Bio: Predrag Radivojac joined Northeastern University as a Professor in the Khoury College of Computer Sciences. Prior to joining Northeastern he was a Professor of Computer Science at Indiana University Bloomington and Associate Chair in the Department of Computer Science.
Prof. Radivojac’s primary research interests include computational biology and machine learning. He is motivated to improve our understanding of life at a molecular level and how molecular events affect higher level phenotypes. His group addresses such questions through the development of algorithms and analysis techniques related to the function of biological macromolecules, mass spectrometry proteomics, genome interpretation, and precision health; e.g., he is interested in elucidating the molecular mechanisms of disease consequent to genetic variation. In the area of machine learning, Prof. Radivojac’s research addresses foundational and applied problems in semi-supervised learning, structured-output learning, and active learning, and investigates topics such as kernels and distance functions (e.g., metrics) across data types and analysis techniques. He is also interested in performance evaluation of machine learning algorithms, especially in the hierarchical structured-output domains and cases of selection bias that often arise in the open world setting.
Prof. Radivojac received the National Science Foundation CAREER Award in 2007 and is an August-Wilhelm Scheer Visiting Professor at Technical University of Munich (TUM) as well as an honorary member of the Institute for Advanced Study at TUM. Prof. Radivojac co-directed all of data sciences and informatics within the multi-campus Precision Health Initiative of Indiana University. He is currently an Editorial Board member for the journal Bioinformatics, Associate Editor for PLoS Computational Biology, and serves his third (elected) term on the Board of Directors of the International Society for Computational Biology (ISCB).
(2/26, 2:50pm): WVH 362: Evimaria Terzi (Boston University): Simple models for optimizing driver earnings in ride-sharing platforms details)
Simple models for optimizing driver earnings in ride-sharing platforms
On-demand ride-hailing platforms like Uber and Lyft are helping reshape urban transportation, by enabling car owners to become drivers for hire with minimal overhead. Such platforms are a multi-sided market and offer a rich space for studies with socio-economic implications.
In this talk I am going to address two questions:
1. In the absence of coordination, what is the best course of action for a self-interested driver that wants to optimize his earnings?
2. In the presence of coordination, is it possible to maximize social welfare objectives in an environment where the objectives of the participants (drivers, customers and the platform) are (often) misaligned?
We will discuss the computational problems behind these problems and describe simple algorithmic solutions that work extremely well in practice. We will demonstrate the practical strength of our approaches with well-designed experiments on novel datasets we collected from such platforms.
Evimaria Terzi is an Associate Professor at the Computer Science Department of Boston University. She joined the department in 2009 after spending two years as a Research Staff Member at the IBM Almaden Research Center. Evimaria got her PhD from the University of Helsinki in 2007. Her research interests span a big range of data-centered problems. Currently she is interested in problems related to team formation, ride-sharing platforms as well as recommender systems. He research is funded by multiple NSF grants and she was a recipient of the NSF CAREER award (2013) and Microsoft Faculty Fellowship (2010).
(2/27, 1:35pm): BHC 030: Jon Ullman (Northeastern): (Preventing) Overfitting in Adaptive Data Analysis details)Title: (Preventing) Overfitting in Adaptive Data Analysis
Speaker: Jon Ullman
Abstract: How can we use a random sample to draw meaningful conclusions about populations, without overfitting to the sample? This problem remains vexing despite decades of study by statisticians. One cause of overfitting is so-called adaptive data analysis---the common practice where datasets are re-used for multiple analyses, and the outcome of one analysis can influence the choice of future analyses. Adaptivity can arise in a range of ways, including data shared across research groups, multi-stage statistical methods, and data science competitions with shared validation data. Unfortunately, adaptivity invalidates traditional methods for preventing overfitting because it breaks a fundamental assumption about the independence of the data and the method.
In this talk I will introduce a relatively recent approach to understanding and preventing false discovery in adaptive data analysis based on the concept of algorithmic stability. Specifically, I will introduce and motivate the problem of adaptive data analysis, describe a model for studying this problem, and demonstrate how overfitting can occur and, to some extent, be mitigated in this model.
Related blog: http://blog.mrtz.org/2015/03/09/competition.html
(3/11, 3:45pm): 362 WVH: Stavros Sintos (Duke University): Efficient algorithms for querying uncertain data details)TITLE
Efficient Algorithms for Querying Uncertain Data
Motivated by applications in sensor network, data cleaning, and data integration, there has been a growing interest in managing uncertain data. Uncertainty is typically captured using stochastic data models, and querying data requires either statistics about the probabilistic behavior of the underlying data, or data cleaning to reduce the uncertainty of the query answer. In the first part of the talk I will present efficient indexes for answering range-max queries on uncertain data. Given a collection of uncertain points in the d-dimensional space, where each point is associated with a value, the goal is to preprocess the points to an index such that given a query rectangle R, compute the expected maximum or the most likely maximum of the values of the points that lie in R. Next, I will present data cleaning problems with applications in Fact-Checking. Given a claim, made by a politician, over uncertain data our goal is to ascertain the quality of the claim, which is measured by a function f. We propose algorithms for choosing a subset of objects to clean with cost at most C to i) minimize the variance of f or ii) maximize the surprise of f. I will conclude by discussing current research and future research directions.
Stavros Sintos is a Ph.D. candidate in the Department of Computer Science at Duke University under supervision of Prof. Pankaj K. Agarwal. His main research interest is in the design of efficient algorithms for problems in databases, data mining, and computational geometry. In particular, he works on designing practical algorithms and indexes with theoretical guarantees for problems in data summarization and data uncertainty. An important aspect of his research has been on combining geometric optimization with query processing. He is also a recipient of the James B. Duke Fellowship. Before joining Duke he obtained his BS in the Department of Computer Science at University of Ioannina in Greece. For more details please see http://www.cs.duke.edu/~ssintos.
* Range-Max Queries on Uncertain Data (PODS 2016)
*Selecting Data to Clean for Fact Checking: Minimizing Uncertainty vs. Maximizing Surprise (PVLDB 2019)
* Other papers: https://sites.google.com/view/stavros-sintos/publica
(3/12, 1:35pm): Zoom: Tina Eliassi-Rad (Northeastern): Network science details)Title: Geometric and Topological Graph Analysis for Machine Learning
Speaker: Tina Eliassi-Rad
Abstract:This talk has two parts: (1) geometric analysis for graph embedding and (2) topological analysis for graph distances. First, graph embedding seeks to build an accurate low-dimensional representation of a graph. This low-dimensional representation is then used for various downstream tasks such as link prediction. One popular approach is Laplacian Eigenmaps, which constructs a graph embedding based on the spectral properties of the Laplacian matrix of a graph. The intuition behind it, and many other embedding techniques, is that the embedding of a graph must respect node similarity: similar nodes must have embeddings that are close to one another. We dispose of this distance-minimization assumption. In its place, we use the Laplacian matrix to find an embedding with geometric properties (instead of spectral ones) by leveraging the simplex geometry of the graph. We introduce Geometric Laplacian Eigenmap Embedding (or GLEE for short) and demonstrate that it outperforms various other techniques (including Laplacian Eigenmaps) in the tasks of graph reconstruction and link prediction. This work is joint with Leo Torres and Kevin Chan, and is forthcoming in the Journal of Complex Networks. Second, measuring graph distance is a fundamental task in graph mining. For graph distance, determining the structural dissimilarity between networks is an ill-defined problem, as there is no canonical way to compare two networks. Indeed, many of the existing approaches for network comparison differ in their heuristics, efficiency, interpretability, and theoretical soundness. Thus, having a notion of distance that is built on theoretically robust first principles and that is interpretable with respect to features ubiquitous in complex networks would allow for a meaningful comparison between different networks. We rely on the theory of the length spectrum function from algebraic topology, and its relationship to the non-backtracking cycles of a graph, in order to introduce the Non-Backtracking Spectral Distance (NBD) for measuring the distance between undirected, unweighted graphs. NBD is interpretable in terms of features of complex networks such as presence of hubs and triangles. We showcase the ability of NBD to discriminate between networks in both real and synthetic data sets. This work is joint with Leo Torres and Suarez-Serrato, and was published in the Journal of Applied Network Science in June 2019.
Bio: Tina Eliassi-Rad is an Associate Professor of Computer Science at Northeastern University in Boston, MA. She is also a core faculty member at Northeastern University's Network Science Institute. Prior to joining Northeastern, Tina was an Associate Professor of Computer Science at Rutgers University; and before that she was a Member of Technical Staff and Principal Investigator at Lawrence Livermore National Laboratory. Tina earned her Ph.D. in Computer Sciences (with a minor in Mathematical Statistics) at the University of Wisconsin-Madison. Her research is rooted in data mining and machine learning; and spans theory, algorithms, and applications of big data from networked representations of physical and social phenomena. She has over 100 peer-reviewed publications (including a few best paper and best paper runner-up awardees); and has given over 200 invited talks and 14 tutorials. Tina's work has been applied to personalized search on the World-Wide Web, statistical indices of large-scale scientific simulation data, fraud detection, mobile ad targeting, cyber situational awareness, and ethics in machine learning. Her algorithms have been incorporated into systems used by the government and industry (e.g., IBM System G Graph Analytics) as well as open-source software (e.g., Stanford Network Analysis Project). In 2017, Tina served as the program co-chair for the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (a.k.a. KDD, which is the premier conference on data mining) and as the program co-chair for the International Conference on Network Science (a.k.a. NetSci, which is the premier conference on network science). In 2020, she is serving as the program co-chair for the International Conference on Computational Social Science (a.k.a. IC2S2, which is the premier conference on computational social science). Tina received an Outstanding Mentor Award from the Office of Science at the US Department of Energy in 2010; and became a Fellow of the ISI Foundation in Turin Italy in 2019.
(4/1, 1:35pm): Zoom: David Lazer: Fake news on Twitter during the 2016 U.S. presidential election details)Title: Fake news on Twitter during the 2016 U.S. presidential election
Speaker: David Lazer
Abstract:The spread of fake news on social media became a public concern in the United States after the 2016 presidential election. We examined exposure to and sharing of fake news by registered voters on Twitter and found that engagement with fake news sources was extremely concentrated. Only 1% of individuals accounted for 80% of fake news source exposures, and 0.1% accounted for nearly 80% of fake news sources shared. Individuals most likely to engage with fake news sources were conservative leaning, older, and highly engaged with political news. A cluster of fake news sources shared overlapping audiences on the extreme right, but for people across the political spectrum, most political news exposure still came from mainstream media outlets.Paper: https://northeastern-datalab.github.io/cs3950/download/lazer.paper.pdf
(4/3, 12:00pm): Zoom: Yanlei Diao (UMass Amherst): AIDEme -- An active learning based system for interactive exploration of large datasets details)TITLE
AIDEme: An active learning based system for interactive exploration of large datasets
There is an increasing gap between fast growth of data and limited human ability to comprehend data. Consequently, there has been a growing demand for analytics tools that can bridge this gap and help the user retrieve high-value content from data. In this talk, I introduce AIDEme, a scalable interactive data exploration system for efficiently learning a user interest pattern over a large dataset. The system is cast in a principled active learning (AL) framework, which iteratively presents strategically selected records for user labeling, thereby building an increasingly-more-accurate model of the user interest. However, existing AL techniques experience slow convergence when learning the user interest on large datasets. To overcome the problem, AIDEme explores properties of the user labeling process and the class distribution of observed data to design new AL algorithms, which come with provable results on model accuracy and approximation, and have evaluation results showing much improved convergence over existing AL methods while maintaining interactive speed.
Yanlei Diao is Professor of Computer Science at the University of Massachusetts Amherst, USA, and Ecole Polytechnique, France. Her research interests lie in big data analytics and scalable intelligent information systems, with a focus on interactive data exploration, explainable anomaly detection, optimization in cloud analytics, data streams, and uncertain data management. She received her PhD in Computer Science from the University of California, Berkeley in 2005.
Prof. Diao was a recipient of the 2016 ERC Consolidator Award, 2013 CRA-W Borg Early Career Award (one female computer scientist selected each year for outstanding contributions), IBM Scalable Innovation Faculty Award, and NSF Career Award. She spoke at the Distinguished Faculty Lecture Series at the University of Texas at Austin and Technische Universitaet Darmstadt. She has served as Editor-in-Chief of the ACM SIGMOD Record, Associate Editor of ACM TODS, Chair of the ACM SIGMOD Research Highlight Award Committee, and member of the SIGMOD and PVLDB Executive Committees. She was PC Co-Chair of IEEE ICDE 2017 and ACM SoCC 2016, and served on the organizing committees of SIGMOD, PVLDB, and CIDR, as well as on the program committees of many international conferences and workshops.
* A Factorized Version Space Algorithm for" Human-In-the-Loop" Data Exploration, ICDM 2019
* AIDEme: An active learning based system for interactive exploration of large datasets, NeurIPS 2019
* Optimization for Active Learning-based Interactive Database Exploration, VLDB 2018
(5/10, 12:00pm): Predrag Radivojac (Khoury / Northeastern): A new class of metrics for learning on real-valued and structured data details)TITLE
A new class of metrics for learning on real-valued and structured data
We propose a new class of metrics on sets, vectors, and functions that can be used in various stages of data mining, including exploratory data analysis, learning, and result interpretation. These new distance functions unify and generalize some of the popular metrics, such as the Jaccard and bag distances on sets, Manhattan distance on vector spaces, and Marczewski-Steinhaus distance on integrable functions. We prove that the new metrics are complete and show useful relationships with $f$-divergences for probability distributions. To further extend our approach to structured objects such as ontologies, we introduce information-theoretic metrics on directed acyclic graphs drawn according to a fixed probability distribution. We conduct empirical investigation to demonstrate the effectiveness on real-valued, high-dimensional, and structured data. Overall, the new metrics compare favorably to multiple similarity and dissimilarity functions traditionally used in data mining, including the Minkowski ($L^p$) family, the fractional $L^p$ family, two $f$-divergences, cosine distance, and two correlation coefficients. We provide evidence that they are particularly appropriate for rapid processing of high-dimensional and structured data in distance-based learning.
(5/15, 12:00pm): Leilani Battle (University of Maryland): Behavior-Driven Optimizations for Big Data Exploration details)TITLE
Behavior-Driven Optimizations for Big Data Exploration
With massive amounts of data coming from field sensors, sequencers, mobile devices, etc., data-driven decision making has become increasingly important in industry, government, and the sciences. One of the key issues for analysts and scientists who work with large datasets is efficient visualization of their data to extract patterns, observe anomalies, and debug their analysis workflows. Though a variety of visualization tools exist to help people make sense of their data, these tools often rely on database management systems (or DBMSs) for data processing and storage; and unfortunately, DBMSs fail to process the data fast enough to support a fluid, interactive visualization experience.
My work blends optimization techniques from databases and methodology from HCI and visualization in order to support interactive exploration of large datasets. In this talk, I will first discuss Sculpin, a visual exploration system that learns user exploration patterns automatically, and exploits these patterns to pre-fetch data ahead of users as they explore. I will then discuss ongoing work to better understand how analysts explore data in more sophisticated analysis systems, such as Tableau Desktop. Finally, I will report on ongoing efforts to standardize the way we evaluate visual data analysis systems in general.
SIGMOD 2016: http://www.cs.umd.edu/~leilani/static/papers/forecache_cr_sigmod2016.pdf
HILDA 2018: http://www.cs.umd.edu/~leilani/static/papers/battle_visualization-centered-evaluation_hilda_2018.pdf
Leilani Battle is an Assistant Professor at the University of Maryland, College Park, with a joint appointment in the University of Maryland Institute for Advanced Computer Studies (UMIACS). She is also affiliated with the UMD Human-Computer Interaction Laboratory (HCIL). Her research interests focus on developing interactive data-intensive systems that can aid analysts in performing complex data exploration and analysis. Her current research is anchored in the field of databases, but utilizes research methodology and techniques from HCI and visualization to integrate data processing (databases) with interactive interfaces (HCI, visualization). She is an NSF Graduate Research Fellowship Recipient (2012), and her research is currently supported by an NSF CISE CRII Award (2019). In 2017, she completed a postdoc in the UW Interactive Data Lab. She holds an MS (2013) and PhD (2017) in Computer Science from MIT, where she was a member of the MIT Database Group, and a BS in Computer Engineering from UW (2011), where she was a member of the UW database group.
(5/31, 12:00pm): Raymond Chi-Wing Wong (The Hong Kong University of Science and Technology): FindYourFavorite: An Interactive System for Finding the User's Favorite Tuple in the Database details)TITLE
FindYourFavorite: An Interactive System for Finding the User's Favorite Tuple in the Database
When faced with a database containing millions of tuples, an end user might be only interested in finding his/her favorite tuple in the database. In this talk, we study how to help an end user to find such a favorite tuple with a few user interations. In each interaction, a user is presented with a small number of tuples (which can be artificial tuples outside the database or true tuples inside the database) and s/he is asked to indicate the tuple s/he favors the most among them. Different from the previous work which displays artificial tuples to users during the interaction and requires heavy user interactions, we achieve a stronger result. Specifically, we use a concept, called the utility hyperplane, to model the user preference and an effective pruning strategy to locate the favorite tuple for a user in the whole database. Based on these techniques, we developed an interactive system, called FindYourFavorite, and demonstrate that the system could identify the favorite tuple for a user with a few user interactions by always displaying true tuples in the database.
SIGMOD 2018: http://home.cse.ust.hk/~raywong/paper/sigmod18-regret.pdf
Raymond Chi-Wing Wong is an Associate Professor in Computer Science and Engineering (CSE) of The Hong Kong University of Science and Technology (HKUST). He is currently the director of the Risk Management and Business Intelligence (RMBI) program. He was the director of the Computer Engineering (CPEG) program from 2014 to 2016 and was the associate director of the Computer Engineering (CPEG) program from 2012 to 2014. He received the BSc, MPhil and PhD degrees in Computer Science and Engineering in the Chinese University of Hong Kong (CUHK) in 2002, 2004 and 2008, respectively. In 2004-2005, he worked as a research and development assistant under an R&D project funded by ITF and a local industrial company called Lifewood. From May 2006 to Aug 2006, he was a visiting student of Prof. Jian Pei and Prof. Ke Wang, at Simon Fraser University in Canada. From Aug 2007 to Sept 2007, he visited IBM T.J. Watson Research Center as a summer intern under the supervision of Prof. Philip S. Yu. From Jun 2008 to Jul 2008, he visited Prof. Tamer Ozsu at University of Waterloo as a visiting scholar. Some of his collaborators are Prof. Ada Fu (2003-2013), Prof. Ke Wang (2003-2013), Prof. Philip S. Yu (2009-2013), Prof. Jian Pei (2006-2011), Prof. Tamer Ozsu (2009-2011), Prof. Jiuyong Li (2006-2011), Prof. Yufei Tao (2007-2009), Prof. Ihab Ilyas (2009), Prof. Jeffrey Yu (2009),Prof. Lei Chen (2008) and Prof. Eamonn Keogh (2008). He received 28 awards. He published 69 conference papers (e.g., SIGMOD, SIGKDD, VLDB, ICDE and ICDM), 31 journal/chapter papers (e.g., TODS, DAMI, TKDE, VLDB journal and TKDD) and 1 book. He reviewed papers from conferences and journals related to data mining and database, including VLDB conference, SIGMOD, TODS, VLDB Journal, TKDE, TKDD, ICDE, SIGKDD, ICDM, DAMI, DaWaK, PAKDD, EDBT and IJDWM. He is a program committee member of conferences, including SIGMOD, VLDB, ICDE, KDD, ICDM and SDM, and a referee of journals, including TODS, VLDBJ, TKDE, TKDD, DAMI and KAIS.
(6/21, 12:00pm): Wolfgang Gatterbauer (Khoury): Anytime Approximation in Probabilistic Databases via Scaled Dissociations details)TITLE
Anytime Approximation in Probabilistic Databases via Scaled Dissociations
Speeding up probabilistic inference remains a key challenge in probabilistic databases (PDBs) and the related area of statistical relational learning (SRL). Since computing probabilities for query answers is #P-hard, even for fairly simple conjunctive queries, both the PDB and SRL communities have proposed a number of approximation techniques over the years. The two prevalent techniques are either (i) MCMCstyle sampling or (ii) branch-and-bound (B&B) algorithms that iteratively improve model-based bounds using a combination of variable substitution and elimination.
We propose a new anytime B&B approximation scheme that encompasses all prior model-based approximation schemes proposed in the PDB and SRL literature. Our approach relies on the novel idea of “scaled dissociation” which can improve both the upper and lower bounds of existing model-based algorithms. We apply our approach to the well-studied problem of evaluating self-join-free conjunctive queries over tuple-independent PDBs, and show a consistent reduction in approximation error in our experiments on TPC-H, Yago3, and a synthetic benchmark setting.
Based on joint work with Maarten Van den Heuvel, Peter Ivanov, Floris Geerts, and Martin Theobald.
SIGMOD 2019: http://gatterbauer.name/download/sigmod2019_anytime_approximations_scaled_dissociations.pdf
(9/20, 12:00pm): Oktie Hassanzadeh (IBM): Knowledge Graphs for Data Curation & Data Science details)TITLE
Knowledge Graphs for Data Curation & Data Science
In this informal talk, I will go over a number of projects at IBM Research that leverage knowledge graphs for large-scale data exploration, data curation, and data science. In particular, I will describe an application of knowledge graphs for large-scale data exploration and curation in the context of the IBM Research Helix project, where large-scale similarity analysis and automated linkage discovery enable the discovery and construction of data sets from a large number of data sets (or a data lake) for a given analytics task. I will then describe our ongoing project and recent progress on using knowledge graphs for automating data science. I will discuss a number of ways knowledge graphs can assist with feature engineering and the challenges we faced in implementing a prototype solution. If time permits, I will briefly go over some of our recent work on knowledge extraction from text.
* Automated Feature Enhancement for Predictive Modeling using External Knowledge, ICDM 2019 (Demo Track, To Appear)
* Exploring Big Data with Helix: Finding Needles in a Big Haystack. SIGMOD Record 43(4): 43-54 (2014) https://doi.org/10.1145/2737817.2737829
* Instance-Based Matching of Large Ontologies Using Locality-Sensitive Hashing. International Semantic Web Conference (1) 2012: 49-64 https://doi.org/10.1007/978-3-642-35176-1_4
* Inducing Implicit Relations from Text Using Distantly Supervised Deep Nets. International Semantic Web Conference (1) 2018: 38-55 https://doi.org/10.1007/978-3-030-00671-6_3 (ISWC 2017 Semantic Web Challenge Winning Solution, see blog post here: https://www.ibm.com/blogs/research/2017/11/knowledge-base-construction-iswc-2017/)
* Answering Binary Causal Questions Through Large-Scale Text Mining: An Evaluation Using Cause-Effect Pairs from Human Experts. IJCAI 2019: 5003-5009 https://doi.org/10.24963/ijcai.2019/695
Dr. Oktie Hassanzadeh is a Research Staff Member at IBM Research. He has received several academic and corporate awards for his work in data management and knowledge graphs. He is a recipient of the top prize at the prestigious Semantic Web Challenge at ISWC 2017, as well as two best-paper awards at ESWC 2018 and ESWC 2016 conferences. He has received his M.Sc. and Ph.D. degrees from the University of Toronto, where he received the IBM PhD fellowship and the Yahoo! Key Scientific Challenges awards. He is also a two-time recipient of the first prize at the Triplification Challenge at the SEMANTiCS Conference for his projects in the areas of Semantic Technologies and Linked Data. For more information, refer to his home page: http://researcher.watson.ibm.com/person/us-hassanzadeh
(9/27, 12:00pm): Boris Glavic (Illinois Institute of Technology): Relevance-based Data Management details)TITLE
Relevance-based Data Management
In this talk, I give an overview of my group's work on relevance-based data management, a novel data management paradigm centered around the idea of analyzing what data is needed to answer queries. For the most part, the talk will focus on one particular approach which we refer to as provenance-based data skipping. Since the inception of relational databases, index structures and other physical design techniques have been used to speed-up access to data. In spite of the wide variety of mature techniques that are available, there are important classes of selective queries, e.g., top-k queries, which cannot benefit from these techniques. This issue arises because database systems statically analyze incoming queries to determine which data is needed for answering which query. However, for many queries it is impossible to determine up-front what data is relevant. Provenance-based data skipping overcomes this limitation by generating provenance sketches which are concise encodings of what data is relevant for a query. Similar to query answering with views, a provenance sketch captured for one query is used to speed up the subsequent evaluation of other queries. In contrast to materialized views, provenance sketches are guaranteed to be small (100s of bytes) and can benefit from physical database design artifacts such as indexes, horizontal partitioning of tables, and zone maps. Furthermore, I will highlight open problems in provenance-based data skipping and discuss additional applications of relevance-based data management techniques in general.
IEEE Data Engineering Bulletin: GProM - A Swiss Army Knife for Your Provenance Needs IEEE Data Engineering Bulletin http://sites.computer.org/debull/A18mar/p51.pdf
Boris Glavic is an Associate Professor of Computer Science at the Illinois Institute of Technology where he leads the IIT DBGroup (http://www.cs.iit.edu/~dbgroup/ ). Before coming to IIT, Boris spend to two years as a PostDoc at the University of Toronto working with Renée J. Miller. He received his PhD from the University of Zurich in Switzerland being advised by Michael Böhlen and Gustavo Alonso. Boris is a database researcher building comprehensive systems based on solid theoretical foundations. His main research interests are provenance, data integration and cleaning, and query processing and optimization.
(10/11, 12:00pm): Owen O'Malley (Cloudera): Big Data's journey to ACID details)TITLE
Big Data's journey to ACID
Originally, the Hadoop ecosystem had no ability to handle updates and the user had to completely re-write the data to update it. Over the years, the community built the ability to have point updates in key-value stores (eg HBase) and adding new partitions of data via Hive. Five years ago, we added the first ability for Hive to use standard SQL operations to update your data. Recently, there have been multiple new systems to address similar needs: Apache Iceberg, Apache Hudi, and Databricks Delta. I'll talk about compare the new systems to Hive ACID, discussing their trade offs and approaches.
Owen O’Malley was a founder and technical fellow at Hortonworks. Hortonworks was founded in 2011 with 25 employees spinning out from Yahoo, went public in 2014, and grew to 1,400 employees before merging with Cloudera in 2019. Cloudera’s platform includes Hadoop and the large ecosystem of big data open source tools that enterprises need for their data analytics. Owen has been working on Hadoop since the beginning of 2006, was the first committer added to the project, and used Hadoop to set the Gray sort benchmark in 2008 and 2009. Over the years, he was the technical lead for MapReduce, Hadoop Security, ORC file, and adding ACID transactions to Hive.
(10/11, 3:00pm): Nan Tang (Qatar Computing Research Institute, QCRI): Data Preparation meets Data Cleaning details)TITLE
Data Preparation meets Data Cleaning
In this talk, I will first provide a brief overview of my research in data preparation, including data discovery, data cleaning and integration, and some systems (such as Data Civilizer, a project with MIT, and Columbus, a project with Wisconsin) to put different pieces together for data preparation. Afterwards, I will discuss why data visualization is important for data preparation, and what interesting research topics can happen when putting these two together, based on which I will present three works: (1) DeepEye: a visualization recommendation system, which was motivated by how to help users easily understand the discovered datasets in data preparation; (2) Interactive data cleaning for progressive visualization, which is to cheaply solve the problem (by 10-15 user interactions) that data visualization is bad if the data is very dirty; and (3) Cymphony: a generic system for collaborative task management, motivated by the practical scenarios that for many tasks (such as manual data cleaning, manually checking fake news), a team of users need to work collaboratively.
ICDE 2018: http://da.qcri.org/ntang/pubs/icde2018deepeye.pdf
EDBT 2018 vision paper: http://da.qcri.org/ntang/pubs/edbt2018.pdf
SIGMOD 2018 demo: http://da.qcri.org/ntang/pubs/deepeyesigmod2018demo.pdf
ICDE 2020 paper “Interactive cleaning for progressive visualization through composite questions”
Dr. Nan Tang is a senior scientist at Qatar Computing Research Institute (QCRI), HBKU, Doha, Qatar. Prior to joining QCRI, he has worked at Univeristy of Edinburgh (2010-2011) and CWI (2008-2010). He got his PhD. degree from The Chinese University of Hong Kong (2007). His research focuses on data preparation including data cleaning, entity matching, and building systems to support these tasks, and data visualization, especially on visualization recommendation and cleaning. He serves the community in many PCs, such as SIGMOD, PVLDB, ICDE, KDD. You can find more about me from my homepage: http://da.qcri.org/ntang/index.html
(12/4, 12:00pm): Wolfgang Gatterbauer (Khoury): Scalable Compatibility Estimation from Extremely Sparse Graphs details)TITLE
Scalable Compatibility Estimation from Extremely Sparse Graphs
Node classification is an important problem of graph management and Belief Propagation (BP) methods, along with its linearization, have been widely used for graphs with arbitrary compatibilities between classes. These methods rely on a prior specified compatibility matrix between classes, which is commonly given by domain experts or heuristics. Thus an important unresolved questions is: Can we do better in deriving compatibilities from the actual data in a principled and scalable way? We address this problem by indeed estimating the compatibilities on graphs with very sparse labels (e.g., 1 in 10,000 nodes is labeled). We propose methods to perform this estimation in a fraction of time it later takes to label the remaining nodes, which makes it a cheap preprocessing step for any existing label propagation method. Our approach consists of two steps: first, we create multiple consistent and compact graph statistics; and second, perform estimation on these far smaller representations. We show that the labeling accuracy of our proposed estimator is comparable to using the "correct" compatibilities and that our estimator is orders of magnitude faster than standard approaches based on train-test sets.
(12/9, 12:00pm): Oliver Kennedy (U of Buffalo): Safe, Reusable Heuristic Data Transformation through Caveats details)TITLE
Safe, Reusable Heuristic Data Transformation through Caveats
Through extensive surveys of data scientists in the field, we've identified numerous data transformation challenges, including time-series alignment, outlier management, entity resolution, parsing, and more. In each use case, a correct, fully general solution would have ben worthy of a PhD Thesis. By contrast, each of these use cases admitted a simple heuristic tailored to the specific dataset and analysis being performed, which could be scripted or applied manually in hours or even minutes. Although full generality is clearly not worth the price, dataset- and analysis- specific heuristics make it difficult to re-use the transformed dataset or transformation workflow.
In this talk, I introduce a form of lightweight data annotation called Caveats that make it easier to re-use both heuristically transformed datasets and heuristic transformation workflows. Caveats combine "where" provenance with techniques from incomplete data management to mark specific data values and records as dependent on a data transformation heuristic --- marked data elements are subject to change if the heuristic is altered. I will show how caveats can be seamlessly integrated into heuristic transformation workflows, introduce our lightweight scheme for propagating caveats through SQL queries, and demonstrate our prototype notebook, Vizier, built around Caveats.
https://odin.cse.buffalo.edu/papers/2019/submitted/CIDR-CrumbyNotebooks.pdf (to appear)
https://vizierdb.info/ (try it out!)
Oliver Kennedy is an associate professor at the University at Buffalo. He earned his PhD from Cornell University in 2011 and now leads the Online Data Interactions (ODIn) lab, which operates at the intersection of databases and programming languages. Oliver is the recipient of an NSF CAREER award, UB's Exceptional Scholar Award, and the UB SEAS Early Career Teacher of the Year Award. Several of his papers have been invited to "Best of" compilations from SIGMOD and VLDB. The ODIn lab is currently exploring uncertain data management, just-in-time data structure design, and "small data" management, as well as implementing VizierDB, a reproducibility- and reusability-focused notebook.
(1/9, 12:00pm): Kathleen Fisher (Tufts): Using a Declarative Description Language to Tame Ad Hoc Data: An Overview of the PADS project details)TITLE
Using a Declarative Description Language to Tame Ad Hoc Data: An Overview of the PADS project
Kathleen Fisher (Tufts University)
The goal of the PADS project (http://www.padsproj.org) is to make it easier for data analysts to extract useful information from ad hoc data files. This talk gives an overview of the project and how it helps bridge the gap between the unmanaged world of ad hoc data and the managed world of typed programming languages and databases. In particular, the paper reviews the design of PADS data description languages, describes the generated parsing tools and discusses the importance of meta-data. It also sketches the formal semantics, discusses useful tools and how can they can be generated automatically from PADS descriptions, and describes an inferencing system that can learn useful PADS descriptions from positive examples of the data format.
Kathleen Fisher is a Professor in and the Chair of the Computer Science Department at Tufts. Previously, she was a program manager at DARPA where she started and managed the HACMS and PPAML programs, a Consulting Faculty Member in the Computer Science Department at Stanford University, and a Principal Member of the Technical Staff at AT&T Labs Research. Kathleen is an ACM Fellow. She has served as Program Chair for PLDI, OOPSLA, ICFP, CUFP, and FOOL, and as General Chair for ICFP 2015. She is a former Associate Editor for TOPLAS and a former editor of the Journal of Functional Programming. Kathleen is a past Chair of the ACM Special Interest Group in Programming Languages (SIGPLAN) and past Co-Chair of CRA's Committee on the Status of Women (CRA-W). Kathleen is a recipient of the SIGPLAN Distinguished Service Award. She is Vice Chair of DARPA's ISAT Study Group and a member of the Board of Trustees of Harvey Mudd College.
Kathleen Fisher, David Walker: The PADS project: an overview. ICDT 2011: 11-17
(1/11, 12:00pm): Xiaoyu Liu (MIT): Kyrix: Interactive Visual Data Exploration at Scale details)TITLE
Kyrix: Interactive Visual Data Exploration at Scale
Scalable interactive visual data exploration is crucial in many domains due to increasingly large datasets generated at rapid rates. Details-on-demand provides a useful interaction paradigm for exploring large datasets, where the user starts at an overview, finds regions of interest, zooms in to see detailed views, zooms out and then repeats. This paradigm is the primary user interaction mode of widely used systems such as Google Maps, Aperture Tiles and ForeCache. These earlier systems, however, are highly customized with hardcoded visual representations and optimizations. A more general framework is needed to facilitate the development of visual data exploration systems at scale. We present Kyrix, an end-to-end system for developing scalable details-on-demand data exploration applications. Kyrix provides the developer with a declarative model for easy specification of general visualizations. Behind the scenes, Kyrix utilizes a suite of performance optimization techniques to achieve a response time within 500 ms for various user interactions. We also report results from a performance study which shows that a novel dynamic fetching scheme adopted by Kyrix outperforms tile-based fetching used in traditional systems.
Xiaoyu is a research intern at MIT data system group. Before that, she completed her master at Purdue, ECE department. She is interested in building scalable and reliable data-intensive systems.
(1/23, 12:00pm): Panayiotis Tsaparas (University of Ioannina): Finding Patterns in Temporal and Flow Networks details)TITLE
Finding Patterns in Temporal and Flow Networks
Networks are natural models for complex systems consisting of multiple interconnected entities. Many of these systems are dynamic, with connections appearing and disappearing over time. In the resulting networks, edges are annotated with time stamps, resulting in a graph history consisting of multiple network snapshots over time. We refer to such networks as temporal networks (graphs).
In this talk we consider two knowledge extraction problems on temporal networks. First, we consider the problem of finding lasting dense subgraphs. We provide formal definitions of graph density for a temporal graph, and we formulate the BFF problem that seeks a subset of nodes that are densely connected over time. Furthermore, we also consider the problem of finding a set of graph snapshots in which there are dense subsets of nodes. We study the complexity of the problems and propose exact, approximation and heuristic algorithms. Experiments indicate that our approach can find interesting patterns in collaboration and word co-occurrence networks.
In the second problem we assume that edges are also associated with a value, indicating a flow between the two nodes at a specific time stamp. Such networks appear naturally in practice in the case of money exchange networks (e.g., bitcoin), or traffic networks. We are interested in finding motifs in these networks, that is, small subgraphs that appear more often than random. Our motifs are restricted temporally and with respect to flow: all interactions must happen within a given time-window, and they must involve a minimum amount of flow. We propose efficient algorithms for finding such motifs and experiment with them on real networks.
Konstantinos Semertzidis, Evaggelia Pitoura, Evimaria Terzi, Panayiotis Tsaparas. Finding lasting dense subgraphs. ECML/PKDD Journal Track (DAMI), 2019 (to appear)
Chrysanthi Kosyfaki, Nikos Mamoulis, Evaggelia Pitoura, Panayiotis Tsaparas. Flow motifs in interaction networks. EDBT 2019 (to appear).
Panayiotis Tsaparas completed his undergraduate studies at the Department of Computer Science at University of Crete, Greece in 1995. He continued his graduate studies at University of Toronto, where he received his M.Sc., and Ph.D. degree, under the supervision of Allan Borodin. After graduation, he worked as a post-doctoral fellow at University of Rome, “La Sapienza”, and at University of Helsinki, and as a researcher at Microsoft Research. Since 2011 he joined the Department of Computer Science and Engineering at University of Ioannina, where he is now an Associate Professor. His research interests include Social Network Analysis, Algorithmic Data Mining, Web Mining and Information Retrieval.
(1/24, 12:00am): Northeast Database Day 2019 details)
(1/25, 1:15pm): Fatemeh Nargesian (University of Toronto): Table Union and Navigation details)TITLE
Table Union and Navigation
Preparing data for advanced analytics is prohibitively time-consuming and expensive for all but the best-trained and best-funded engineers. Nevertheless, the success of trained systems often depends on data and generated features more than powerful statistical algorithms. Among the data management challenges that need to be addressed is data enrichment which is the discovery and integration of meaningful data in data lakes for a data science task. However, there are many challenges to overcome: (1) the sheer size of data, (2) the unique distributions and characteristics of data lakes, and (3) the probabilistic and human-in-the-loop nature of data discovery. In this talk, I discuss two prevalent data discovery scenarios. In the first scenario, the query is a dataset and the data scientist is interested in interactively finding datasets that can be integrated (e.g unioned) with the query. I will introduce a probabilistic framework for finding and aligning unionable tables with a query table and discuss the need for distribution-aware techniques for data discovery. In the second scenario, search does not start with a query, instead, it is data-driven. I will talk about data lake organization problem where the goal is to find an organization (a directed acyclic graph) that allows a user to most efficiently navigate data lakes. I will present a probabilistic navigation model of how users interact with an organization and introduce a scalable structure learning algorithm for optimizing data lake organizations.
Nargesian, Zhu, Pu, Miller: Table Union Search on Open Data, PVLDB 2018.
Fatemeh Nargesian is a PhD student in the Data Curation Group of the Department of Computer at University of Toronto. Her research focuses on optimizing and automating data preparation for end-to-end data science, encompassing dataset discovery in data lakes and enriching datasets with new features. While at University of Toronto, Fatemeh was a joint Research intern at IBM Research-NY. Prior to University of Toronto, Fatemeh worked at the Clinical Informatics research group at McGill University on clinical data management, and received M.Sc. degrees in computer science and artificial intelligence at University of Toronto and Sharif University of Technology.
(1/28, 12:00pm): Stefano Ceri (Politecnico di Milano): WVH 366: Data-Driven Genomic Computing: Making Sense of the Signals from the Genome details)TITLE
Data-Driven Genomic Computing: Making Sense of the Signals from the Genome
Genomic computing is a new science focused on understanding the functioning of the genome, as a premise to fundamental discoveries in biology and medicine. Next Generation Sequencing (NGS) allows the production of the entire human genome sequence at a cost of about 1000 US $; many algorithms exist for the extraction of genome features, or "signals", including peaks (enriched regions), variants, or gene expression (intensity of transcription activity). The missing gap is a system supporting data integration and exploration, giving a “biological meaning” to all the available information; such a system can be used, e.g., for better understanding cancer or how environment influences cancer development.
The GeCo Project (Data-Driven Genomic Computing, ERC Advanced Grant, 2016-2021) has the objective or revisiting genomic computing through the lens of basic data management, through models, languages, and instruments, focusing on genomic data integration. Starting from an abstract model, we developed a system that can be used to query processed data produced by several large Genomic Consortia, including Encode and TCGA; the system employs internally the Spark engine, and prototypes can already be accessed from Polimi, from Cineca (Italian supercomputing center) and from the Broad Institute in Cambridge. During the five-years of the ERC project, the system will be enriched with data analysis tools and environments and will be made increasingly efficient. Among the objectives of the project, the creation of an “open source” repository of public data, available to biological and clinical research through queries, web services and search interfaces.
Stefano Ceri is professor of Database Systems at the Dipartimento di Elettronica, Informazione e Bioingegneria (DEIB) of Politecnico di Milano. His research work covers four decades (1978-2018) and has been generally concerned with extending database technologies in order to incorporate new features: distribution, object-orientation, rules, streaming data; with the advent of the Web, his research has been targeted towards the engineering of Web-based applications and to search systems. More recently he turned to genomic computing. He authored over 350 publications (H-index 75) and authored or edited 15 books in English. He is the recipient of two ERC Advanced Grants: "Search Computing (SeCo)" (2008-2013), focused upon the rank-aware integration of search engines in order to support multi-domain queries and “Data-Centered Genomic Computing (GeCo)” (2016-2021), focused upon new abstractions for querying and integrating genomic datasets. He is the recipient of the ACM-SIGMOD "Edward T. Codd Innovation Award" (New York, June 26, 2013), an ACM Fellow and a member of Academia Europaea.
(1/28, 12:30pm): Marco Brambilla (Politecnico di Milano): Extraction of Evolving Knowledge from Social Media details)TITLE
Extraction of Evolving Knowledge from Social Media
Knowledge in the world continuously evolves. Ontologies that aim at formalizing this knowledge are largely incomplete, especially regarding data belonging to the so-called long tail. On the other side, informal sources such has social media are typically very up to date with respect to facts, events and relations between real-world entities. We propose a method for discovering emerging knowledge by extracting it from social content. Once initialized by domain experts, the method is capable of finding relevant entities by means of a mixed syntactic-semantic method. The method uses seeds, i.e. prototypes of emerging entities provided by experts, for generating candidates; then, it associates candidates to feature vectors built by using terms occurring in their social content and ranks the candidates by using their distance from the centroid of seeds, returning the top candidates. Our method can run iteratively, using the results as new seeds. The talk will describe the different extraction techniques, the advantages obtained by combining them, and the results of the experiments performed with the different methods.
Extracting Emerging Knowledge from Social Media. WWW 2017 https://dl.acm.org/citation.cfm?id=3052697
Iterative Knowledge Extraction from Social Networks. WWW Comp. 2018 https://dl.acm.org/citation.cfm?id=3191578
Marco Brambilla is associate professor at Politecnico di Milano. His research interests include data science, domain specific modeling languages and design patterns, crowdsourcing, social media monitoring, and big data analysis. He has been visiting researcher at CISCO, San Josè, and University of California, San Diego. He has been visiting professor at Dauphine University, Paris. He is co-founder of the startups Fluxedo, focusing on social media analysis and Social engagement, and WebRatio, devoted to software modeling tools for Web, Mobile and Business Process based software applications. He is author of various international books and research articles in journals and conferences, with over 200 papers. He was awarded various best paper prizes and gave keynotes and speeches at many conferences and organisations. He runs research projects on data science and industrial projects on data-driven innovation and big data. He is the main author of the OMG standard IFML.
(1/30, 12:00pm): Evimaria Terzi (BU): Active Matrix Completion Problems details)TITLE
Active Matrix Completion Problems
In many applications, e.g., recommender systems and traffic monitoring, the data comes in the form of a matrix that is only partially observed. A fundamental data-analysis task for these datasets is matrix completion, where the goal is to accurately infer the entries missing from the matrix. In this talk we will consider matrices that satisfy one of the two — very common — assumptions: low rank and positive definite. In both cases we will consider the active version of the matrix-completion problem that says: given access to an oracle that can obtain some (small) number of entries of the original matrix which entries shall we query so that we achieve the least completion error. We will demonstrate how this question can be transformed into elegant combinatorial problems and we will discuss algorithms for solving them. We will also argue that our combinatorial formulations are very different from existing work in the matrix-completion literature.
TALK BASED ON TWO PAPERS:
Charalampos Mavroforakis, Dóra Erdös, Mark Crovella, Evimaria Terzi: Active Positive-Definite Matrix Completion. SDM 2017.
Natali Ruchansky, Mark Crovella, Evimaria Terzi: Matrix Completion with Queries. KDD 2015.
Evimaria Terzi is an associate professor at the Computer Science Department at Boston University, where she also serves as an Associate Chair. Before joining BU in 2009, she was a research scientist at IBM Almaden Research Center. Evimaria has received her Ph.D. from University of Helsinki, Finland and her MSc from Purdue University. Evimaria is a recipient of the Microsoft Faculty Fellowship (2010) and NSF CAREER award and multiple NSF awards. Her research interests span a wide range of data-mining topics including algorithmic problems arising in recommendation systems, online social networks and social media.
(2/1, 12:00pm): Huy Ngyuen (Northeastern): Submodular Maximization with Nearly-optimal Approximation and Adaptivity details)TITLE
Submodular Maximization with Nearly-optimal Approximation and Adaptivity
In this talk, we present recent progress on understanding the tradeoff between the approximation guarantee and adaptivity for submodular maximization. The adaptivity of an algorithm is the number of sequential rounds of queries it makes to the evaluation oracle of the function, where in every round the algorithm is allowed to make polynomially-many parallel queries. Adaptivity is an important consideration in settings where the objective function is estimated using samples and in applications where adaptivity is the main running time bottleneck. We present nearly-optimal algorithms for submodular maximization subject to a variety of constraints.
Huy Lê Nguyen is an Assistant Professor of Computer Science in the College of Computer and Information Science at Northeastern University. Prior to joining Northeastern, he was a Research Assistant Professor at the Toyota Technological Institute in Chicago and before that, a Google Research Fellow at the Simons Institute at University of California, Berkeley. He received his PhD in Computer Science from Princeton University. Professor Nguyen is broadly interested in the design and analysis of algorithms, with an emphasis on algorithmic techniques for massive data sets and machine learning.
(2/13, 12:00pm): Sarah Ostadabbas (Northeastern): Physics-based Simulation to Bootstrap Learning in Small Data Domains details)TITLE
Physics-based Simulation to Bootstrap Learning in Small Data Domains
Deep learning (DL) regularly obliterate records for regression and classification tasks that have previously seen only incremental accuracy improvements. Furthermore, the training process is much more automated than classical techniques, no longer requiring a huge investment in feature selection and dataset pruning. However, this performance comes at a large data cost, frequently requiring upwards of 10^9 data/label pairs. Our digital economy has provided many problems for which such data exists or can be obtained cheaply relative to the benefits. There are many other fields that would get significant benefit from DL, but where data collection or labelling is expensive. This is very common for medical and military applications where data collection and/or labeling is expensive, individualized, and protected by very strong privacy or classification laws. Many applications will benefit from a learning framework with deep structure that works with limited labeled training samples, integrates domain-knowledge into the model, and maximizes the generalization of learning across domains. In this talk, I introduce our proposed 3-step framework that enables training of accurate and robust learning models under data limitation constraints based the use of simulation as its generative models. This framework includes: (i) employment of physics-based computational models referred to as simulation; (ii) design and analysis of unsupervised domain adaption techniques to close the gap between the simulated and real-world data distributions through a low-dimensional subspace transformation; (iii) development of learning techniques in the projected subspace to train an initial weak labeler; (iv) combined use of the weak labeler and a generative-adversarial framework to refine the simulated datasets by employing on a set of unlabeled real-world dataset in order to train a strong labeler; and (v) development and analysis of active learning techniques to select the most informative datasets to refine and adapt the strong labeler into a novel case with small data in the target application.
Inner Space Preserving Generative Pose Machine
Background Subtraction via Fast Robust Matrix Completion
Professor Ostadabbas is an assistant professor in the Electrical and Computer Engineering Department of Northeastern University (NEU), Boston, Massachusetts, USA. Professor Ostadabbas joined NEU in 2016 from Georgia Tech, where she was a post-doctoral researcher following completion of her PhD at the University of Texas at Dallas in 2014. At NEU, Professor Ostadabbas is the director of the Augmented Cognition Laboratory (ACLab) with the goal of enhancing human information-processing capabilities through the design of adaptive interfaces via physical, physiological, and cognitive state estimation. These interfaces are based on rigorous models adaptively parameterized using machine learning and computer vision algorithms. In particular, she has been integrating domain knowledge with machine learning by using physics-based simulation as generative models for bootstrapping deep learning recognizers. Professor Ostadabbas is the co-author of more than 50 peer-reviewed journal and conference articles and her research has been awarded by the National Science Foundation (NSF), Mathworks, Amazon AWS, and NVIDIA. She is the co-organizer of the Multimodal Data Fusion (MMDF2018) workshop, an NSF PI mini-workshop on Deep Learning in Small Data, and will be the program chair of the upcoming Machine Learning in Signal Processing (MLSP2019) conference. Prof. Ostadabbas is an associate editor of the IEEE Transactions on Biomedical Circuits and Systems, on the Editorial Board of the IEEE Sensors Letters and Digital Biomarkers Journal, and has been serving in several signal processing and machine learning conferences as a technical chair or session chair.
(2/15, 12:00pm): Olga Papaemmanouil (Brandeis University): Deep Learning meets Query Optimization details)TITLE
Deep Learning meets Query Optimization
Query optimization remains one of the most important and well studied problems in database systems. However, traditional query optimizers are complex, heuristically-driven systems that do not to learn from past experiences: they plan the execution of a query, but are ignorant of the actual performance of the picked plan. Because of the lack of feedback, a query optimizer may select the same bad query plan repeatedly, never learning from its previous good or bad choices.
In this talk, I will argue that a new type of query optimizer, one that integrates deep learning with query optimization, can drastically improve on the state-of-the-art. Towards this direction, I will discuss ReJOIN, a proof-of-concept join enumerator that relies on deep reinforcement learning. ReJOIN leverages prior experience, and learns how to optimize future queries more effectively (i.e., discovers better query plans) and efficiently (i.e., spending less time on optimization) compared with traditional optimizers. I will discuss potential challenges for future research and describe deep learning approaches that can lead the way to end-to-end learning-based query optimizers.
Olga Papaemmanouil is an Associate Professor in the Department of Computer Science at Brandeis University. She received her undergraduate degree in Computer Science and Informatics at the University of Patras, Greece in 1999. In 2001, she received her Sc.M. in Information Systems at the University of Economics and Business, Athens, Greece. She then joined the Computer Science Department at Brown University, where she completed her Ph.D in Computer Science at Brown University in 2008. Her research interests are in databases and distributed data management. She is the recipient of an NSF Career Award (2013) and a Paris Kanellakis Fellowship from Brown University (2002).
(3/5, 3:30pm): Laura Di Rocco (Northeastern): The role of geographical knowledge in subcity level geolocation algorithms details)ABSTRACT
Geolocation of microblog messages has been largely investigated in the literature. Many solutions exist, most of whichare data-driven (i.e., they rely on a training phase) and achieve good results at city-level. The development of algorithmsfor geolocation at sub-city level is still an open problem. In this paper, we investigate the role that external geographic knowledge can play in achieving reasonably accuratesub-city level geolocation. We propose a knowledge-base method, called Sherloc, to accurately geolocate messages at sub-city level, by exploiting the presence of toponyms in the message. Sherloc extracts the semantics of toponyms from geographic gazetteers and embeds them into a metric space that captures the semantic distance among them. This way we can identify the semantically closest toponyms to a microblog message and, at the same time, cluster them withrespect to their spatial locations. In contrast to the state of the art methods, Sherloc requires no prior training, it is not limited to geolocating on a fixed spatial grid, and is able to infer the location at sub-city level with higher accuracy.
(3/29, 11:00am): Yongjoo Park (University of Michigan): Bringing Statistical Tradeoffs to Data Systems details)TITLE
Bringing Statistical Tradeoffs to Data Systems
Despite advances in computing power, the cost of large-scale analytics and machine learning remains daunting to small and large enterprises alike. This has created a pressing demand for reducing infrastructure costs and query latencies. To meet these goals, data analysts and applications are in many cases willing to tolerate a slight—but controlled—degradation of accuracy in exchange for substantial gains in cost and performance, which we refer to as statistical tradeoffs. This is particularly true in the early stages of data exploration and is in stark contrast to traditional tradeoffs where the infrastructure costs must increase for higher performance.
My research builds large-scale data systems that can make these statistical tradeoffs in a principled manner. In this talk, I will focus on two specific directions. First, I will present VerdictDB, a system that enables quality-guaranteed, statistical tradeoffs without any changes to backend infrastructure; thus, it offers a universal solution for off-the-shelf query engines. Second, I will introduce Database Learning, a new query execution paradigm that allows existing query engines to constantly learn from their past executions and become “smarter” over time without any user intervention. I will conclude by briefly discussing other promising directions with emerging workloads beyond SQL, including visualization and machine learning.
(4/2, 10:00am): WVH 155/168: Volker Markl & members of his research group details)Four students from Prof Volker Markl's group in TU Berlin will present their work within 15min each (after general introductions by Renee and Volker).
Bio: Volker Markl is a Full Professor and Chair of the Database Systems and Information Management (DIMA) Group at the Technische Universität Berlin (TU Berlin). At the German Research Center for Artificial Intelligence (DFKI), he is both a Chief Scientist and Head of the Intelligent Analytics for Massive Data Research Group. In addition, he is Director of the Berlin Big Data Center (BBDC) and Co- Director of the Berlin Machine Learning Center (BzMl). Earlier in his career, he was a Research Staff Member and Project Leader at the IBM Almaden Research Center in San Jose, California, USA and a Research Group Leader at FORWISS, the Bavarian Research Center for Knowledge-based Systems located in Munich, Germany. Dr. Markl has published numerous research papers on indexing, query optimization, lightweight information integration, and scalable data processing. He holds 18 patents, has transferred technology into several commercial products, and advises several companies and startups. He has been both the Speaker and Principal Investigator for the Stratosphere Project, which resulted in a Humboldt Innovation Award as well as Apache Flink, the open-source big data analytics system. He serves as the President-Elect of the VLDB Endowment and was elected as one of Germany's leading Digital Minds (Digitale Köpfe) by the German Informatics (GI) Society. Most recently, Volker and his team earned an ACM SIGMOD Research Highlight Award 2016 for their work on “Implicit Parallelism Through Deep Language Embedding.” Volker Markl and his team earned an ACM SIGMOD Research Highlight Award 2016 for their work on implicit parallelism through deep language embedding.
LINKS: Prof Markl's group website: http://www.dima.tu-berlin.de
Speaker: Sebastian Breß
Title: Efficient Data Processing on Modern Hardware
Abstract: Processor manufacturers build increasingly specialized processors to mitigate the effects of the power wall in order to deliver improved performance. Currently, database engines have to be manually optimized for each processor which is a costly and error prone process. In this talk, we provide an overview of our research on automatic performance tuning in Hawk, a hardware-tailored code generator. Our key idea is to create processor-specific code variants and to learn a well- performing code variant for each processor. These code variants leverage various parallelization strategies and apply both generic and processor-specific code transformations. We observe that performance of different code variants may diverge up to two orders of magnitude. Thus, we need to generate custom code for each processor for peak performance. To this end, Hawk automatically finds efficient code variants for CPUs, GPUs, and MICs.
Bio: Sebastian Breß received his PhD (Dr.-Ing.) from University of Magdeburg, Germany in 2015, under the supervision of Gunter Saake (University of Magdeburg) and Jens Teubner (TU Dortmund). He is the the initiator and system architect of the research database system CoGaDB and the Hawk Code Generator. Currently, Sebastian is a Senior Researcher at German Research Center for Artificial Intelligence (DFKI) and a PostDoc at Technische Universität Berlin, working with Prof. Dr. Volker Markl and Prof. Dr. Tilmann Rabl. Sebastian‘s research interests include data management on modern hardware, query compilation, stream processing, and optimizing data management systems for heterogeneous processors. Sebastian has been selected as a Distinguished Program Committee Member at SIGMOD 2018.
Speaker: Jonas Traub
Title: On-Demand Data Stream Gathering in the Internet of Things
Abstract: In the Internet of Things (IoT), billions of sensor nodes form a sensor cloud and offer data streams to analysis systems. However, it is impossible to transfer all available data with maximal frequencies to all applications. In this talk, we show how we gather streaming data efficiently from a huge number of sensor nodes and how we make it available to a huge number of applications. Our key-idea is to gather data streams based on the data demand of streaming queries. The data demand of a query is the minimum number of data points which allows for answering the query with the desired precision. We present a solution which shares sensor nodes among many concurrent streaming queries by multiplexing their data demands. Our technique shares sensor reads and corresponding network traffic among all queries to save costs. Our experiments with real-world sensor data show that our approach saves up to 87% in data transmissions.
Bio: Jonas is a Research Associate at Technische Universität Berlin and the German Research Center for Artificial Intelligence (DFKI). His research interests include data stream processing, sensor data analysis, and data acquisition from sensor nodes. Jonas authored several publications related to data stream gathering, processing and transmission in the Internet of Things and will complete his PhD in March 2019 under the supervision of Volker Markl. Before he started his PhD, Jonas wrote his master thesis at the Royal Institute of Technology (KTH) and the Swedish Institute of Computer Science (SICS) / RISE in Stockholm under supervision of Seif Haridi and Volker Markl and advised by Paris Carbone and Asterios Katsifodimos. Prior to that, he received his B.Sc. degree at Baden-Württemberg Cooperative State University (DHBW Stuttgart) and worked several years at IBM in Germany and the USA. Jonas is an alumnus of "Software Campus", "Studienstiftung des deutschen Volkes" and "Deutschlandstipendium". All publication and supplementary material are available on http://www.user.tu-berlin.de/powibol/.
Speaker: Andreas Kunft
Title: Efficient Matrix Partitioning Through Joins
Abstract: End-to-end machine learning pipelines often consist of (i) relational operators to join the input data, (ii) user defined functions for feature extraction and vectorization, and (iii) linear algebra operators for model training and cross-validation. Often, these pipelines need to scale out to large datasets. In this case, these pipelines are usually implemented on top of dataflow engines like Hadoop, Spark, or Flink. Dataflow engines implement relational operators on row-partitioned datasets. However, efficient linear algebra operators use block-partitioned matrices. As a result, pipelines combining both kinds of operators require rather expensive changes to the physical representation, in particular re-partitioning steps. In this talk, I investigate the potential of reducing shuffling costs by fusing relational and linear algebra operations into specialized physical operators. I present BlockJoin, a distributed join algorithm which directly produces block-partitioned results. To minimize shuffling costs, BlockJoin applies database techniques known from columnar processing, such as index-joins and late materialization, in the context of parallel dataflow engines.
Bio: Andreas Kunft is a PhD student/research associate at Technische Universität Berlin in the Database Systems and Information Management Group (DIMA) led by Volker Markl. His research interests include massive parallel processing frameworks, databases, and compilers, with focus on the integration of database and compiler techniques for holistic optimization of analytics pipelines.
Speaker: Martin Kiefer
Title: Approximate Data Analysis using Modern Hardware
Abstract: Data summaries are an effective tool to balance efficiency and accuracy in data analysis tasks: Most relational query optimizers rely heavily on selectivity estimates computed from histograms or samples. Sketches have become an increasingly popular technique to perform high-bandwidth stream analysis tasks such as change or heavy-hitter detection. Using modern hardware such as GPUs and FPGAs reduces the cost of approximate
(4/5, 12:00pm): Tingjian Ge (UMass Lowell): Dense regions: a different type of event in graph streams details)TITLE
Dense regions: a different type of event in graph streams
A graph stream, as a special data stream, consists of a sequence of edge insertions and deletions over a dynamic graph. Analogous to general data streams, complex events have been studied on graph streams which incorporate both time order and structural relationship over basic edge events. In this talk, I will propose a new type of complex event – the occurrence of a dense region that also lasts for a long time. This turns out to be very useful in applications of various domains including telecommunication hotspot detection, road traffic control, spam network filtering, and dynamic community detection.
To efficiently discover dense regions, we propose an algorithmic framework called the Expectation-Maximization with a Utility function (EMU), a novel stochastic approach that nontrivially extends the conventional EM. We validate our EMU approach by showing that it converges to the optimum—by proving that it is a specification of the general Minorization-Maximization (MM) framework with convergence guarantees. We then devise EMU algorithms for the densest lasting subgraph problem. Using real-world graph streams, we experimentally verify the effectiveness and efficiency of our techniques.
Tingjian Ge is an associate professor in Computer Science at the University of Massachusetts, Lowell. He received a Ph.D. from Brown University in 2009. Prior to that, he got his Bachelor’s and Master’s degrees in Computer Science from Tsinghua University and UC Davis, respectively, and worked at Informix and IBM for six years. His research areas are in data management and analytics, with a recent focus on applying machine learning, AI, and algorithmic techniques in data management and mining. He is a recipient of the NSF CAREER Award in 2012, and a Teaching Excellence Award at UMass Lowell in 2014.
(9/13, 10:40am): WVH 366: Anshumali Shrivastava (Rice University): Hashing Algorithms for Extreme Scale Machine Learning details)TITLE
Hashing Algorithms for Extreme Scale Machine Learning.
In this talk, I will discuss some of my recent and surprising findings on the use of hashing algorithms for large-scale estimations. Locality Sensitive Hashing (LSH) is a hugely popular algorithm for sub-linear near neighbor search. However, it turns out that fundamentally LSH is a constant time (amortized) adaptive sampler from which efficient near-neighbor search is one of the many possibilities. Our observation adds another feather in the cap for LSH. LSH offers a unique capability to do smart sampling and statistical estimations at the cost of few hash lookups. Our observation bridges data structures (probabilistic hash tables) with efficient unbiased statistical estimations. I will demonstrate how this dynamic and efficient sampling beak the computational barriers in adaptive estimations where, for the first time, it is possible that we pay roughly the cost of uniform sampling but get the benefits of adaptive sampling. We will demonstrate the power of one simple idea for three favorite problems 1) Partition function estimation for large NLP models such as word2vec, 2) Adaptive Gradient Estimations for efficient SGD and 3) Sub-Linear Deep Learning with Huge Parameter Space.
In the end, if time permits, we will switch to memory cost show a simple hashing algorithm that can shrink memory requirements associated with classification problems exponentially! Using our algorithms, we can train 100,000 classes with 400,000 features, on a single Titan X while only needing 5% or less memory required to store all the weights. Running a simple logistic regression on this data, the model size of 320GB is unavoidable.
Anshumali Shrivastava is an assistant professor in the computer science department at Rice University. His broad research interests include randomized algorithms for large-scale machine learning. He is a recipient of National Science Foundation (NSF) CAREER Award, a Young Investigator Award from Air Force Office of Scientific Research (AFOSR), and machine learning research award from Amazon. His research on hashing inner products has won Best Paper Award at NIPS 2014 while his work on representing graphs got the Best Paper Award at IEEE/ACM ASONAM 2014. Anshumali got his PhD in 2015 from Cornell University.
(9/26, 12:00pm): Spyros Blanas (Ohio State University): Scaling database systems to high-performance computers details)TITLE
Scaling database systems to high-performance computers
We are witnessing the increasing use of warehouse-scale computers to analyze massive datasets quickly. This poses two challenges for database systems. The first challenge is interoperability with established analytics libraries and tools. Massive datasets often consist of images (arrays) in file formats like FITS and HDF5. To analyze such datasets users turn to domain-specific libraries and deep learning frameworks, and thus write code that directly manipulates files. We will first present ArrayBridge, an open-source I/O library that allows SciDB, TensorFlow and HDF5-based programs to co-exist in a pipeline without converting between file formats. With ArrayBridge, users benefit from the optimizations of a database system without sacrificing the ability to directly manipulate data through the existing HDF5 API when they want to.
The second challenge is scalability, as warehouse-scale computers expose communication bottlenecks in foundational data processing operations. This talk will focus on data shuffling and parallel aggregation. We will first present an RDMA-aware data shuffling algorithm that transmits data up to 4X faster than MPI. This is achieved by switching to a connectionless, datagram-based network transport layer that scales better but requires flow control in software. We will then present a parallel aggregation algorithm for high-cardinality aggregation that carefully schedules data transmissions to avoid unscaleable all-to-all communication. The algorithm leverages similarity to transmit less data over congested network links. We will conclude by highlighting additional challenges that need to be overcome to scale database systems to massive computers.
Spyros Blanas is an assistant professor in the Department of Computer Science and Engineering at The Ohio State University. His research interest is high performance database systems, and his current goal is to build a database system for high-end computing facilities. He has received the IEEE TCDE Rising Star award and a Google Research Faculty award. He completed his Ph.D. at the University of Wisconsin–Madison where part of his Ph.D. dissertation was commercialized in Microsoft SQL Server as the Hekaton in-memory transaction processing engine.
(9/28, 12:00pm): Stratos Idreos (Harvard): The periodic table of data structures details)TITLE
The periodic table of data structures.
Data structures are critical in any data-driven scenario, and they define the behavior of modern data systems and data-driven algorithms. However, they are notoriously hard to design due to a massive design space and the dependence of performance on workload and hardware which evolve continuously.
What if we knew how many and which data structures are possible to design? What if we could compute the expected performance of a data structure design on a given workload and hardware without having to implement it and without even having access to the target machine? We will discuss our quest for 1) the first principles of data structures, 2) design continuums that make it possible to automate design, and 3) self-designing systems that can morph between what we now consider fundamentally different structures. We will draw examples from the NoSQL key-value store design space and discuss how to accelerate them and balance space-time tradeoffs.
Stratos Idreos is an assistant professor of Computer Science at Harvard University where he leads DASlab, the Data Systems Laboratory. Stratos was awarded the 2011 ACM SIGMOD Jim Gray Doctoral Dissertation award and the 2011 ERCIM Cor Baayen award. He is also a recipient of an IBM zEnterpise System Recognition Award, a VLDB Challenges and Visions best paper award and an NSF Career award. In 2015 he was awarded the IEEE TCDE Rising Star Award from the IEEE Technical Committee on Data Engineering for his work on adaptive data systems.
Prior recorded talk at UW:
(10/4, 4:30pm): Angelika Kimmig (Cardiff University): A Collective, Probabilistic Approach to Schema Mapping details)TITLE
A Collective, Probabilistic Approach to Schema Mapping
We propose a probabilistic approach to the problem of schema mapping. Our approach is declarative, scalable, and extensible. It builds upon recent results in both schema mapping and probabilistic reasoning and contributes novel techniques in both ﬁelds. We introduce the problem of mapping selection, that is, choosing the best mapping from a space of potential mappings, given both metadata constraints and a data example. As selection has to reason holistically about the inputs and the dependencies between the chosen mappings, we deﬁne a new schema mapping optimization problem which captures interactions between mappings. We then introduce Collective Mapping Discovery (CMD), our solution to this problem using state-of-the-art probabilistic reasoning techniques, which allows for inconsistencies and incompleteness. Using hundreds of realistic integration scenarios, we demonstrate that the accuracy of CMD is more than 33% above that of metadata-only approaches already for small data examples, and that CMD routinely ﬁnds perfect mappings even if a quarter of the data is inconsistent.
Angelika Kimmig is a Lecturer at Cardiff University, UK. She obtained her Ph.D. from KU Leuven, Belgium, and was a postdoctoral fellow at KU Leuven and the University of Maryland, College Park. Her research interests include symbolic AI, reasoning under uncertainty, machine learning, logic programming, and especially combinations thereof such as probabilistic programming and statistical relational learning. She is a key contributor to the probabilistic logic programming language ProbLog.
PAPER from ICDE 2017:
(10/26, 12:00pm): Mania Abdi (Northeastern): D3N: A multi-layer cache for improving big-data applications’ performance in data centers with imbalanced networks details)TITLE
D3N: A multi-layer cache for improving big-data applications’ performance in data centers with imbalanced networks
Caching methods for improving the performance of datalakes for throughput-bound big-data jobs assume unlimited bandwidth across the datacenter. However, most enterprise and academic datacenters grow organically and have heavily oversubscribed networks between different computer clusters. This paper describes D3N, an architecture that caches data on the access side of potential network bottlenecks and uses a multilayer approach. In the case of a two layer cache, reuse distances are dynamically computed along with cache miss costs to determine the partitioning of the cache into L1 (used for local accesses) and L2 (used for remote access, but within same access side) layers. We have implemented a prototype of D3N by modifying Ceph’s RADOS gateway. Micro and macro evaluations of the prototype demonstrate that the implementation is performant enough to saturate the (40,Gbit/s) NICs and (5 GB/s read) SSD of our caching server and deliver substantial storage bandwidth improvements for real workloads. Numerical models show that a multi-level, dynamic cache can have substantial advantages over today’s single-level caches when bandwidth is constrained.
Mania Abdi is a PhD student at the Northeastern University Solid State Storage research group. Prior to that, she was a software engineer. She received her MSc. in Computer Engineering at the Sharif University of Technology and B.Eng. in Computer Engineering at the Amirkabir University of Technology. Mania is a computer systems researcher with a storage focus and has worked on a broad set of topics, including distribted storage, caching, data center debuging, and end-to-end tracing.
(11/7, 12:00pm): Laurel Orr (University of Washington): Probabilistic Database Summarization for Interactive Data Exploration details)TITLE:
Probabilistic Database Summarization for Interactive Data Exploration
A fundamental assumption of traditional DBMSs is that the database contains all information necessary to answer a query; i.e., the database contains the entire universe of data. Many data scientists, however, do not have access to the universe of data and instead rely on samples to answer queries. These data scientists need to use tools outside of the database or alter their queries to correctly reweight and debias their samples to get a more accurate query answer. This talk presents preliminary research into building the first open world database system (OWDB) that inherently assumes relations are samples from some universe, even if the sampling mechanism is unknown. We will discuss two different approaches for building an OWDB: probabilistic modeling of the universe and sample reweighting. We then present EntropyDB, a system that takes the former approach to summarize a database, and we lastly discuss ongoing research into the later approach using Bayesian Networks. We conclude with discussing the open challenges in developing an OWDB.
Laurel Orr is a 6th year PhD student in the Database Group at the Paul G Allen School for Computer Science and Engineering. Her research interests are data summarization, approximate query processing, data exploration, and the general development of tools and techniques to aid data scientists in their data investigation and analysis pipeline. Her current research goal is to develop a prototype open world database system that inherently treats relations a samples drawn from some larger universe of data, even when the sampling mechanism is unknown. She was a 2018 NCWIT Collegiate Award Honorable Mention and was awarded a NSF Graduate Research Fellowship in 2015.
(11/9, 12:00pm): Raul Castor Fernandez (MIT): Aurum: A Data Discovery System details)TITLE
Aurum: A Data Discovery System
Organizations store data in hundreds of different data sources, including relational databases, files, and large data lake repositories. These data sources contain valuable information and insights that can be beneficial to multiple aspects of modern data-driven organizations. However, as more data is produced, our ability to use it reduces dramatically, as no single person knows about all the existent data sources. One big challenge is to discover the data sources that are relevant to answer a particular question. Aurum is a data discovery system to answer "discovery queries" on large volumes of data. In this talk, I'll motivate the data discovery problem with use cases from different industries. I will describe Aurum's design and I will talk a bit about a new research project that aims to discover data beyond tables.
In my research, I build high-performance and scalable systems to discover, prepare and process data. I'm a postdoctoral researcher at MIT, working with Sam Madden and Mike Stonebraker. Before, I completed my PhD at Imperial College London with Peter Pietzuch.
(11/14, 12:00pm): Lawson Wong (Northeastern): Abstraction in robotics details)TITLE
Abstraction in robotics
Robotics is a big data problem. To make sense of the physical world, perform tasks well, and generalize across environments, robots need to represent and understand the world at the "correct" level of abstraction. What "correct" should mean remains to be seen.
In this talk, I will describe two lines of work that attempt to answer this question from very different perspectives. I will first discuss work on grounding natural language instructions to robot behavior, where we demonstrate that having the right representations can enable human-robot communication. This is an important problem for robotics, since we envision users using natural language to instruct robots to perform a wide variety of tasks. In the second half, I will discuss recent preliminary work on the theoretical foundations of state abstraction in reinforcement learning, a common framework used in robot learning problems. In particular, we view state abstraction as data compression, and apply results in information theory (rate-distortion theory) to the reinforcement learning setting.
Time permitting, I will describe some extensions to the above work, as well as other abstraction-related problems, that I envision my group will pursue at Northeastern.
Lawson L.S. Wong is an assistant professor in the College of Computer and Information Science at Northeastern University. His research focuses on learning, representing, and estimating knowledge about the world that an autonomous robot may find useful. Prior to Northeastern, Lawson was a postdoctoral fellow at Brown University. He completed his PhD at the Massachusetts Institute of Technology. He has received a Siebel Scholarship, AAAI Robotics Student Fellowship, and Croucher Foundation Fellowship for Postdoctoral Research.
Sequence-to-Sequence Language Grounding of Non-Markovian Task Specifications
Nakul Gopalan, Dilip Arumugam, Lawson L.S. Wong, Stefanie Tellex
Robotics: Science and Systems (2018)
Grounding Natural Language Instructions to Semantic Goal Representations for Abstraction and Generalization
Dilip Arumugam, Siddharth Karamcheti, Nakul Gopalan, Edward C. Williams, Mina Rhee, Lawson L.S. Wong, Stefanie Tellex
Autonomous Robots (in press; 2018)
Extended journal version of Robotics: Science and Systems (2017) paper
State Abstraction as Compression in Apprenticeship Learning
David Abel, Dilip Arumugam, Kavosh Asadi, Yuu Jinnai, Michael L. Littman, Lawson L.S. Wong
Preprint available on request (by e-mail)
To appear in AAAI Conference on Artificial Intelligence (2019)
(11/16, 12:00pm): Alexandra Meliou (UMass Amherst): Creating a Higher-Quality Data World details)TITLE
Diagnoses and Explanations: Creating a Higher-Quality Data World
The correctness and proper function of data-driven systems and applications relies heavily on the correctness of their data. Low quality data can be costly and disruptive, leading to revenue loss, incorrect conclusions, and misguided policy decisions. Improving data quality is far more than purging datasets of errors; it is critical to improve the processes that produce the data, to collect good data sources for generating the data, and to address the root causes of problems.
Our work is grounded on an important insight: While existing data cleaning techniques can be effective at purging datasets of errors, they disregard the fact that a lot of errors are systemic, inherent to the process that produces the data, and thus will keep occurring unless the problem is corrected at its source. In contrast to traditional data cleaning, we focus on data diagnosis: explaining where and how the errors happen in a data generative process. I will describe our work on Data X-Ray and QFix, two diagnostic frameworks for large-scale extraction systems and relational data systems. I will also provide a brief overview of new results on knowledge augmentation and explanations for dataset differences, building towards a vision for toolsets that assist the exploration of information in a varied, diverse, and highly non-integrated data world.
Alexandra Meliou is an Assistant Professor in the College of Information and Computer Sciences, at the University of Massachusetts, Amherst. Prior to that, she was a Post-Doctoral Research Associate at the University of Washington. Alexandra received her PhD degree from the Electrical Engineering and Computer Sciences Department at the University of California, Berkeley. She has received recognitions for research and teaching, including a CACM Research Highlight, an ACM SIGMOD Research Highlight Award, an ACM SIGSOFT Distinguished Paper Award, an NSF CAREER Award, a Google Faculty Research Award, and a Lilly Fellowship for Teaching Excellence. Her research focuses on data provenance, causality, explanations, data quality, and algorithmic fairness.
(12/7, 11:30am): Fei Chiang (McMaster University): Contextual and Spatio-temporal Data Cleaning details)TITLE
Contextual and Spatio-temporal Data Cleaning
It is becoming increasingly difficult for organizations to reap value from their data due to poor data quality. This is motivated by the observation that real data is rarely error free, containing incomplete, inconsistent, and stale values. This leads to inaccurate, and out-of-date data analysis downstream. Addressing data inconsistency requires not only reconciling differing syntactic references to an entity, but it is often necessary to include domain expertise to correctly interpret the data. For example, understanding that a reference to ‘jaguar’ may be interpreted as an animal or as a vehicle. Secondly, having up-to-date (or current) data is important for timely data analysis. Cleaning stale values goes beyond just relying on timestamps, especially when timestamps may be missing, inaccurate or incomplete.
In this talk, I will present our work towards achieving consistent and up-to-date data. First, I will discuss contextual data cleaning that uses a new class of data integrity constraints that tightly integrate domain semantics from an ontology. Second, we argue that data currency is a relative notion based on individual spatio-temporal update patterns, and these patterns can be learned and predicted. I will present our framework to achieve these two objectives, and provide a brief overview of recent extensions with applications to knowledge fusion.
Fei Chiang is an Assistant Professor in the Department of Computing and Software at McMaster University. She is a Faculty Fellow at the IBM Centre for Advanced Studies, and served as an inaugural Associate Director of the McMaster MacData Institute. She received her M. Math from the University of Waterloo, and B.Sc and PhD degrees from the University of Toronto, all in Computer Science. Her research interests are in data quality, data cleaning, data privacy and text mining. She holds four patents for her work in self-managing database systems. Her work has been featured in the Southern Ontario Smart Computing Impact Report. She is a recipient of the Dean’s Teaching Honour Roll, and a 2018 Ontario Early Researcher Award.
(12/12, 12:00pm): Erkang Zhu (University of Toronto): Get Your Data Together! Algorithms for Managing Data Lakes details)TITLE
Get Your Data Together! Algorithms for Managing Data Lakes
Data lakes (e.g., enterprise data catalogs and Open Data portals) are data dumps if users cannot find and utilize the data in them. In this talk, I present two problems in massive, dynamic data lakes: 1) searching for joinable tables without a precomputed join graph, and 2) joining tables from different sources through auto-generated syntactic transformation on join values. I will also present two algorithmic solutions that can be used for data lakes that are large both in the number of tables (millions) and table sizes. The presented work has been published in SIGMOD and VLDB.
Erkang (Eric) Zhu is a 5th year computer science PhD candidate at University of Toronto. His supervisor is Prof. Renée J. Miller. His research focuses on data discovery, large-scale similarity search, and randomized algorithms (data sketches).
Erkang Zhu, Fatemeh Nargesian, Ken Q. Pu, Renée J. Miller: LSH Ensemble: Internet-Scale Domain Search. PVLDB 9(12): 1185-1196 (2016)
Erkang Zhu, Ken Q. Pu, Fatemeh Nargesian, Renée J. Miller: Interactive Navigation of Open Data Linkages. PVLDB 10(12): 1837-1840 (2017)
Spring / Summer 2018
- (7/25, noon): Wolfgang Gatterbauer: Oblivious Bounds for the probability of Non-monotone Boolean functions (pdf)
- (7/24, noon): Renée Miller: Open Data Integration (pdf), (pdf)
- (7/13, noon): Arjit Khan: Data management for emerging problems in large networks
- (6/22, noon): Magy Seif El-Nasr: Modeling Player Behaviors through Game Data
- (6/1, 10:30am, 366 WVH): Raymond Wong: Big data analytics on big spatial databases
- (4/27, 10am): Clemens Heitzinger: Computational Bayesian Estimation with Applications in Sensors and Tomography
- (3/28, noon): Ravi Sundaram : no free lunch, succinct data structures, information theory lower bounds
- (3/21, noon): Xiaofeng and Rundong: SIGMOD and WWW practice talks
- (3/14, 10am): Ravi Sundaram : A case for learned index structures (pdf)
- (2/28, noon): Ehsan Elhamifar : Subset Selection and Summarization in Sequential Data (pdf)
- (2/14, noon): Ruiyang Xu: Evaluating Player Skill and Position Difficulty in Sequential Two-Person Games with Game Outcome Prediction demonstrated with the Gamification of an Optimization Problem
- (1/31, noon): Casper Harteveld : Studycrafter / Wolfgang Gatterbauer : Bootstrapping Virtuous Learning Cycles (Youtube)
- (1/19, all day): Northeast Database day 2018 : Come and see our talk and posters!
- (1/18, 3pm, Forsyth #97): Dan Suciu : Rethinking Query Execution on Big Data
- (1/12, noon, 366 WVH): Guoliang Li : Human-in-the-Loop Data Integration
This semester the data lab seminar is combined with in our special topics class (CS 7290: Special topics: Foundations in scalable data management) and takes place every Tuesday 11:45am-1:25pm and Thursday 2:50-4:30pm in Ryder Hall 126 (map).
- (11/16): Stratis Ioannidis : Distributing Frank-Wolfe via Map-Reduce (pdf)
- (11/9): Niccolo Meneghetti : Beta Probabilistic Databases: A Scalable Approach to Belief Updating and Parameter Learning (pdf)
- (10/12): Georgia Koutrika : User analytics for recommender systems
- (9/28): Jon Ullman : Differential privacy and data exploration
- (9/26): Cibele Freire : The complexity of resilience and responsibility for self-join-free conjunctive queries (pdf)