# Activities

## DATA Lab Seminar

For summer and fall 2022 we continue our seminars mostly virtually via Zoom, yet occasionally in WVH 462 (map) and then for brown bag lunch. If you are interested to join the seminars, please subscribe to our DATA Lab talks email list or DATA Lab talks calendar. Zoom links are only announced via email. When you sign up to the DATA Lab talks email list, please use your name or write us a short note so we know who you are before we add you to the announcements.

### Fall 2022 (Nikos)

• (10/7, 1:00pm): Philippe Cudre-Mauroux (University of Fribourg) details)

• (11/4, 1:00pm): Julia Stoyanovich (NYU): Fairness in Ranking: from values to technical choices and back details)

TITLE
Fairness in Ranking: from values to technical choices and back

ABSTRACT
Algorithmic rankers take a collection of candidates as input and produce a ranking (permutation) of the candidates as output. In the past few years, there has been much work on incorporating fairness requirements into rankers, with contributions from the data management, algorithms, information retrieval, and recommender systems communities. Many fair ranking methods establish their own understanding of fairness, yet they stop short of categorizing themselves with respect to the normative frameworks they embed. This makes it difficult to understand what values these methods encode, in what specific scenarios they are applicable, and how they relate to one another. Further, methods differ on which technical choices they make, including, notably, their data representation choices. These technical differences are not value-neutral, and have profound implications on the applicability of the methods. In my talk, I will give an overview of the field of fairness in raking, offering a perspective that connects formalizations and algorithmic approaches across subfields. My perspective will be based on the interplay between the value frameworks that motivate specific fairness-enhancing interventions, and the technical choices that impact the properties of the methods and of their results. I will compare and contrast several representative fair ranking methods, and will provide concrete recommendations for those wishing to incorporate fairness objectives into algorithmic rankers. My talk will be based on a two-part survey that I co-authored with Meike Zehlike and Ke Yang.

RELATED PAPERS
The survey was recently published in ACM Computing Surveys, and is available at these links:
1. Part I https://dl.acm.org/doi/abs/10.1145/3533379
2. Part II https://dl.acm.org/doi/10.1145/3533380

BIO
Julia Stoyanovich is an Institute Associate Professor of Computer Science & Engineering at the Tandon School of Engineering, Associate Professor of Data Science at the Center for Data Science, and Director of the Center for Responsible AI at New York University (NYU). Her research focuses on responsible data management and analysis: on operationalizing fairness, diversity, transparency, and data protection in all stages of the data science lifecycle. She established the "Data, Responsibly" consortium and served on the New York City Automated Decision Systems Task Force, by mayoral appointment. Julia developed and has been teaching courses on Responsible Data Science at NYU, and is a co-creator of an award-winning comic book series on this topic. In addition to data ethics, Julia works on the management and analysis of preference and voting data, and on querying large evolving graphs. She holds M.S. and Ph.D. degrees in Computer Science from Columbia University, and a B.S. in Computer Science and in Mathematics & Statistics from the University of Massachusetts at Amherst. She is a recipient of an NSF CAREER award and a Senior Member of the ACM.

### Summer 2022 (Wolfgang)

• (5/17, 1:00pm): Roee Shraga (Northeastern University): HumanAL: Calibrating Human Matching Beyond a Single Task details)

TITLE:
HumanAL: Calibrating Human Matching Beyond a Single Task

ABSTRACT
This work offers a novel view on the use of human input as labels, acknowledging that humans may err. We build a behavioral profile for human annotators which is used as a feature representation of the provided input. We show that by utilizing black-box machine learning, we can take into account human behavior and calibrate their input to improve the labeling quality. To support our claims and provide a proof-of-concept, we experiment with three different matching tasks, namely, schema matching, entity matching and text matching. Our empirical evaluation suggests that the method can improve the quality of gathered labels in multiple settings including cross-domain (across different matching tasks).

RELEVANT PAPER:
https://arxiv.org/abs/2205.03209

SHORT BIO
Roee Shraga is a Postdoctoral fellow at the Khoury College of Computer Science at Northeastern University in Boston. He received his PhD degree in 2020 in the area of Data Science. Roee has published more than a dozen papers in leading journals and conferences and is a recipient of several PhD fellowships including the Miriam and Aaron Gutwirth Memorial Fellowship (2020).

• (6/3, 12:00pm): Mike Cafarella (MIT): Data Systems for Economic Discovery details)

TITLE
Data Systems for Economic Discovery

ABSTRACT
Macroeconomics resembles a data science task that predates modern data science, and arguably predates digital computing as well. It is possible to make economics easier and to make novel discovery through application of modern data systems and data science methods. This talk will describe several areas for data management to make contributions to economics, in particular recent work on the crucial task of creating high-quality price indices, which describe inflation and consumption.

BIO
Michael Cafarella is a Principal Research Scientist at MIT CSAIL. His research interests include databases, information extraction, data integration, and data mining. He has published extensively in venues such as SIGMOD, VLDB, and elsewhere. Mike received his PhD from the University of Washington, Seattle, in 2009 with advisors Oren Etzioni and Dan Suciu. From 2010 to 2020, he was on the faculty of Computer Science and Engineering at the University of Michigan. His academic awards include the NSF CAREER award, the Sloan Research Fellowship, and the 2018 VLDB Ten-Year Best Paper award. In addition to his academic work, he costarted (with Doug Cutting) the Hadoop open-source project, which is widely used at Facebook, Yahoo!, and elsewhere. In 2015 he cofounded (with Chris Re and Feng Niu) Lattice Data, Inc, which is now part of Apple.
https://www.csail.mit.edu/person/michael-cafarella

• (7/7, 12:00pm): Ioanna Tsalouchidou (Amazon): Scalable Dynamic Graph Summarization details)

TITLE
Scalable Dynamic Graph Summarization

ABSTRACT
Large-scale dynamic interaction graphs can be challenging to process and store, due to their size and the continuous change of communication patterns between nodes. In this work, we address the problem of summarizing large-scale dynamic graphs, while maintaining the evolution of their structure and interactions. Our approach is based on grouping the nodes of the graph in supernodes according to their connectivity and communication patterns. The resulting summary graph preserves the information about the evolution of the graph within a time window. We propose two online algorithms for summarizing this type of graphs. Our baseline algorithm kC based on clustering is fast but rather memory expensive. The second method we propose, named /LC, reduces the memory requirements by introducing an intermediate step that keeps statistics of the clustering of the previous rounds. Our algorithms are distributed by design, and we implement them over the Apache Spark framework, so as to address the problem of scalability for large-scale graphs and massive streams. We apply our methods to several dynamic graphs, and show that we can efficiently use the summary graphs to answer temporal and probabilistic graph queries.

BIO
Ioanna Tsalouchidou received the PhD degree from the Department of Information and Communication Technologies, University Pompeu Fabra, in 2018. She is currently working at Redshift AWS as a software engineer. Her main interests are data bases, geospatial systems, spatial algorithms and data mining, with an emphasis on large-scale and dynamic graphs.
https://www.amazon.science/author/ioanna-tsalouchidou

• (7/8, 12:00pm): Emanuel Sallinger (TU-Vienna): Fantastic Knowledge Graph Embeddings and How to Find the Right Space for Them details)

TITLE
Fantastic Knowledge Graph Embeddings and How to Find the Right Space for Them:
The Quest for Hybrid Logic- and ML-based Reasoning on Knowledge Graphs

ABSTRACT
In this talk, we are going to start by exploring “fantastic knowledge graph embeddings and how to find the right space for them” [1]. The starting point of our quest, namely that of hybrid logic- and ML-based reasoning in Knowledge Graphs (KGs), is simple:
During the last few years, a large number of KG embedding (KGE) models have been devised for handling machine learning problems for KGs. Interestingly, many models capable of inferring relational patterns, such as symmetry or transitivity, show lower performance in practice than those not able to, with the factors contributing to this largely unknown. In this part, we are going to show how to solve this puzzle, and practically improve models [1].
This brings us to the second part of our quest: considering e.g. symmetry or transitivity mentioned above, we of course know reasoning methodologies that are very effective at handling them, namely logic-based KG reasoners (building upon the scalability of techniques from the data management community). Yet can we bring KGE-based and logic-based methods together in meaningful ways? This is a challenge that a large part of our community is currently looking at, and we present interesting parts of this quest including how to improve logic-based reasoning using embeddings in the Vadalog system [2].
We round up the talk by looking at the general picture of Knowledge Graph Management Systems (KGMSs), with a focus on the Vadalog system, and its various applications including anti-money laundering, hostile takeover prevention during crises, etc., as time allows [3].

RELATED PAPERS
1. ”Fantastic Knowledge Graph Embeddings and How to Find the Right Space for Them”
2. http://ceur-ws.org/Vol-3135/EcoFinKG_2022_paper8.pdf
3. https://www.sciencedirect.com/science/article/abs/pii/S0306437920300351

BIO
Emanuel Sallinger is Assistant Professor at TU Wien and Departmental Lecturer at Oxford University. He leads the Knowledge Graph Lab at TU Wien, the SIG Knowledge Graphs of the Center for AI and ML, and is a member of the Databases and AI group (DBAI). At Oxford University, he is lecturer of the courses “Knowledge Graphs” and “Database Design”. Before joining TU Wien, he directed the VADA (Value-Added Data Systems and Architecture) Laboratory at Oxford University for over six years. The Knowledge Graph Lab (formally “Scalable Reasoning in Knowledge Graphs”) is a WWTF Vienna Research Group, an ERC-sized funding program for creating long-term research groups. The VADA project was an EPSRC programme grant under Georg Gottlob, bringing together the universities of Oxford (directed by Emanuel Sallinger), Manchester (headed by Norman Paton), and Edinburgh (headed by Leonid Libkin).
https://kg.dbai.tuwien.ac.at/

• (7/19, 12:00pm): Matthias Lanzinger (Oxford University): RAISON DATA: Some Perspectives on the Intersection of Data, AI, and Logic details)

TITLE
RAISON DATA: Some Perspectives on the Intersection of Data, AI, and Logic

ABSTRACT
In the RAISON DATA project, we investigate possible interactions between data, AI, and logic. After a brief overview of motivations and goals for the project, followed by the presentation of two recent publications and ongoing follow-up work.

Part I: Deriving value and insights from data is a central task in industry and extensively studied in data management research. While most work in this area has focused on analysis using complex Machine Learning models and other forms of statistical analysis, there is a renewed interest in methods for deriving symbolic rather than statistical knowledge from data. We investigate the complexity of inductively (exactly) learning guarded rules from examples and discuss ongoing work on applying outcomes from this analysis to relational databases.

Part II: Current trends in data management introduce various data from sources that label data points with a degree of certainty. Common examples include data obtained by crowd-sourced data, AI-generated observations, and KG-completion/link detection techniques. These developments raise new challenges for data management and the need for formalisms that can account for degrees of certainty in data. To this end, we investigate Datalog with fuzzy logic semantics. In the accompanying paper, we show the possibility of semantics based on linear programming for one variant of fuzzy logic Datalog and how this provides one way of introducing fuzzy Datalog+/-. Additionally, we discuss ongoing work on extending the framework from the paper and connections to other recent developments in rule-based reasoning.

RELATED PAPERS
* On the Complexity of Inductively Learning Guarded Clauses (AAAI'22)
https://www.aaai.org/AAAI22Papers/AAAI-1260.DraghiciA.pdf
* MV-Datalog+-: Effective Rule-based Reasoning with Uncertain Observations (ICLP'22)
https://arxiv.org/abs/2202.01718

BIO
Matthias Lanzinger is a senior postdoctoral researcher at the University of Oxford. He obtained his Ph.D. from TU Wien under the supervision of Reinhard Pichler. His main research interests are the complexity of reasoning, with a focus on database query languages, as well as interactions between logic and machine learning systems.

• (7/22, 10:00am): Paolo Papotti (EURECOM): Explainable Checking for Factual Claims details)

TITLE
Explainable Checking for Factual Claims

ABSTRACT
Misinformation is an important problem but fact checkers are overwhelmed by the amount of false content that is produced online every day. To support fact checkers in their efforts, we are creating data driven verification methods that assess textual claims and explain their decisions. However, using structured data for facts expressed in natural language is a complex task. In this talk, we first discuss how logical rules can be used to verify claims in a checking task modeled as a probabilistic inference problem. We show that this form of reasoning can also be “learned” by transformer-based language models. Experiments show that both methods enable the labeling of claims with interpretable explanations.

RELATED WORK
* Naser Ahmadi, Joohyung Lee, Paolo Papotti, Mohammed Saeed:
Explainable Fact Checking with Probabilistic Answer Set Programming. TTO 2019
https://arxiv.org/pdf/1906.09198
* Mohammed Saeed, Naser Ahmadi, Preslav Nakov, Paolo Papotti:
RuleBERT: Teaching Soft Rules to Pre-Trained Language Models. EMNLP (1) 2021: 1460-1476
https://arxiv.org/pdf/2109.13006

BIO
Paolo Papotti is an Associate Professor at EURECOM, France since 2017. He got his PhD from Roma Tre University (Italy) in 2007, and had research positions at the Qatar Computing Research Institute (Qatar) and Arizona State University (USA). His research is focused on data integration and information quality. He has authored more than 100 publications and his work has been recognized with two "Best of the Conference" citations (SIGMOD 2009, VLDB 2016), two best demo paper awards at SIGMOD (2015, 2022), and two Google Faculty Research Awards (2016, 2020). He is associate editor for the ACM Journal of Data and Information Quality (JDIQ).
https://www.eurecom.fr/~papotti/

• (7/26, 10:30am): Anton Tsitsulin (Google): Benchmarking graph learning algorithms with synthetic data details)

TITLE
Benchmarking graph learning algorithms with synthetic data

ABSTRACT
Graph learning algorithms have attained state-of-the-art performance on many graph analysis tasks such as node classification, link prediction, and clustering. It has, however, become hard to track the field's burgeoning progress. One reason is due to the very small number of datasets used in practice to benchmark the performance of graph learning algorithms. This shockingly small sample size (~10) allows for only limited scientific insight into the problem. In this talk, we try to address this deficiency with synthetic data. We generate synthetic graphs and study the behaviour of graph learning algorithms in controlled scenarios. We develop a fully-featured synthetic graph generator that allows deep inspection of different models. We argue that synthetic graph generation allows for thorough investigation of algorithms and provides more insights than overfitting on three citation datasets. In the case study, we show how our framework provides insight into unsupervised and supervised graph neural network models. In this talk, we will explore the benefits, challenges, and insights from synthetic data benchmarking.

RELATED PAPERS
* Tsitsulin, Rozemberczki, Palowitch, Perozzi: Synthetic Graph Generation to Benchmark Graph Learning. 2022
https://arxiv.org/abs/2204.01376
* Palowitch, Tsitsulin, Mayer, Perozzi: GraphWorld -- fake graphs bring real insights for GNNs. 2022.
https://arxiv.org/abs/2203.00112
* Tsitsulin, Palowitch, Perozzi, Mueller: Graph Clustering with Graph Neural Networks. 2020
https://arxiv.org/abs/2006.16904

BIO
Anton Tsitsulin has been a research scientist at Google's graph mining team since 2021. His research is in scalable graph learning algorithms, particularly in unsupervised learning and axiomatic representations. Before joining Google, he received his PhD degree from the university of Bonn advised by Prof. Emmanuel Müller.
http://tsitsul.in/

• (7/28, 12:00pm): Bas Ketsman (VUB: Vrije Universiteit Brussel): Datalog with Negation and Monotonicity details)

TITLE
Datalog with Negation and Monotonicity

ABSTRACT
Positive Datalog has several nice properties that are lost when the language is extended with negation. One example is that ﬁxpoints of positive Datalog programs are robust w.r.t. the order in which facts are inserted, which facilitates eﬃcient evaluation of such programs in distributed environments. A natural question to ask, given a (stratiﬁed) Datalog program with negation, is whether an equivalent positive Datalog program exists.

In this context, it is known that positive Datalog can express only a strict subset of the monotone queries, yet the exact relationship between the positive and monotone fragments of semi-positive and stratiﬁed Datalog was previously left open. In this paper, we complete the picture by showing that monotone queries expressible in semi-positive Datalog exist which are not expressible in positive Datalog. To provide additional insight into this gap, we also characterize a large class of semi-positive Datalog programs for which the dichotomy ‘monotone if and only if rewritable to positive Datalog’ holds. Finally, we give best-eﬀort techniques to reduce the amount of negation that is exhibited by a program, even if the program is not monotone.

RELATED WORK
* Ketsman, Koch. Datalog with Negation. ICDT 2020.
https://drops.dagstuhl.de/opus/volltexte/2020/11943/pdf/LIPIcs-ICDT-2020-19.pdf
* Ameloot, Ketsman, Neven, Zinn. Weaker Forms of Monotonicity for Declarative Networking: A More Fine-Grained Answer to the CALM-Conjecture, TODS 2016.
https://doi.org/10.1145/2809784

BIO
Bas Ketsman is assistant professor at the Vrije Universiteit Brussel (VUB). His research focuses on formal aspects of large-scale data systems. He obtained his Ph.D. from Hasselt University in 2017 under the advice of Frank Neven and was a postdoctoral researcher at the Swiss Federal Institute of Technology in Lausanne (EPFL) where he worked with Christoph Koch. Bas' research has been awarded best paper awards at the ACM PODS Conference (the year 2014 and 2015), the 2018 European Association for Theoretical Computer Science distinguished dissertation award, and also the 2019 SIGMOD Jim Gray Doctoral Dissertation Award Honorable Mention.

• (7/29, 12:00pm): Atri Rudra (University of Buffalo): Worst-case Optimal Binary Join Algorithms under General Lp Constraints details)

TITLE
Worst-case Optimal Binary Join Algorithms under General L_p Constraints

ABSTRACT
Worst-case optimal join algorithms have so far been studied in two broad contexts – (1) when we are given input relation sizes [Atserias et al., FOCS 2008, Ngo et al., PODS 2012, Velduizhen et. al, ICDT 2014] (2) when in addition to size, we are given a degree bound on the relation [Abo Khamis et al., PODS 2017]. To the best of our knowledge, this problem has not been studied beyond these two statistics even for the case when input relations have arity (at most) two.

In this paper, we present a worst-case optimal join algorithm when are given  p -norm size bounds on input relations of arity at most two for p ∈ (1,2]. (p = 1 corresponds to relation size bounds and p = ∞ correspond to the degree bounds.) The worst-case optimality holds any ﬁxed p ∈ (2,∞) as well (as long as the join query graph has large enough girth). Our algorithm is simple, does not depend on p (or) the  p -norm bounds and avoids the (large) poly-log factor associated with the best known algorithm PANDA [Abo Khamis et al., PODS 2017] for the size and degree bounds setting of the problem. In this process, we (partially) resolve two open question from [Ngo, 2018 Gems of PODS]. We believe our algorithm has the potential to pave the way for practical worst-case optimal join algorithms beyond the case of size bounds.

RELATED WORK:
https://arxiv.org/pdf/2112.01003

BIO
Atri Rudra is a Professor of Computer Science and Engineering at University at Buffalo, SUNY. Atri received his bachelor's degree from the Indian Institute of Technology, Kharagpur, India in 2000, and his Ph.D. from the University of Washington in 2007. From 2000 to 2002, he was a Research Staff Member at IBM India Research Lab, New Delhi, India.
His current research interests include structured linear algebra, issues at the intersection of society and computing and database algorithms. He is a recipient of an NSF CAREER Award (2009), an HP Labs Innovation Research Award (2010), an ESA Best Paper Award (2010), a UB Exceptional Scholars-Young Investigator Award (2011), PODS Best Paper Awards (2012 and 2016), an IBM Faculty Award (2013), SIGMOD research highlights (2016), ACM PODS Alberto O. Mendelzon Test-of-Time Award (2022) and an ICML Outstanding Paper Runner Up Award (2022). He is a co-editor of the Mozilla Teaching Responsible Computing playbook and has won a UB Teaching Innovation award (2021) and a SUNY Chancellor’s Award for Excellence in Teaching (2022).
https://cse.buffalo.edu/faculty/atri/

• (8/5, 12:00pm): Abolfazl Asudeh (University of Illinois Chicago): Scalable Signal Reconstruction for a Broad Range of Applications details)

TITLE
Scalable Signal Reconstruction for a Broad Range of Applications

ABSTRACT
Signal reconstruction problem (SRP) is an important opti- mization problem where the objective is to identify a solu- tion to an underdetermined system of linear equations that is closest to a given prior. It has a substantial number of applications in diverse areas, such as network traffic engineering, medical image reconstruction, acoustics, astronomy, and many more. Unfortunately, most of the common approaches for solving SRP do not scale to large problem sizes. We propose a novel and scalable algorithm for solving this critical problem. Specifically, we make four major contributions. First, we propose a dual formulation of the problem and develop the Direct algorithm that is significantly more efficient than the state of the art. Second, we show how adapting database techniques developed for scalable similarity joins provides a substantial speedup over Direct. Third, we describe several practical techniques that allow our algorithm to scale—on a single machine—to settings that are orders of magnitude larger than previously studied. Finally, we use the database techniques of materialization and reuse to extend our result to dynamic settings where the input to the SRP changes. Extensive experiments on real- world and synthetic data confirm the efficiency, effective- ness, and scalability of our proposal.

RELATED PAPERS
* Asudeh, Abolfazl, et al. Scalable algorithms for signal reconstruction by leveraging similarity joins. The VLDB Journal 29.2 (2020): 681-707.
* Abolfazl Asudeh, et al. Scalable Signal Reconstruction for a Broad Range of Applications. Communications of the ACM (CACM), Vol. 64(2), pages 106–115, 2021, ACM.
https://dl.acm.org/doi/pdf/10.1145/3441689
* Ives, Zachary G. "Technical perspective: Solving the signal reconstruction problem at scale." Communications of the ACM 64.2 (2021): 105-105.
https://dl.acm.org/doi/pdf/10.1145/3441688

BIO
Abolfazl Asudeh is an assistant professor of Computer Science at the University of Illinois Chicago and the director of Innovative Data Exploration Laboratory (InDeX Lab). He is a VLDB Ambassador, and a regular PC member of Database flagship venues.
His research spans to different aspects of Big Data and Data Science, including Data Management, Information Retrieval, and Data Mining, for which he designs efficient, accurate, and scalable algorithmic solutions.
Data Equity Systems, Algorithmic Fairness, and Data-centric AI are his major focus in research. His research interest also includes: Ranking, Social Networks Analysis, Large-Scale Computation on Limited Resources, Computational Fact Checking, Data management for Machine Learning, Web Databases, and applied Computational Geometry.
https://asudeh.github.io/

• (8/12, 12:00pm): Cheng Tan (Northeastern University): Constructing Certified Neural Networks for Computer Systems details)

TITLE
Constructing Certified Neural Networks for Computer Systems

ABSTRACT
Deep learning and neural networks are powerful tools. Applying them in computer systems---operating systems, databases, and networked systems---attracts much attention. However, neural networks are complicated black boxes that sometimes produce unexpected results. It is risky to use networks with uncertain behaviors in computer systems that require strict safety rules.

To tame networks' uncertainty, we introduce Ouroboros, a system that constructs certified neural networks. Certified neural networks are neural networks that satisfy user-defined safety properties, named specifications. Ouroboros achieves this through a training-verification loop that combines deep learning training and neural network verification. In addition, Ouroboros supports new categories of specifications that enable developers to specify safety properties for systems applications. With Ouroboros, system developers can have faith in their neural networks.

RELATED WORK
* APSys'21: http://naizhengtan.github.io/doc/papers/building21tan.pdf

SHORT BIO
Cheng Tan is an assistant professor of Khoury College of Computer Sciences at Northeastern University. His research interests are in systems and security, focusing on building verifiable outsourced services and certified neural networks for systems. His work has won the SOSP’17 best paper award and Janet Fabri Prize for Outstanding Dissertation.
https://naizhengtan.github.io/

### Spring 2022 (Mirek / part of cs7280)

• (1/24, 3:00pm): Phil Bernstein (Microsoft Research): Adding Data Management to Orleans – A Journey details)

TITLE
Adding Data Management to Orleans – A Journey

ABSTRACT
In this talk, I'll describe my work over eight years to add database features to the Orleans object-oriented programming framework: replication, geo-distribution, transactions, and indexing. The challenge is how to do it when storage is a plug-in service that you don’t control. In this talk, I’ll describe the journey, summarizing the main technical ideas and recounting the ups and downs of a friendly collaboration with a product group. Along the way, I’ll discuss where database system research problems come from and the special challenges of industrial research.

BIO
Philip A. Bernstein is a Distinguished Scientist at Microsoft Research. Over the past 40 years, he has been a product architect at Microsoft and Digital Equipment Corp., a professor at Harvard University, and a VP Software at Sequoia Systems. From 1994-1996 he was architect of Microsoft Repository, a metadata store for Visual Studio and SQL Server. He has published over 150 papers and two books on the theory and implementation of database systems, especially on transaction processing and data integration, as well as many other database topics. He is an ACM Fellow, a winner of the ACM SIGMOD Innovations Award, a member of the Washington State Academy of Sciences, and a member of the National Academy of Engineering. He received a B.S. degree from Cornell and M.Sc. and Ph.D. from University of Toronto.

• (1/31, 3:00pm): Jonathan Goldstein (Microsoft Research): A.M.B.R.O.S.I.A. - Conferring Immortality on Distributed Applications details)

TITLE
A.M.B.R.O.S.I.A. - Conferring Immortality on Distributed Applications

ABSTRACT
When writing today's distributed programs, which frequently span both devices and cloud services, programmers are faced with complex decisions and coding tasks around coping with failure, especially when these distributed components are stateful. If their application can be cast as pure data processing, they benefit from the past 40-50 years of work from the database community, which has shown how declarative database systems can completely isolate the developer from the possibility of failure in a performant manner. Unfortunately, while there have been some attempts at bringing similar functionality into the more general distributed programming space, a compelling general-purpose system must handle non-determinism, be performant, support a variety of machine types with varying resiliency goals, and be language agnostic, allowing distributed components written in different languages to communicate. This talk describes the first system, publicly available on GitHub, called Ambrosia, to satisfy all these requirements. We coin the term "virtual resiliency", analogous to virtual memory, for the platform feature which allows failure oblivious code to run in a failure resilient manner. We also introduce a programming construct, the "impulse", which resiliently handles non-deterministic information originating from outside the resilient component. Of further interest to our community is the effective reapplication of much database performance optimization technology to make Ambrosia more performant than many of today's non-resilient cloud solutions.

BIO
Over the last 20 years, I have worked at Microsoft in a combination of research and product roles. In particular, I've spent about 20 years as a researcher at MSR, doing fundamental research in streaming, big data processing, databases, and distributed computing. My style of working is to attack difficult problems, and through fundamental understanding and insight, create new artifacts that enable important problems to be solved in better ways. For instance, my work on streaming data processing enabled people with real time data processing problems to specify their processing logic in new, powerful ways, and also resulted in an artifact called Trill, which was orders of magnitude more performant than anything which preceded it. Within the academic community, I have published many papers, some with best paper awards (e.g. Best Paper Award at ICDE 2012), and two with test of time awards (e.g. SIGMOD 2011 Test of Time award and ICDT 2018 Test of Time award), and have also taken many organizational roles in database conferences. My research has also had significant impact on many Microsoft products, including SQL Server, Office, Windows, Bing, and Halo, as well as leading to the creation of entirely new products like Microsoft StreamInsight, Azure Streaming Analytics, Trill, and most recently, Ambrosia. I spent 5 years building Microsoft StreamInsight, serving as a founder and architect for the product. Trill has become the de-facto standard for temporal and stream data processing within Microsoft, and years after creation, is still the most expressive and performant general purpose stream data processor in the world. I am also an inventor of 30+ patents.

• (2/16, 3:00pm): Soheil Behnezhad (Stanford): Large-Scale Graph Algorithms: Massively Parallel Computation details)

TITLE
Large-Scale Graph Algorithms: Massively Parallel Computation

ABSTRACT
Graphs today are massive. Social network graphs, the web graph, and models of the brain are just a few examples of graphs with billions of vertices and trillions of edges. Graphs of this sheer size do not fit a single machine's memory, invalidating many assumptions of traditional algorithms. Massively Parallel Computation (à la MapReduce) is a popular method of processing such large-scale problems which has been extremely successful in practice. The idea is to distribute the workload into several machines running in parallel that communicate over rounds. In this talk, I will give an introduction to two algorithms for maximal matching and graph connectivity in the MPC model. The maximal matching algorithm takes O(log log n) rounds, which exponentially improves upon previous logarithmic-round algorithms known from the 1980s. I will also present a conditionally near-optimal algorithm for graph connectivity in this model.

BIO
Soheil Behnezhad is currently a Motwani Postdoctoral Fellow at Stanford and will join the Khoury College of Computer Sciences at Northeastern as an Assistant Professor in Fall 2022. He obtained his PhD from the University of Maryland. Soheil's research focus is on the theoretical foundations of big data algorithms. He studies algorithms in a variety of large-scale computation settings, including massively parallel computation (MapReduce), graph sparsification, streaming algorithms, and dynamic algorithms.

• (2/23, 3:00pm): Divy Agrawal (UC Santa Barbara): Big Data, Small Summaries details)

ABSTRACT
During the past two decades we have seen an unprecedented increase in the amount of data that is being generated from numerous internet-scale applications. As hundreds of millions to billions of users interact with these applications, there is a continuous flow of interaction or log data that is collected by internet companies hosting these applications. Before this data can be subject to modeling and analysis, it becomes necessary to obtain summary statistics such as the cardinality of unique visitors, frequency counts of users from different states or countries, and in general finding the quantile and median information from the dataset. Efficient algorithms exist for computing the exact information over the data. Unfortunately, these algorithms require a considerable amount of time, scanning the data multiple times, or require additional storage that is linear to the size of the dataset itself. Approximation methods, with guaranteed error bounds, developed in the context of streaming data are extremely effective to extract useful and relatively accurate knowledge from big data. In this talk, we will start with a review of one such approximation technique for estimating frequency counts over streaming data where only insert operations are permitted. We will then present refinement of this technique in the context of large datasets that need to handle both insert and delete operations. We will conclude the presentation with an application of the frequency counting technique for an innovative internet caching architecture in which a small front-end caching at the client side significantly mitigates the load imbalance problem at the back-end cache servers. The meta-objective of this talk is to demonstrate the usefulness of approximation techniques in the context of real applications.

SPEAKER BIOGRAPHY
Divyakant Agrawal is a Professor of Computer Science at the University of California at Santa Barbara. His research expertise is in the areas of distributed systems and databases. He is a fellow of the AAAS, ACM, and IEEE. He has also had visiting appointments at IBM Almaden Research Labs, NEC Research, ASK.com, Google, Qatar Computing Research Institute, and National University of Singapore over the course of his professional career.

• (2/28, 3:00pm): Floris Geerts (U of Antwerp): Matrix query languages details)

TITLE
Matrix query language

ABSTRACT
Imagine that instead of the relational data model and relational algebra,
data would be stored as matrices and queries would consist of linear algebra
operators. How would such a matrix query language look like? What can we
do and cannot do in such a language? And how does it relate to query languages
based on first-order logic or the relational algebra? How do these languages
connect to arithmetic circuits, which are often said to "capture" linear algebra?
In this talk we answer some of these questions.

BIO
Floris Geerts holds a Professor position at the University of Antwerp, Belgium. Before that, he was a senior research fellow in the database group at the University of Edinburgh, and held a postdoc position in the data mining group at the University of Helsinki. He received his PhD in 2001 at the University of Hasselt, Belgium. His research interests include the theory and practice of databases, the study of data quality, and more recently, the interaction between linear algebra, relational databases and graph neural networks.

PAPERS

[2] Floris Geerts, Thomas Muñoz, Cristian Riveros, Domagoj Vrgoc: Expressive Power of Linear Algebra Query Languages. PODS 2021: 342-354. https://dl.acm.org/doi/10.1145/3452021.3458314.

[3] Robert Brijder, Floris Geerts, Jan Van den Bussche, Timmy Weerwag: On the Expressive Power of Query Languages for Matrices. ACM Trans. Database Syst. 44(4): 15:1-15:31 (2019). https://dl.acm.org/doi/10.1145/3331445.

[4] Floris Geerts: On the Expressive Power of Linear Algebra on Graphs. Theory Comput. Syst. 65(1): 179-239 (2021). https://drops.dagstuhl.de/opus/volltexte/2019/10309/pdf/LIPIcs-ICDT-2019-7.pdf.

• (3/2, 3:00pm): Floris Geerts (U of Antwerp): Graph Neural Networks through the eyes of matrix query languages details)

ABSTRACT
Graph neural networks (GNNs) are commonly used to compute graph/vertex
embeddings. The capabilities of GNNs are often measured in terms of their distinguishing
power, that is, their ability to distinguish graphs/vertices based on the computed embeddings.
In this talk we view GNNs as queries in a matrix query language, dubbed "Tensor Language".
We show that by using this language, we can obtain insights on the distinguishing power
of GNNs and that it provides an easy toolbox for GNN designers to analyze their own GNNs.

BIO
Floris Geerts holds a Professor position at the University of Antwerp, Belgium. Before that, he was a senior research fellow in the database group at the University of Edinburgh, and held a postdoc position in the data mining group at the University of Helsinki. He received his PhD in 2001 at the University of Hasselt, Belgium. His research interests include the theory and practice of databases, the study of data quality, and more recently, the interaction between linear algebra, relational databases and graph neural networks.

PAPERS
[1] Floris Geerts, Juan L Reutter: Expressiveness and Approximation Properties of Graph Neural Networks, ICLR 2022.
[2] Martin Grohe: The Logic of Graph Neural Networks. LICS 2021: 1-17.
[3] Martin Grohe: Word2vec, node2vec, graph2vec, X2vec: Towards a Theory of Vector Embeddings of Structured Data. PODS 2020: 1-16.

• (3/7, 3:00pm): Manos Athanassoulis (Boston U): Building Robust LSM-based Key-Value Stores details)

ABSTRACT
Log-structured merge (LSM) trees are becoming the de facto standard for write-intensive storage layers for both production NoSQL data stores and relational systems. As LSM-based systems are used by various applications and deployed in shared infrastructure (e.g., public or private cloud), they are tasked to support a number of requirements varying from performance, to cost, and privacy, being robust to external requirements and the inherent workload unpredictability. At the heart of any LSM-based engine, we have the background re-organization mechanism (or compaction), the behavior of which affects essentially every aspect of the LSM-tree including write amplification, write throughput, point and range lookup performance, space amplification, and delete performance. In this presentation, we will first introduce in detail the design space of LSM compactions and discuss their tradeoffs, we will then take a deep dive on a new family of delete-aware compactions. We will define as a design goal the delete persistence latency, and discuss how to bound it. Finally, we will discuss the importance of tuning in the presence of uncertainty and present a sneak-peek of a new methodology for near-optimal LSM tuning in the presence of uncertainty of the expected workload vs. the observed one.

BIO
Manos Athanassoulis is an Assistant Professor of Computer Science at Boston University, Director and Founder of the BU Data-intensive Systems and Computing Laboratory and co-director of the BU Massive Data Algorithms and Systems Group. His research is in the area of data management focusing on building data systems that efficiently exploit modern hardware (computing units, storage, and memories), are deployed in the cloud, and can adapt to the workload both at setup time and, dynamically, at runtime. Before joining Boston University, Manos was a postdoc at Harvard University, earlier he obtained his PhD from EPFL, Switzerland, and spent one summer at IBM Research, Watson. Manos’ work has been recognized by awards like “Best of SIGMOD” in 2016, “Best of VLDB” in 2010 and 2017, and “Most Reproducible Paper” at SIGMOD in 2017, and has been supported by NSF and industry funds including two Red Hat Incubation Awards, a Facebook Faculty Research Award and gifts from Meta, Cisco, and Red Hat.

RELEVANT PAPERS (especially 1 & 3)
1. Constructing and Analyzing the LSM Compaction Design Space
Proceedings of the VLDB Endowment, Vol. 14(11), 2021
Subhadeep Sarkar, Dimitris Staratzis, Zichen Zhu, Manos Athanassoulis
2. Reducing Bloom Filter CPU Overhead in LSM-Trees on Modern Storage Devices
Proceedings of the International Workshop on Data Management on New Hardware (DaMoN), 2021
Zichen Zhu, Ju Hyoung Mun, Aneesh Raman, Manos Athanassoulis
3. Lethe: A Tunable Delete-Aware LSM Engine
Proceedings of the ACM SIGMOD International Conference on Management of Data, 2020
Subhadeep Sarkar, Tarikul Islam Papon, Dimitris Staratzis, Manos Athanassoulis
4. And a paper under review: https://arxiv.org/abs/2110.13801

• (3/21, 3:00pm): Yael Amsterdamer (Bar-Ilan University): Automated Selection of Multiple Datasets for Extension by Integration details)

ABSTRACT
Organizations often seek to extend their data by integration with available datasets originating from external sources. While there are many tools that recommend how to perform the integration for given datasets, the selection of what datasets to integrate is often challenging in itself. First, the relevant candidates must be efficiently identified among irrelevant ones. Next, relevant datasets need to be evaluated according to issues such as low quality or poor matching to the target data and schema. Last, jointly integrating multiple datasets may have significant benefits such as increasing completeness and information gain, but may also greatly complicate the task due to dependencies in the integration process.
To assist administrators in this task, we quantify to what extent an integration of multiple datasets is valuable as an extension of an initial dataset and formalize the computational problem of finding the most valuable subset to integrate by this measure. We formally analyze the problem, showing that it is NP-hard; we nevertheless introduce heuristic efficient algorithms, which our experiments show to be near-optimal in practice and highly effective in finding the most valuable integration.

SHORT BIO
Yael Amsterdamer is a Senior Lecturer at the Department of Computer Science, Bar-Ilan University (Ramat-Gan, Israel), and the head of the Data Management Lab. She received her Ph.D. in Computer Science from Tel-Aviv University, and has been a visiting Scholar at the University of Pennsylvania, Philadelphia, PA and jointly at Télécom Paris and INRIA institute (Paris, France). Her research is in the field of data management spanning topics such as crowd-powered data management, provenance and interactive summarization. She has won competitive grants such as the Israeli Science Foundation and Ministry of Science and regularly serves in program committees of top conferences.

RELEVANT PAPER
https://dl.acm.org/doi/10.1145/3459637.3482322

• (3/23, 3:00pm): Avigdor Gal (Technion): Loch Data and Other Monsters: on Creating Data Ecosystems, the Intelligent Way details)

TITLE
Loch Data and Other Monsters: on Creating Data Ecosystems, the Intelligent Way

ABSTRACT
A data ecosystem offers an alliance-driven infrastructure that enables the interaction of different stakeholders and the resolution of interoperability issues among shared data. Despite years of research in data governance and management, trustability is still affected by the absence of transparent and traceable data-driven pipelines. Data integration is the main facilitator of such data-driven pipelines and matching is a task at the heart of any data integration process, aimed at identifying correspondences among data elements. Matching problems were traditionally performed in a semi-automatic manner, with correspondences being generated by matching algorithms and outcomes subsequently validated by human experts. Human-in-the-loop data integration has been recently challenged by the introduction of big data and recent studies have analyzed obstacles to effective human matching and validation. In this talk, we focus on the tension between human and machine matching. We propose a novel data ecosystem architecture that relies on both human knowledge and machine learning and offer a concrete algorithmic solution for effective data integration within this architecture. In particular, we shall present the limitations of human matching and offer a method for learning to characterize reliable and valuable matching experts.

SHORT BIO
Avigdor Gal is The Benjamin and Florence Free Chaired Professor of data science at the Technion - Israel Institute of Technology, where he heads the Big Data Integration laboratory. He specializes in various aspects of data management and mining with about 150 publications in leading journals, books, and conference proceedings. In the current age of big data, his research is focused on developing novel models and algorithms for data integration. In the past he gave keynotes and tutorials in leading conferences in the areas of data and process management. Avigdor Gal is a recipient of the prestigious Yannai award for excellence in academic education, and multiple best paper and test-of-time awards.

• (3/28, 3:00pm): Magda Balazinska (U Washington): Video Data Management details)

Abstract:

The proliferation of inexpensive high-quality cameras coupled
with recent advances in machine learning and computer vision have enabled new applications on video data.
This in turn has renewed interest in video data management systems (VDMSs).
In this talk, we explore how to build a modern data management system
for video data. We focus, in particular, on the storage manager
and present several techniques to store video data in a way that accelerates
queries over that data. We then move up the stack and discuss different
types of data models that can be exposed to applications. Finally, we
discuss the additional challenges of the end-to-end video
analytics pipeline and how a VDMS can support applications throughout
that pipeline.

Bio:

Magdalena Balazinska is Professor and Director of the Paul G. Allen School of Computer Science & Engineering at the University of Washington. Magdalena's research interests are in the field of database management systems. Her current research focuses on data management for data science, big data systems, cloud computing, and image and video analytics. Prior to her leadership of the Allen School, Magdalena was the Director of the eScience Institute, the Associate Vice Provost for Data Science, and the Director of the Advanced Data Science PhD Option. She also served as Co-Editor-in-Chief for Volume 13 of the Proceedings of the Very Large Data Bases Endowment (PVLDB) journal and as PC co-chair for the corresponding VLDB'20 conference. Magdalena is an ACM Fellow. She holds a Ph.D. from the Massachusetts Institute of Technology (2006). Shortly after her arrival at the University of Washington, she was named a Microsoft Research New Faculty Fellow (2007). Magdalena received the inaugural VLDB Women in Database Research Award (2016) for her work on scalable distributed data systems. She also received an ACM SIGMOD Test-of-Time Award (2017) for her work on fault-tolerant distributed stream processing and a 10-year most influential paper award (2010) from her earlier work on reengineering software clones.

• (3/30, 3:00pm): Michael Benedikt (Oxford University): New ideas and old tasks for reasoning and data management details)

TITLE
New ideas and old tasks for reasoning and data management

ABSTRACT
Reasoning is relevant to many aspects of data management, including schema design, data integration, and query optimization. In this talk I give a quick tour of a few major innovative techniques developed in the last decades for applying reasoning to data management. I'll then discuss some of the roadblocks that have been encountered in applying these tools, and mention -- briefly -- some of our current projects to overcome these issues. I'll touch upon work on extending reasoning-based query optimization to complex data types, applying data management-for-reasoning to reasoning-for-data management, mapping rule learning, and applying RL-guided theorem-proving to query optimization.

RELATED PAPERS
1. M. Benedikt. How Can Reasoners Simplify Database Querying (And Why Haven't They Done It Yet)? PODS 2018.
https://doi.org/10.1145/3196959.3196989
2. M. Benedikt, P. Pradic. Generating collection transformations from proofs. POPL 2021.
https://dl.acm.org/doi/10.1145/3434295

BIO
Michael Benedikt is Professor of Computer Science at Oxford University. Prior to that he was distinguished member of technical staff at Bell Laboratories.

• (4/11, 3:00pm): Joe Hellerstein (UC Berkeley): A Programmable Cloud: CALM Foundations and Open Challenges details)

ABSTRACT:
The public cloud emerged a decade ago, yet distributed systems are still programmed using models from sequential computing. All the traditional challenges of distributed programming and data are still present in the cloud, only they are now faced by the general population of software developers. Added to these challenges are new desires for "serverless" computing, including consumption-based pricing and autoscaling.

This talk will highlight principles for cloud programming that I have explored with colleagues over the past decade, including the CALM Theorem and languages like Dedalus and Bloom that encourage monotonic coordination-free consistency via logic and lattices. The Anna "any-scale" KVS will be presented as a petri dish for the potential of these ideas and many remaining challenges.

Time permitting, I will conclude by overviewing new work in the Hydro project, which is aimed at bringing research ideas to programmers in an practical, evolutionary fashion. Key to our approach is a separation of distributed programs into a PACT of four facets: Program semantics, Availablity, Consistency and Trust. We propose to migrate developers gradually to PACT programming by lifting familiar code into our more declarative level of abstraction. This agenda raises challenges across multiple areas including language design, query optimization, transactions, distributed consistency, compilers and program synthesis.

BIO:
Joe Hellerstein is the Jim Gray Professor of Computer Science at the University of California, Berkeley, and currently on leave as a Faculty Fellow at Sutter Hill Ventures. His research focuses on data-centric systems and the way they drive computing. In addition to his academic work, Hellerstein has been involved in a number of startup companies including Trifacta, which brought academic research on data wrangling to market.

Relevant papers:

• (4/13, 3:00pm): Reinhard Pichler (TU Vienna): Hypertree Decompositions and all that... details)

Abstract:
In the first part of the talk, I will revisit various notions of decompositions and widths and recapitulate some of their crucial properties. In particular, we thus mention:
*) hypertree-width (hw)
*) generalized hypertree-width (ghw)
*) fractional hypertree-width (fhw)

Time permitting, I will then present some recent results on the task of actually computing such decompositions:
*) NP-hardness of checking if fhw or ghw is at most 2
*) Tractable cases of ghw and fhw computation
*) Parallel approach to computing ghw and hw

Curriculum Vitae:
I got a master's degree from the University of Innsbruck (Mag.rer.nat. in Mathematics, 1991), another master's degree from the University of London (MSc in Mathematical Computation, 1992, at the QMW College), and a PhD from the TU Wien (Dr.techn. in Computer Science, 2000). In May 2001, I received my "habilitation" in Theoretical Computer Science.

From 1992 to 2005, I worked in the Program and Systems Engineering Department (PSE) of the Siemens AG Österreich. Since July 2005, I have been a professor in the Databases and Artificial Intelligence Group (DBAI) of the TU Wien.

Relevant publications:

Wolfgang Fischl, Georg Gottlob, Reinhard Pichler:
General and Fractional Hypertree Decompositions: Hard and Easy Cases. PODS 2018: 17-32.
(full version in J. ACM, 2021)

Wolfgang Fischl, Georg Gottlob, Davide Mario Longo, Reinhard Pichler: HyperBench: A Benchmark and Tool for Hypergraphs and Empirical Findings. PODS 2019: 464-480
(full version in ACM J. Exp. Alg., 2021)

Georg Gottlob, Matthias Lanzinger, Reinhard Pichler, Igor Razgon: Fractional Covers of Hypergraphs with Bounded Multi-Intersection. MFCS 2020: 41:1-41:14

Hubie Chen, Georg Gottlob, Matthias Lanzinger, Reinhard Pichler: Semantic Width and the Fixed-Parameter Tractability of Constraint Satisfaction Problems. IJCAI 2020: 1726-1733

Georg Gottlob, Cem Okulmus, Reinhard Pichler:
Fast and Parallel Decomposition of Constraint Satisfaction Problems. IJCAI 2020: 1155-1162.

Georg Gottlob, Matthias Lanzinger, Cem Okulmus, Reinhard Pichler: Fast Parallel Hypertree Decompositions in Logarithmic Recursion Depth. CoRR abs/2104.13793 (2021)
To appear at PODS 2022

• (4/25, 3:00pm): Huan Sun (Ohio State): Self-supervised Pre-training on Tabular Data details)

Abstract:
Pre-training/fine-tuning paradigms have transformed the natural language processing field. For tabular data-based tasks, however, their potential has been far less explored. In this talk, we will discuss our recent efforts: (1) TURL, a pre-training/fine-tuning paradigm on relational Web tables, which benefits a wide range of tasks for table understanding (e.g., relation extraction, entity linking). (2) StruG, a weakly supervised Structure-Grounded pretraining framework for text-to-SQL, which effectively learns to capture the text-table alignment essential for the task. (3) ReasonBERT, a pre-training method that augments language models for multi-step reasoning over hybrid contexts (textual and tabular). Among them, we will cover TURL in greater detail. Finally, we will conclude the talk with some of the open questions in the space.

Bio:
Huan Sun is an assistant professor in the Department of Computer Science and Engineering at the Ohio State University. Before joining OSU, she was a visiting scientist at the University of Washington (01-06/2016), received a Ph.D. in Computer Science from University of California, Santa Barbara (2015) and a B.S. in EEIS from the University of Science and Technology of China (2010). Her research interests lie in natural language processing, data mining and management, and artificial intelligence, with emphasis on building various kinds of natural language interfaces, task-oriented dialogue and conversational AI systems. Huan received the 2022 SIGMOD Research Highlight Award, 2021 BIBM Best Paper Award, Google Research Scholar Award (2022), NSF CAREER Award (2020), OSU Lumley Research Award (2020), SIGKDD Ph.D. Dissertation Runner-Up Award (2016), among others.

Relevant papers:
Turl: https://arxiv.org/abs/2006.14806 Please feel free to check out more papers on learning neural representations for tables mentioned in this tutorial. Unfortunately, we won’t be able to cover them in this talk.
StruG: https://arxiv.org/abs/2010.12773
ReasonBert: https://arxiv.org/abs/2109.04912

• (4/27, 3:00pm): Yanlei Diao (UMass): UDAO: A Next-Generation Cloud Data Analytics Optimizer via Large-Scale Machine Learning details)

ABSTRACT
Data analytics in the cloud has become an integral part of enterprise businesses. Big data analytics systems, however, still lack the ability to take task objectives such as user performance goals and budgetary constraints and automatically configure an analytical job to achieve these objectives. This talk presents UDAO, a Unified Data Analytics Optimizer, that can automatically determine a cluster configuration with a suitable number of cores as well as other system parameters that best meet the task objectives. At a core of our work is a principled multi-objective optimization (MOO) approach that computes a Pareto optimal set of configurations to reveal tradeoffs between different objectives, recommends a new cluster configuration that best explores such tradeoffs, and employs novel optimizations to enable such recommendations within a second. Such optimization is further enabled by a Deep Learning-based modeling approach that can learn a model for each user objective as complex as necessary for the underlying computing environment. Evaluation using production workloads shows that our optimizer could reduce 36-67% latency and, at the same time, 37-75% cost in an existing cloud analytics system, while running under 200 msec.

BIO
Yanlei Diao is Professor of Computer Science at Ecole Polytechnique, France and the University of Massachusetts Amherst, USA. She recently joined Amazon as an Amazon Scholar. Her research interests lie in big data analytics and scalable intelligent information systems, with a focus on optimization in cloud analytics, data stream analytics, explanation discovery, interactive data exploration, and uncertain data management. She received her PhD in Computer Science from the University of California, Berkeley in 2005.

Prof. Diao was a recipient of the 2016 ERC Consolidator Award, 2013 CRA-W Borg Early Career Award (one female computer scientist selected each year for outstanding contributions), IBM Scalable Innovation Faculty Award, and NSF Career Award. She has given keynote speeches at the ACM DEBS Conference, the ExploreDB workshop, and the Distinguished Lecture Series at the IBM Almaden Research Center, the University of Texas at Austin and Technische Universitaet Darmstadt. She has served as Editor-in-Chief of the ACM SIGMOD Record, Associate Editor of ACM TODS, Chair of the ACM SIGMOD Research Highlight Award Committee, and member of the SIGMOD and PVLDB Executive Committees. She was PC Co-Chair of IEEE ICDE 2017 and ACM SoCC 2016, and served on the organizing committees of SIGMOD, PVLDB, and CIDR, as well as on the program committees of many international conferences and workshops.

### Fall 2021 (Wolfgang)

• (9/2, 12:00pm): Robin Walters (Northeastern University): Equivariant Neural Networks for Learning Spatiotemporal Dynamics details)

TITLE
Equivariant Neural Networks for Learning Spatiotemporal Dynamics

ABSTRACT
Applications such as climate science and transportation require learning complex dynamics from large-scale spatiotemporal data. Existing machine learning frameworks are still insufficient to learn spatiotemporal dynamics as they often fail to exploit the underlying physics principles. Representation theory can be used to describe and exploit the symmetry of the dynamical system. We will show how to design neural networks that are equivariant to various symmetries for learning spatiotemporal dynamics. Our methods demonstrate significant improvement in prediction accuracy, generalization, and sample efficiency in forecasting turbulent flows and predicting real-world trajectories. This is joint work with Rose Yu, Rui Wang, and Jinxi Li.

RELATED PAPERS
* Walters, R.*, Wang, R.*, Yu, R. (2021). Incorporating Symmetry into Deep Dynamics Models for Improved Generalization. International Conference on Learning Representations (ICLR). https://arxiv.org/abs/2002.03061
* Walters, R.*, Li, J.*, Yu R. (2021). Trajectory Prediction using Equivariant Continuous Convolution. International Conference on Learning Representations (ICLR). https://arxiv.org/abs/2010.11344

BIO
Robin Walters is a postdoctoral research fellow in the Khoury College of Computer Sciences. He joined Khoury in July 2020 through the Experiential AI program. Formerly, Robin was a Zelevinsky Research Instructor in the Mathematics Department at Northeastern. His research studies the connections between representation theory and differential equations both theoretically and practically using equivariant neural networks.

• (9/9, 12:00pm): Wolfgang Gatterbauer (Northeastern University): Structure-preserving diagrams for relational queries and first-order logic details)

TITLE
Structure-preserving diagrams for relational queries and first-order logic

ABSTRACT
The relative expressiveness of relational query languages has been studied extensively: the logical expressiveness of particular fragments of relational algebra, relational calculus, Datalog with negation and restricted SQL under set semantics is well known to be equivalent. Yet what does it take to represent logical patterns'' and compare languages by their abilities to represent particular structures of reasoning that are common across various syntactic conventions?
We describe a complete diagrammatic representation system that preserves the logical structure for non-disjunctive safe relational calculus. It also solves a problem that has vexed the logical community for over 100 years: finding an unambiguous and complete representation system for first-order logic sentences.
Topics to discuss: 1. Why are disjunctions inherently more difficult to visualize than conjunctions? 2. Why do visual query languages based on relational algebra lack in their abilities to represent certain structural patterns? 3. A discussion of 3 common abuses of the line" that have created conceptual difficulties in past related work (with a particular focus on the influential and widely studied Peirce's existential beta graphs). 4. Ideas on how to solve the representation problem for disjunctions.

RELATED WORK
https://queryvis.com/

SHORT BIO
Wolfgang Gatterbauer is an Associate Professor in the Khoury College of Computer Sciences at Northeastern University. A major focus of his research is to extend the capabilities of modern data management systems in generic ways and to allow them to support novel functionalities that seem hard at first.
https://gatterbauer.name

• (9/16, 12:00pm): Amir Ilkhechi (Brown University): DeepSqueeze: Deep Semantic Compression for Tabular Data details)

TITLE
DeepSqueeze: Deep Semantic Compression for Tabular Data

ABSTRACT
With the rapid proliferation of large datasets, efficient data compression has become more important than ever. Columnar compression techniques (e.g., dictionary encoding, run-length encoding, delta encoding) have proved highly effective for tabular data, but they typically compress individual columns without considering potential relationships among columns, such as functional dependencies and correlations. Semantic compression techniques, on the other hand, are designed to leverage such relationships to store only a subset of the columns necessary to infer the others, but existing approaches cannot effectively identify complex relationships across more than a few columns at a time.
In this talk, I will describe DeepSqueeze, a novel semantic compression framework that can efficiently capture these complex relationships within tabular data by using autoencoders to map tuples to a lower-dimensional representation. DeepSqueeze also supports guaranteed error bounds for lossy compression of numerical data and works in conjunction with common columnar compression formats. Our experimental evaluation uses real-world datasets to demonstrate that DeepSqueeze can achieve over a 4x size reduction compared to state-of-the-art alternatives.

PAPER and recorded video from SIGMOD 2010:
http://cs.brown.edu/people/acrotty/pubs/3318464.3389734.pdf
https://dl.acm.org/doi/abs/10.1145/3318464.3389734

BIO
Amir Ilkhechi is a Ph.D. student in the Computer Science Department at Brown University, where he is a member of the Database Group advised by Professor Ugur Cetintemel. His research explores applications of deep learning to fundamental data management problems, most recently focusing on novel approaches to compression.

• (9/23, 12:00pm): Roee Shraga (Northeastern University): PoWareMatch: a Quality-aware Deep Learning Approach to Improve Human Schema Matching details)

TITLE
PoWareMatch: a Quality-aware Deep Learning Approach to Improve Human Schema Matching

ABSTRACT
Schema matching is a core task of any data integration process. Being investigated in the fields of databases, AI, Semantic Web and data mining for many years, the main challenge remains the ability to generate quality matches among data concepts (e.g., database attributes). In this work, we examine a novel angle on the behavior of humans as matchers, studying match creation as a process. We analyze the dynamics of common evaluation measures (precision, recall, and f-measure), with respect to this angle and highlight the need for unbiased matching to support this analysis. Unbiased matching, a newly defined concept that describes the common assumption that human decisions represent reliable assessments of schemata correspondences, is, however, not an inherent property of human matchers. In what follows, we design PoWareMatch that makes use of a deep learning mechanism to calibrate and filter human matching decisions adhering the quality of a match, which are then combined with algorithmic matching to generate better match results. We provide an empirical evidence, established based on an experiment with more than 200 human matchers over common benchmarks, that PoWareMatch predicts well the benefit of extending the match with an additional correspondence and generates high quality matches. In addition, PoWareMatch outperforms state-of-the-art matching algorithms.

RELATED WORK
https://arxiv.org/abs/2109.07321

BIO
Roee Shraga is a Postdoctoral fellow at the Khoury College of Computer Sciences at Northeastern University. He received his PhD degree in 2020 from the Technion – Israel Institute of Technology in the area of Data Science. Roee has published more than a dozen papers in leading journals and conferences on the topics of data integration, human-in-the-loop, machine learning, process mining, and information retrieval. He is also a recipient of several PhD fellowships including the Leonard and Diane Sherman Interdisciplinary Fellowship (2017), the Daniel Excellence Scholarship (2019), and the Miriam and Aaron Gutwirth Memorial Fellowship (2020).

• (9/30, 12:00pm): Can Qin (Northeastern University): Neural Pruning via Growing Regularization details)

TITLE
Neural Pruning via Growing Regularization

ABSTRACT
Regularization has long been utilized to learn sparsity in deep neural network pruning. However, its role is mainly explored in the small penalty strength regime. In this work, we extend its application to a new scenario where the regularization grows large gradually to tackle two central problems of pruning: pruning schedule and weight importance scoring. (1) The former topic is newly brought up in this work, which we find critical to the pruning performance while receives little research attention. Specifically, we propose an L_2 regularization variant with rising penalty factors and show it can bring significant accuracy gains compared with its one-shot counterpart, even when the same weights are removed. (2) The growing penalty scheme also brings us an approach to exploit the Hessian information for more accurate pruning without knowing their specific values, thus not bothered by the common Hessian approximation problems. Empirically, the proposed algorithms are easy to implement and scalable to large datasets and networks in both structured and unstructured pruning. Their effectiveness is demonstrated with modern deep neural networks on the CIFAR and ImageNet datasets, achieving competitive results compared to many state-of-the-art algorithms.

RELATED WORK
ICLR 2021: https://openreview.net/pdf?id=o966_Is_nPA

BIO
Can Qin has received the B.E. degree from the Xidian University (XDU), China, in 2018. Currently, he is a Ph.D. candidate in the Department of Electrical and Computer Engineering, Northeastern University, under the supervision of Prof. Yun Raymond Fu. He has been awarded the Best Paper Award in ICCV Workshop on Real-World Recognition from Low-Quality Images and Videos 2019. He also has some top-tier conference papers accepted at NeurIPS, AAAI, ECCV, ICLR, et al. His research interests broadly include the transfer learning and deep learning.

• (10/7, 12:00pm): Senjuti Basu Roy (New Jersey Institute of Technology): Optimization Opportunities in Human-in-the-loop Systems details)

TITLE
Optimization Opportunities in Human-in-the-loop Systems

ABSTRACT
An emerging trend is to leverage an under-explored and richly heterogeneous pool of human knowledge inside machine algorithms, a practice popularly termed as human-in-the-loop (HIL) processes. A wide variety of applications, starting from sentiment analysis to image recognition, query processing to text translation, or even feature engineering stand to benefit from such synergistic man-machine collaboration. This talk will explore optimization opportunities inside such HIL systems, considering the roles and responsibilities of three key stakeholders - humans (workers), machines (algorithms), and platforms (online infrastructure where the work takes place). Optimization inside such HIL systems investigates judicious involvement of workers inside machine algorithms, as well as the desired functionality of the platforms to satisfy a variety of goals pertinent to the aforementioned stakeholders. Following that, the talk will specifically discuss both modeling as well as algorithmic challenges in task design and deployment inside large-scale HIL systems.

RELATED WORK
Multi-Session Diversity to Improve User Satisfaction in Web Applications. WWW 2021.
https://dl.acm.org/doi/10.1145/3442381.3450046
Recommending Deployment Strategies for Collaborative Tasks. SIGMOD 2020.
https://dl.acm.org/doi/10.1145/3318464.3389719
Making AI Machines Work for Humans in FoW. SIGMOD Record 2020.
https://dl.acm.org/doi/10.1145/3442322.3442327

BIO
Senjuti Basu Roy is the Panasonic Chair in Sustainability and an Associate Professor at the Department of Computer Science at the New Jersey Institute of Technology. Her broader research interests lie in the area of large scale data management with the focus on designing principled algorithms for "human-in-the-loop" systems. She is the tutorial co-chair of The Web Conference 2022, has served as the Mentorship co-chair of SIGMOD 2018, PhD workshop co-chair of VLDB 2018, co-chair of SEADATA Workshop 2021 (colocated with VLDB 2021), and HMData Workshops 2017-2021 (colocated with IEEE BigData conference). She is a recipient of the NSF CAREER Award, and one of the 100 invited early career engineers to attend the National Academy of Engineering’s 2021 US Frontiers of Engineering Symposium.

• (10/14, 12:00pm): Steven Holtzen (Northeastern University): Exploiting Symmetry for Scaling Discrete Factor Graph Inference details)

TITLE
Exploiting Symmetry for Scaling Discrete Factor Graph Inference

ABSTRACT
A key goal in the design of probabilistic inference algorithms is identifying and exploiting properties of the distribution that make inference tractable. One such property is symmetry, which is characterized by points in the distribution that are guaranteed to have the same probability. In this talk I will describe two inference algorithms for discrete factor graphs that scale in the degree of symmetry of the distribution. The first inference algorithm, called orbit generation, is the first exact inference algorithm for factor graphs that scales in the degree of symmetry of the distribution. The second inference algorithm is a Markov-Chain Monte-Carlo algorithm that mixes rapidly in the degree of symmetry.

RELATED WORK
Generating and Sampling Orbits for Lifted Probabilistic Inference. Steven Holtzen, Todd Millstein, and Guy Van den Broeck. In Uncertainty in Artificial Intelligence (UAI), 2019.
https://arxiv.org/abs/1903.04672

BIO
Steven Holtzen is an assistant professor at Northeastern University. His research focuses on programming languages, artificial intelligence, and machine learning. In particular he is interested in probabilistic programming languages, foundations of probabilistic inference, tractable probabilistic modeling, automated reasoning, and probabilistic verification. His work has been recognized by an ACM SIGPLAN distinguished paper award.

• (10/21, 12:00pm): Nesime Tatbul (MIT): Towards Explainable Time Series Anomaly Detection details)

TITLE
Towards Explainable Time Series Anomaly Detection

ABSTRACT
Time series is a ubiquitous data type with a growing range of applications from telemetry to finance. A key capability that lies at the core of managing time series data is the identification of unusual patterns of interest called anomalies. Detecting and explaining anomalies not only finds use in many mission-critical domains, but also empowers systems and users with the ability to handle large data volumes by guiding attention and resources to information that matters the most. Despite years of effort, diversity of time series applications, often noisy nature of datasets, and contextual variations in anomaly types and instances challenge the creation of robust and generalizable solutions. In this talk, I will present a collection of novel data science tools and techniques to help overcome such challenges in practice, including Exathlon -- the first public benchmark for explainable anomaly detection over high-dimensional time series.

Exathlon: http://vldb.org/pvldb/vol14/p2613-tatbul.pdf, https://github.com/exathlonbenchmark/exathlon
Metro-Viz: https://dl.acm.org/doi/pdf/10.1145/3299869.3320247

SHORT BIO
Nesime Tatbul is a senior research scientist at Intel Labs and MIT, currently serving as an industry co-PI for MIT's Data Systems and AI Lab jointly funded by Intel, Google, and Microsoft. Previously, she served on the computer science faculty of ETH Zurich after receiving a Ph.D. degree from Brown University. Her research interests are broadly in large-scale data management systems and modern data-intensive applications, with a current focus on time series analytics and learned data systems. She has been an active member of the database research community for 20+ years, serving in various roles for the VLDB Endowment, ACM SIGMOD, and other organizations.

• (10/28, 12:00pm): Azza Abouzied (NYU): Scalable Prescriptive Analytics in Database Systems details)

TITLE
Scalable Prescriptive Analytics in Database Systems

ABSTRACT
Currently, database systems do not natively support the many data processing needs of data-driven decision making, leaving experts to develop their own custom, ad hoc application-level solutions that are difficult to scale and may produce sub-optimal results. While many systems provide support for scalable descriptive analytics (like statistics and summaries of the raw data) and even some predictive analytics (such as forecasts), there is little support for prescriptive analytics, which searches for the best course of action given the available data. As we move from "what is the data?" to "what to do with it?,” we need to augment database systems with efficient computational problem-solving capabilities that take into consideration the inherent uncertainty of data and models. In this talk, I will explore some of the systems we built to integrate state-of-the-art solvers within the DBMS to scalably solve stochastic constrained optimization problems. I will also describe our work in building a scalable system to support sequential decision-making, and how this system is being used to help public health policy makers construct cost-effective policies that curb epidemics like COVID-19.

https://dl.acm.org/doi/10.1145/3472749.3474794
https://dl.acm.org/doi/10.1145/3318464.3389765
http://packagebuilder.cs.umass.edu/papers/p576-brucato.pdf

BIO
Azza Abouzied is an associate professor of computer science at New York University Abu Dhabi. Her research focuses on designing intuitive data querying tools and combines techniques from various research fields such as HCI, machine learning, and database systems. In 2019, she won a VLDB test of time award for her work on HadoopDB. Her work on integrating decision-making support in database systems received a Best of VLDB recognition, a SIGMOD Research Highlight, a CACM research highlight and a Best VLDB demo award. She earned her doctoral degree from Yale in 2013. She spent a year as a visiting scholar at UC Berkeley.

• (11/4, 12:00pm): Brian Hentschel (Harvard): Cerebral Data Structures details)

TITLE
Cerebral Data Structures

ABSTRACT
Data structures form the basis of all data-driven software applications and operations on data structures form a critical part of the overall cost of systems. For instance, just a singular data structure, hash tables, and in just a single language, C++, accounts for 2% of total CPU usage and 5% of total RAM usage across all of Google.
We make the case for cerebral data structures, which use machine learning and statistical modeling at a high level to redesign base data structures in computer science. This redesign keeps the core structure of classical data structures; by keeping this core, the operations remain simple and therefore efficient as well as robust to shifting workloads. Additionally, keeping this core allows for theoretical arguments about data structure performance. At the same time, machine learning and statistics are used to transfer properties of the workload and data into the data structure through a redesign, achieving better expected performance than classic designs.
As well as discussing the general approach, we present in detail two applications of this approach. First, we present Stacked Filters, which uses workload skew to produce 100X lower false positive rates for filter data structures. Second, we present Entropy-Learned Hashing, which separates hashing speed from input size, producing 10X faster hash function evaluation and 4X faster hash tables than state-of-the-art approaches from Facebook and Google.

RELATED WORK
https://stratos.seas.harvard.edu/files/stratos/files/stackedfilters_vldb2021_extended_version.pdf

SHORT BIO
Brian Hentschel is a Ph.D. candidate at Harvard University advised by Stratos Idreos. He is interested in blending statistics and machine learning with classical techniques to improve computer systems. His research has been awarded the SIGMOD research highlight award as well a best paper from EDBT. He earned his BA in mathematics and computer science at Pomona College and has previously spent time at Microsoft, Amazon, LinkedIn, and IBM.

• (11/5, 12:00pm): Kuldeep Meel (National University of Singapore): Statistical Learning and Symbolic Reasoning: Better Together for Software 2.0 details)

TITLE
Statistical Learning and Symbolic Reasoning: Better Together for Software 2.0

ABSTRACT
The advent of personal computing has made manual tabling of data obsolete in today's context. What about the future where software engineers would balk at the prospect of the data-driven manual design of heuristics?. Well, we think such a future is on the horizon: in this talk, I will discuss our ambitious project, CrystalBall, that seeks to automate the design of heuristics in modern SAT solvers. The past two decades have witnessed how SAT went from the dreaded Non-deterministic Polytime (NP)-complete problem to Not a Problem (NP) for formal methods and AI community. Such progress was fueled by experts' careful design of heuristics, and every year's SAT competition sees experts continue to spend hundreds of hours tuning these heuristics. I will discuss how such heuristics can be learned automatically by statistical techniques and glimpses of the future where experts only focus on high-level ideas.

RELEVANT PAPER
https://www.comp.nus.edu.sg/~meel/Papers/sat19skm.pdf

BLOG
https://www.msoos.org/2019/06/crystalball-sat-solving-data-gathering-and-machine-learning/

BIO
Kuldeep Meel holds the NUS Presidential Young Professorship in the School of Computing at the National University of Singapore. His research interests lie at the intersection of Formal Methods and Artificial Intelligence. He is a recipient of the 2019 NRF Fellowship for AI (accompanied with S$2.5 million funding) and was named AI’s 10 to Watch by IEEE Intelligent Systems in 2020. His work received the 2020 Amazon Research Award, the 2018 Ralph Budd Award for Best PhD Thesis in Engineering, 2014 Outstanding Masters Thesis Award from Vienna Center of Logic and Algorithms and Best Student Paper Award at CP 2015. His CP-18 paper, CAV-20, and PODS-21 papers received IJCAI-19 Sister conferences best paper award track invitation, "Best of PODS-21" invite from from ACM TODS, and "Best Papers of CAV-20" invite from FMSD journal respectively. • (11/11, 12:00pm): Xiao Hu (Duke): Enumeration Algorithms for Conjunctive Queries with Projection details) TITLE Enumeration Algorithms for Conjunctive Queries with Projection ABSTRACT We investigate the enumeration of query results for an important subset of CQs with projections, namely star and path queries. The task is to design data structures and algorithms that allow for efficient enumeration with delay guarantees after a preprocessing phase. Our main contribution is a series of results based on the idea of interleaving precomputed output with further join processing to maintain delay guarantees, which may be of independent interest. In particular, we design combinatorial algorithms that provide instance-specific delay guarantees in nearly linear preprocessing time. These algorithms improve upon the currently best known results. Further, we show how existing results can be improved upon by using fast matrix multiplication. We also present new results involving tradeoff between preprocessing time and delay guarantees for enumeration of path queries that contain projections. CQs with projection where the join attribute is projected away is equivalent to boolean matrix multiplication. Our results can therefore be also interpreted as sparse, output-sensitive matrix multiplication with delay guarantees. RELATED WORK [1] Enumeration Algorithms for Conjunctive Queries with Projection, Shaleen Deep, Xiao Hu, Paraschos Koutris, ICDT 2021. https://arxiv.org/abs/2101.03712 [2] Trade-offs in Static and Dynamic Evaluation of Hierarchical Queries, Ahmet Kara, Milos Nikolic, Dan Olteanu, Haozhe Zhang, PODS 2020. https://arxiv.org/abs/1907.01988 [3] On acyclic conjunctive queries and constant delay enumeration, Guillaume Bagan, Arnaud Durand, and Etienne Grandjean, CSL 2007. https://grandjean.users.greyc.fr/Recherche/PublisGrandjean/EnumAcyclicCSL07.pdf [4] Structural tractability of enumerating csp solutions, G. Greco and F. Scarcello, Constraints 2013. https://arxiv.org/abs/1005.1567 SHORT BIO Xiao Hu is a postdoctoral associate in the Department of Computer Science at Duke University, co-supervised by Prof. Pankaj Agarwal and Prof. Jun Yang. Prior to that, she received her Ph.D. in Computer Science and Engineering from HKUST, and a BE degree in Computer Software from Tsinghua University. Her research has focused on studying fundamental problems in database theory and their implications to practical systems. Her work on massively parallel join algorithms has been invited to ACM Transactions on Database Systems as a research paper, as well as a feature article in the Database Principles Column in SIGMOD Record. • (11/18, 12:00pm): Stijn Vansummeren (Hasselt University): General Dynamic Yannakakis: Conjunctive Queries with Theta Joins Under Updates details) TITLE General Dynamic Yannakakis: Conjunctive Queries with Theta Joins Under Updates ABSTRACT The ability to efficiently analyze changing data is a key requirement of many real-time analytics applications. In database terms, such analysis is closely related to the problem of Incremental View Maintenance (IVM) where we are asked to maintain the result Q(db) of a query Q on a database db under updates. In this talk I will summarize selected algorithmic ideas of the IVM literature and illustrate that they inherently rely on either (1) recomputation of query subresults or (2) materialization of subresults. Both have their drawbacks: recomputation is detrimental to update latency while materialization may waste both memory and be detrimental to latency. By moving to the framework of query evaluation with small delay, introduced by the seminal paper of Baghan, Durand, and Grandjean in 2007, we are able to circumvent both drawbacks. In particular, for the so called q-hierarchical queries, it is possible to (1) represent query results and subresults succinctly, in space at most linear in the database; (2) enumerate the results from this representation efficiently: with constant delay; and (3) maintain the representation efficiently: in constant time for updates of constant size. I will illustrate how we can obtain these features by modifying Yannakakis' seminal algorithm for evaluating Acyclic Conjunctive Queries. The resulting algorithm, which is called Dynamic Yannakakis, can be generalized to not only apply to Acyclic Conjunctive Queries (which only contain equality joins by definition), but also to such queries endowed with theta-joins (in particular: inequality predicates like A < B). This talk summarizes our results published in SIGMOD 2017, VLDB 2018, and VLDB Journal 2020, which received the SIGMOD Research Highlights Award 2018. RELATED WORK * On acyclic conjunctive queries and constant delay enumeration, Guillaume Bagan, Arnaud Durand, and Etienne Grandjean, CSL 2007. https://grandjean.users.greyc.fr/Recherche/PublisGrandjean/EnumAcyclicCSL07.pdf * General dynamic Yannakakis: conjunctive queries with theta joins under updates. Muhammad Idris, Martín Ugarte, Stijn Vansummeren, Hannes Voigt, Wolfgang Lehner. VLDB J. 29(2-3): 619-653 (2020). https://link.springer.com/article/10.1007/s00778-019-00590-9 * Efficient Query Processing for Dynamically Changing Datasets. Muhammad Idris, Martín Ugarte, Stijn Vansummeren, Hannes Voigt, Wolfgang Lehner. SIGMOD Rec. 48(1): 33-40 (2019). https://sigmodrecord.org/?smd_process_download=1&download_id=3073 * Conjunctive Queries with Inequalities Under Updates. Muhammad Idris, Martín Ugarte, Stijn Vansummeren, Hannes Voigt, Wolfgang Lehner. Proc. VLDB Endow. 11(7): 733-745 (2018). http://www.vldb.org/pvldb/vol11/p733-idris.pdf * The Dynamic Yannakakis Algorithm: Compact and Efficient Query Processing Under Updates. Muhammad Idris, Martín Ugarte, Stijn Vansummeren. SIGMOD Conference 2017: 1259-1274. https://martinugarte.com/media/pdfs/main_pDxeVno.pdf SHORT BIO Stijn Vansummeren is research professor of Data Management and Data Wrangling at the Data Science Institute of Hasselt University, Belgium. His research focuses on large scale data management, with as overarching theme the study, development, and application of formal approaches to wrangling, querying, and analyzing data at scale. Stijn Vansummeren obtained his PhD in 2005 from Hasselt University, where he was a PhD fellow of the Research Foundation Flanders (FWO) in the Databases and Theoretical Computer Science group, advised by Jan Van den Bussche. From 2005-2009, he was a postdoctoral fellow of the FWO in the same group. In 2009, he joined the Université Libre de Bruxelles (ULB), Belgium where he was associate professor of Computer Science at the Engineering faculty, until September 2020. His research has been awarded with the ACM SIGMOD Research Highlights Award (2018), a best paper award at WebDB (2016), and a best paper nomination award at the World Wide Web (WWW) conference (2008). • (12/2, 12:00pm): Asterios Katsifodimos (TU Delft): Valentine: Evaluating Matching Techniques for Dataset Discovery details) TITLE Valentine: Evaluating Matching Techniques for Dataset Discovery ABSTRACT Data scientists today search large data lakes to discover and integrate datasets. In order to bring together disparate data sources, dataset discovery methods rely on some form of schema matching: the process of establishing correspondences between datasets. This process has been traditionally taken care with schema matching techniques. After 20 years of research in schema matching, we are still missing a benchmark for schema matching, as well as proper datasets, and proper evaluation metrics! In this talk I will present Valentine, is an extensible open-source experiment suite to execute and organize large-scale automated matching experiments on tabular data. Valentine now includes implementations of 7 seminal schema matching methods that we either implemented from scratch (due to absence of open source code) or imported from open repositories. Finally, Valentine offers a data fabrication toolbox for constructing testing datasets with ground truth. I will conclude my talk with insights from a very large set of experiments we have been performing at TU Delft, focusing on the strengths and weaknesses of existing techniques, that can serve as a guide for employing schema matching in future dataset discovery methods. LINKS: Original Paper: https://ieeexplore.ieee.org/abstract/document/9458921 Demo: http://www.vldb.org/pvldb/vol14/p2871-koutras.pdf Project: https://delftdata.github.io/valentine/ Code: https://github.com/delftdata/valentine SHORT BIO Asterios Katsifodimos is an assistant professor the Delft University of Technology. Before TU Delft, Asterios worked at the SAP Innovation Center, and as a postdoc at TU Berlin. Asterios holds a PhD from INRIA & University of Paris 11 in France. Asterios currently works on scalable stream processing and data integration. • (12/3, 8:00am): Qichen Wang (Hong Kong UST): Maintaining Acyclic Foreign-Key Joins under Updates details) TITLE Maintaining Acyclic Foreign-Key Joins under Updates ABSTRACT In this paper, we study the problem of incrementally maintaining the query results of acyclic joins under updates, i.e., insertion and deletion of tuples to any of the relations. Prior work has shown that this problem is inherently hard, requiring at least$\Omega(|db|^{{1\over 2} -\epsilon})$time per update, where$|db|$is the size of the database, and$\epsilon > 0$can be any small constant. However, this negative result holds only on adversarially constructed update sequences; on the other hand, most real-world update sequences are "nice", nowhere near these worst-case scenarios. We introduce a measure$\lambda$, which we call the enclosureness of the update sequence, to more precisely characterize its intrinsic difficulty. We present an algorithm to maintain the query results of any acyclic join in$O(\lambda)$time amortized, on any update sequence whose enclosureness is$\lambda$. This is complemented with a lower bound of$\Omega(\lambda^{1-\epsilon})$, showing that our algorithm is essentially optimal with respect to$\lambda$. Moreover, the new measurement also recovers prior lower bounds on static as well as dynamic query evaluation. Next, using this algorithm as the core component, we show how all the 22 queries in the TPC-H benchmark can be supported in$\tilde{O}(\lambda)$time. Finally, based on the algorithms developed, we built a continuous query processing system on top of Flink, and experimental results show that our system outperforms previous ones significantly. RELATED WORKS * Wang, Qichen, and Ke Yi. "Maintaining Acyclic Foreign-Key Joins under Updates." In SIGMOD 2020. https://www.cse.ust.hk/~yike/sigmod20.pdf * Chirkova, Rada, and Jun Yang. "Materialized views." Foundations and Trends in Databases 4, no. 4, 2011. http://db.cs.duke.edu/papers/fntdb12-ChirkovaYang-mat_views.pdf * Idris, Muhammad, Martín Ugarte, and Stijn Vansummeren. "The dynamic Yannakakis algorithm: Compact and efficient query processing under updates." In PVLDB 2017. https://dl.acm.org/doi/pdf/10.1145/3035918.3064027 * Ahmad, Yanif, Oliver Kennedy, Christoph Koch, and Milos Nikolic. "DBToaster: Higher-order Delta Processing for Dynamic, Frequently Fresh Views," PVLDB 2012. http://vldb.org/pvldb/vol5/p968_yanifahmad_vldb2012.pdf * Berkholz, Christoph, Jens Keppeler, and Nicole Schweikardt. "Answering conjunctive queries under updates." In PODS 2017. https://dl.acm.org/doi/pdf/10.1145/3034786.3034789 SHORT BIO Qichen Wang is a PhD candidate in the Department of Computer Science and Engineering at Hong Kong University of Science and Technology, supervised by Prof. Ke Yi. Prior to that, he received a BE degree in Computer Science from Zhejiang University. His research has focused on studying optimization techniques for query evaluation, from theoretical aspects to practical implementations. He is also interested in parallel and distributed algorithms, and algorithms for data streams. • (12/16, 12:00pm): Bill Howe (University of Washington): Data-Centric AI: Reuse, Integration, and Synthesis of Weakly Structured Data details) TITLE Data-Centric AI: Reuse, Integration, and Synthesis of Weakly Structured Data ABSTRACT What good is a collection of 1000s of loosely related tables? Repositories of weakly structured datasets -- datasets with rows and columns but few other guarantees -- have been built by organizations, cities, and in scientific domains, but the anticipated value of these systems has been difficult to realize. Urban open data repositories, data marketplaces, enterprise intranets, and scientific repositories are motivated by the idea that the data can be reused and integrated, but technical friction, limited provenance, data quality issues, limited search and discovery, and governance restrictions make widespread reuse difficult. Across a number of projects in the last decade, we have considered ways to make these repositories more valuable by enabling global reasoning over collections of locally defined datasets. I'll discuss a few projects in this space, including SQLShare, where we aimed to share and reuse the verbs (SQL queries) as well as the nouns (tables), ClaimJumper, where we aimed to enable repository-wide claim verification and statistical inference, and equitensors, where we aim to learn integrated, reusable, and unbiased representations from data with a shared time and space domain. Ultimately, our goal is to understand the limits of using diverse, undercurated, weakly structured datasets to train better models. I'll finish with some ideas for synthesizing data to enable better training and evaluation of ML models in specialized application domains where data is scarce. RELATED WORK https://faculty.washington.edu/billhowe/publications/pdfs/jain2016sqlshare.pdf https://faculty.washington.edu/billhowe/publications/pdfs/grechkin17ezlearn.pdf https://dl.acm.org/doi/abs/10.1145/3448016.3452777 (sorry for the paywall) bonus: https://faculty.washington.edu/billhowe//publications/pdfs/jain_cidr_2019.pdf BIO Bill Howe is Associate Professor in the Information School and Adjunct Associate Professor in the Allen School of Computer Science & Engineering and the Department of Electrical Engineering. His research interests are in data management, machine learning, and visualization, particularly as applied in the physical and social sciences. As Founding Associate Director of the UW eScience Institute, Dr. Howe played a leadership role in the Moore-Sloan Data Science Environment program through a$32.8 million grant awarded jointly to UW, NYU, and UC Berkeley, and founded UW’s Data Science for Social Good Program. With support from the MacArthur Foundation, NSF, and Microsoft, Howe directs UW’s participation in the Cascadia Urban Analytics Cooperative. He founded the UW Data Science Masters Degree, serving as its inaugural Program Chair, and created a first MOOC on data science that attracted over 200,000 students. His research has been featured in the Economist and Nature News, and he has authored award-winning papers in conferences across data management, machine learning, and visualization. He has a Ph.D. in Computer Science from Portland State University and a Bachelor’s degree in Industrial & Systems Engineering from Georgia Tech.

• (12/17, 12:00pm): Luana Ruiz (UPenn): Graphon Signal Processing details)

TITLE
Graphon Signal Processing

ABSTRACT
Graphons are inﬁnite-dimensional objects that represent the limit of convergent sequences of graphs as their number of nodes goes to inﬁnity. This paper derives a theory of graphon signal processing centered on the notions of graphon Fourier transform and linear shift invariant graphon ﬁlters, the graphon counterparts of the graph Fourier transform and graph ﬁlters. It is shown that for convergent sequences of graphs and associated graph signals: (i) the graph Fourier transform converges to the graphon Fourier transform when the graphon signal is bandlimited; (ii) the spectral and vertex responses of graph ﬁlters converge to the spectral and vertex responses of graphon ﬁlters with the same coefﬁcients. These theorems imply that for graphs that belong to certain families, i.e., that are part of sequences that converge to a certain graphon, graph Fourier analysis and graph ﬁlter design have well deﬁned limits. In turn, these facts extend applicability of graph signal processing to graphs with large number of nodes — since signal processing pipelines designed for limit graphons can be applied to ﬁnite graphs — and to dynamic graphs — since we can relate the result of SP pipelines designed for different graphs from the same convergent graph sequence.

RELATED WORK
* Graphon Signal Processing, IEEE Transactions on Signal Processing ( Volume: 69), 2021.
https://arxiv.org/pdf/2003.05030
* Graphon Neural Networks and the Transferability of Graph Neural Networks
https://proceedings.neurips.cc/paper/2020/file/12bcd658ef0a540cabc36cdf2b1046fd-Paper.pdf
* Transferability Properties of Graph Neural Networks
https://arxiv.org/pdf/2112.04629.pdf

SHORT BIO
Luana Ruiz received the B.Sc. degree in electrical engineering from the University of São Paulo, Brazil, and the M.Sc. degree in electrical engineering from the École Supérieure d'Electricité (now CentraleSupélec), France, in 2017. She is currently a Ph.D. candidate with the Department of Electrical and Systems Engineering at the University of Pennsylvania. Her research interests are in the areas of large-scale graph machine learning and the mathematical foundations of deep learning. She was awarded an Eiffel Excellence scholarship from the French Ministry for Europe and Foreign Affairs between 2013 and 2015, nominated an iREDEFINE fellow in 2019 and a MIT EECS Rising Star in 2021, and received best student paper awards at the European Signal Processing Conference (EUSIPCO) in 2019 and 2021.

### Summer 2021 (Wolfgang)

• (5/6, 12:00pm): Ioannis Liagouris (Boston University): Secrecy: Secure collaborative analytics on secret-shared data details)

TITE
Secrecy: Secure collaborative analytics on secret-shared data

ABSTRACT
In this talk I will present Secrecy, a new relational framework for secure collaborative analytics. Secrecy is based on replicated secret sharing and allows composing and executing end-to-end oblivious queries under secure multi-party computation (MPC). Secrecy’s core novelty is a set of logical and physical optimizations that effectively reduce query execution costs while retaining the full security guarantees of MPC. We evaluate Secrecy using real and synthetic queries from several application areas. Our experiments demonstrate that the optimizations we propose can improve query performance by orders of magnitude, enabling Secrecy to outperform state-of-the-art frameworks and scale to much larger datasets than those reported in prior works.

RELATED WORK
Secrecy: Secure collaborative analytics on secret-shared data John Liagouris, Vasiliki Kalavri, Muhammad Faisal, Mayank Varia
https://arxiv.org/abs/2102.01048

BIO
John Liagouris is a research scientist at the Hariri Institute for Computing and an adjunct assistant professor at Boston University. His research interests lie in distributed systems and databases. Before joining BU, he was a visiting scholar at the RISELab, UC Berkeley, a senior researcher at the Systems Group, ETH Zurich, a visiting research fellow at the University of Hong Kong (HKU), and a research assistant at the “Athena” Research Center, Greece. John obtained his PhD from NTU Athens, Greece.

• (6/2, 1:00pm): Fabrizio Silvestri (Sapienza University of Rome): Neural databases: Natural language generation meets databases details)

TITE
Neural databases: Natural language generation meets databases

ABSTRACT
We asked ourselves a question: can we build a database management system that doesnâ€™t rely on the fundamental concept of a schema. In recent years, neural networks have shown impressive performance gains on long-standing AI problems, and in particular, answering queries from natural language text. These advances raise the question of whether they can be extended to a point where we can relax the fundamental assumption of database management, namely, that our data is represented as fields of a predefined schema. We present a first step in answering that question and we describe NeuralDB: a database system with no predefined schema. In NeuralDB updates and (select) queries are given in natural language. We develop query processing techniques that build on the primitives offered by the state-of-the-art Natural Language Processing methods. We begin by demonstrating that at the core, recent NLP transformers, powered by pre-trained language models, can answer select-project-join queries if they are given the exact set of relevant facts. However, they cannot scale to non-trivial databases and cannot perform aggregation queries. Based on these findings, we describe a NeuralDB architecture that runs multiple Neural Select Project Join (SPJ) operators in parallel, each with a set of database sentences that can produce one of the answers to the query. The result of these operators is fed to an aggregation operator if needed. We describe an algorithm that learns how to create the appropriate sets of facts to be fed into each of the Neural SPJ operators. Importantly, this algorithm can be trained by the Neural SPJ operator itself. We experimentally validate the accuracy of NeuralDB and its components, showing that we can answer queries over thousands of sentences with very high accuracy.

RELATED WORK
VLDB 2021: From natural language processing to neural databases
James Thorne, Majid Yazdani, Marzieh Saeidi, Fabrizio Silvestri, Sebastian Riedel, Alon Halevy
http://www.vldb.org/pvldb/vol14/p1033-thorne.pdf
https://arxiv.org/pdf/2010.06973

BIO
Fabrizio Silvestri is a Full Professor at the Computer Engineering Dept. of the Sapienza University of Rome. Formerly a Research Scientist at Facebook AI in London, his interests are in AI applied to integrity-related problems and the application of Natural Language Processing. In the past, he has worked on web search research, and in particular, his specialization is building systems to better interpret queries s from search users. Prior to Facebook, Fabrizio was a principal scientist at Yahoo where he has worked on sponsored search and native ads within the Gemini project. Fabrizio holds a Ph.D. in Computer Science from the University of Pisa, Italy where he studied problems related to Web Information Retrieval with a particular focus on Efficiency-related problems like Caching, Collection Partitioning, and Distributed IR in general.

This is part of the the Experiential AI Distinguished Lectures Series
Register here: https://eaidistinguished.splashthat.com/

• (6/16, 1:00pm): Ricardo Baeza-Yates (Northeastern University): Ethics in AI: A Challenging Task details)

TITE
Ethics in AI: A Challenging Task

ABSTRACT
In the first part of this seminar, we will cover five current specific challenges through examples: (1) discrimination (e.g., facial recognition, justice, sharing economy, language models); (2) phrenology (e.g., biometric based predictions); (3) unfair digital commerce (e.g., exposure and popularity bias); (4) stupid models (e.g., Signal, minimal adversarial AI) and (5) indiscriminate use of computing resources (e.g., large language models). These examples do have a personal bias but set the context for the second part where we will address four generic challenges: (1) too many principles (e.g., principles vs. techniques), (2) cultural differences (e.g., Christian vs. Muslim); (3) regulation (e.g., privacy, antitrust) and (4) our cognitive biases. We will wrap up the seminar by discussing what we can do to address these challenges in the near future.

RELATED WORK
Towards intellectual freedom in an AI Ethics Global Community (Opinion paper). AI and Ethics. 2021.

BIO
Ricardo Baeza-Yates is a Research Professor at the Institute for Experiential AI of Northeastern University. Before he was the CTO of NTENT, a semantic search technology company based in California and prior to these roles, he was VP of Research at Yahoo Labs, based in Barcelona, Spain, and later in Sunnyvale, California, from 2006 to 2016. He is co-author of the best-seller Modern Information Retrieval textbook published by Addison-Wesley in 1999 and 2011 (2nd ed), that won the ASIST 2012 Book of the Year award. From 2002 to 2004 he was elected to the Board of Governors of the IEEE Computer Society and between 2012 and 2016 was elected to the ACM Council. In 2009 he was named ACM Fellow and in 2011 IEEE Fellow, among other awards and distinctions. He obtained a Ph.D. in CS from the University of Waterloo, Canada, in 1989, and his areas of expertise are web search and data mining, information retrieval, bias and ethics on AI, data science and algorithms in general.

Regarding the topic of the seminar, he is actively involved as an expert in many initiatives, committees, and advisory boards related to Responsible AI around the world including: Global AI Ethics Consortium, Global Partnership on AI, IADB's fAIr LAC Initiative (Latin America and the Caribbean), Council of AI (Spain) and ACM's Technology Policy Subcommittee on AI and Algorithms (USA). He is also a co-founder of OptIA in Chile - a NGO devoted to algorithmic transparency and inclusion - and a member of the editorial committee of the new AI and Ethics journal where he co-authored an article highlighting the importance of research freedom on ethical AI.

This is part of the the Experiential AI Distinguished Lectures Series
Register here: https://experientialaids2.splashthat.com/

• (7/22, 12:00pm): Saket Gurukar (Ohio State University): MILE: A multi-level framework for scalable graph embedding details)

TITLE
MILE: A Multi-Level Framework for Scalable Graph Embedding

ABSTRACT
Recently there has been a surge of interest in designing graph embedding methods. Few, if any, can scale to a large-sized graph with millions of nodes due to both computational complexity and memory requirements. In this paper, we relax this limitation by introducing the MultI-Level Embedding (MILE) framework -- a generic methodology allowing contemporary graph embedding methods to scale to large graphs. The proposed MILE framework is agnostic to the underlying graph embedding techniques and can be applied to many existing graph embedding methods without modifying them. We employ our framework on several popular graph embedding techniques and experimental results on five large-scale datasets demonstrate that MILE significantly boosts the speed (order of magnitude) of graph embedding while generating embeddings of better quality. MILE can comfortably scale to a graph with 9 million nodes and 40 million edges, on which existing methods run out of memory or take too long to compute on a modern workstation.

RELATED PAPERS
* Liang, Jiongqian, Saket Gurukar, and Srinivasan Parthasarathy. "Mile: A multi-level framework for scalable graph embedding." ICWSM '21.
https://arxiv.org/abs/1802.09612
* Gurukar, Saket, et al. "Network representation learning: Consolidation and renewed bearing." arXiv preprint arXiv:1905.00987 (2019).
https://arxiv.org/abs/1905.00987

SHORT BIO
Saket Gurukar is a Ph.D. candidate in CSE Department at The Ohio State University. His research interests include Network Representation Learning and Recommendation Systems. He has published papers in KDD, SIGMOD, ICWSM conferences. Prior to joining the Ph.D. program, he worked at IBM Research Labs, India as a Research Software Engineer and has completed his masters from IIT Madras.

• (7/23, 12:00pm): Nikos Tziavelis (Northeastern University): Beyond Equi-joins: Ranking, Enumeration and Factorization details)

TITLE
Beyond Equi-joins: Ranking, Enumeration and Factorization

ABSTRACT
We study theta-joins in general and join predicates with conjunctions and disjunctions of inequalities in particular, focusing on ranked enumeration where the answers are returned incrementally in an order dictated by a given ranking function. Our approach achieves strong time and space complexity properties: with 𝑛 denoting the number of tuples in the database, we guarantee for acyclic full join queries with inequality conditions that for every value of 𝑘, the 𝑘 top-ranked answers are returned in O(𝑛 polylog𝑛 + 𝑘 log𝑘) time. This is within a polylogarithmic factor of the best known complexity for equi-joins and even of O(𝑛 + 𝑘), the time it takes to look at the input and return 𝑘 answers in any order. Our guarantees extend to join queries with selections and many types of projections, such as the so-called free-connex queries. Remarkably, they hold even when the entire output is of size 𝑛^ℓ for a join of ℓ relations. The key ingredient is a novel O(𝑛 polylog𝑛)-size factorized representation of the query output, which is constructed on-the-fly for a given query and database. In addition to providing the first non-trivial theoretical guarantees beyond equi-joins, we show in an experimental study that our ranked-enumeration approach is also memory-efficient and fast in practice, beating the running time of state-of-the-art database systems by orders of magnitude.

"Beyond Equi-joins: Ranking, Enumeration and Factorization", Tziavelis, Gatterbauer, Riedewald, PVLDB 2021
https://arxiv.org/pdf/2101.12158
https://northeastern-datalab.github.io/anyk/
https://db.khoury.northeastern.edu/activities/

BIO
Nikolaos Tziavelis is a 3rd year PhD student at the Khoury College of Computer Sciences of Northeastern University. His research interests lie in algorithms for database query processing and distributed computing. He holds a MEng in Electrical and Computer Engineering from the National Technical University of Athens.

### Spring 2021 (Wolfgang)

• (1/20, 4:00pm): Anna Fariha (UMass Amherst): Enhancing Usability and Explainability of Data Systems details)

TITLE
Enhancing Usability and Explainability of Data Systems

ABSTRACT
The recent growth of data science expanded its reach to an ever-growing user base of non-experts, increasing the need for democratization and transparency in these systems. Democratization demands that a system can be used by people with different skills and backgrounds alike. Transparency requires explainability that helps the users understand and trust the system function, especially when unexpected behavior occurs. Unfortunately, most existing data systems offer limited usability and support for explanations: these systems are usable only by experts with sound technical skills, and even experts are hindered by the lack of transparency into the inner workings of the systems. In this talk, I will talk about usability-enhancing systems, built in the programming-by-example paradigm, in two different settings: querying relational databases and personalized document summarization. I will then talk about a new data-profiling primitive that can characterize data points for which an ML model is likely to produce untrustworthy predictions. I will then briefly touch upon an explanation framework that can explain causes of non-deterministic software failures. Finally, I will talk about my future research agenda on how I plan to expand my current research to support a diverse group of users—ranging from end users to data scientists.

PAPERS
1. Anna Fariha, Alexandra Meliou: Example-Driven Query Intent Discovery: Abductive Reasoning using Semantic Similarity. PVLDB 2019. https://afariha.github.io/papers/SQuID_VLDB_2019.pdf
2. Anna Fariha, Suman Nath, Alexandra Meliou: Causality-Guided Adaptive Interventional Debugging. SIGMOD 2020. https://afariha.github.io/papers/AID_SIGMOD_2020.pdf
3. Anna Fariha, Ashish Tiwari, Arjun Radhakrishna, Sumit Gulwani, Alexandra Meliou: Conformance Constraint Discovery: Measuring Trust in Data-Driven Systems. SIGMOD 2021 (to appear). https://afariha.github.io/papers/Conformance_Constraints_SIGMOD_2021.pdf

BIO
Anna Fariha is a Ph.D. candidate in the College of Information and Computer Sciences at the University of Massachusetts, Amherst, advised by Professor Alexandra Meliou. While her primary area of research revolves around data management, the application areas of her research have been interdisciplinary—spanning from program synthesis and software engineering to machine learning, natural language processing, and human-computer interaction. She is interested in designing mechanisms for enhancing system usability and developing intelligent tools towards boosting productivity and agility for a diverse group of users—ranging from end users to data scientists. The outcome of her research has been published in SIGMOD and VLDB, and her work has been recognized by several awards, including Microsoft research dissertation grant 2020 and VLDB 2020 best demonstration runner-up award.

• (1/21, 12:00pm): Prashant Pandey (Berkeley): Methods for Indexing and Searching Large-Scale Genomic Data details)

TITLE
Methods for Indexing and Searching Large-Scale Genomic Data

ABSTRACT
The ability to efficiently index and query large-scale genomic data is critical to achieving the full potential of many medical and scientific applications such as personalized medicine and population-level disease analysis. In this talk, I will describe two of my recent works on methods for large-scale indexing and searching on genomic data.
First, Order Min Hash (OMH), a locality-sensitive hash for edit distance. This method is a refinement of the minHash LSH used to approximate the Jaccard similarity, in that OMH, is sensitive not only to the k-mer (length-k subsequence) contents of the sequences but also to the relative order of the k-mers in the sequences. We show that OMH is a better proxy for edit distance compared to the minHash LSH.
Second, VariantStore, an index for large-scale genomic variant search. The genomic variant search involves querying variants across thousands of coordinate systems which makes this problem computationally intensive and hard to scale to a large number of samples. VariantStore is a system for efficiently indexing and querying genomic variants and their sequences in either the reference or sample-specific coordinate systems. It represents genomic variants as a variation graph and builds a light-weight position index on the graph to perform variant queries. We show the scalability of VariantStore by indexing variants from the TCGA project containing more than 8K samples while other systems could not scale beyond 2500 samples.

RELATED PAPERS
https://www.biorxiv.org/content/10.1101/2019.12.24.888297v2

BIO
Prashant Pandey is a Postdoctoral Research Fellow at Lawrence Berkeley Lab and University of California Berkeley working with Prof. Kathy Yelick and Prof. Aydin Buluc. Prior to that, he spent one year as a postdoc at Carnegie Mellon University (CMU) working with Prof. Carl Kingsford. He obtained his Ph.D. in 2018 in Computer Science at Stony Brook University and was co-advised by Prof. Michael Bender and Prof. Rob Johnson.
His research interests lie at the intersection of systems and algorithms. He designs and builds tools backed by theoretically well-founded data structures for large-scale data management problems across computational biology, stream processing, and storage. He is also the main contributor and maintainer of multiple open-source software tools that are used by hundreds of users across academia and industry.

• (1/28, 12:00pm): Paolo Boldi (University of Milan): Centralities in Network Analysis details)

ABSTRACT
Given a social network, which of its nodes are more central? This question was asked many times in sociology, psychology and computer science, and a whole plethora of _centrality measures_ were proposed to account for the importance of the nodes of a network. Also, many modern IR problems call for a mixture of classical text-retrieval methods with graph mining techniques, especially within the area of social-network search, where node centrality can play a crucial role. But what do existing centrality measures actually measure? To what extent do they agree? What is their meaning when they disagree? And further: which of them can be computed or approximated reasonably on the large (huge) graphs under observation? In my talk, I will try to give a comprehensive, historically and mathematically coherent account of the most important centrality measures from the literature, and provide the audience with a tentative taxonomy that can be of help in trying to understand what these indices attempt at measuring. Then, I will propose to assume an axiomatic approach, wherein I suggest some simple, basic properties a centrality measure should have, and again I will propose a new classification of such axioms. Finally, I will provide a sketch of a result we recently obtained about rank monotonicity of damped spectral indices, like PageRank or Katz’s index.

RELATED PAPERS
• Paolo Boldi, Sebastiano Vigna: Axioms for Centrality. Internet Math. 10(3-4): 222-262 (2014). https://arxiv.org/abs/1308.2140
• Paolo Boldi, Alessandro Luongo, Sebastiano Vigna: Rank monotonicity in centrality measures. Netw. Sci. 5(4): 529-550 (2017). https://air.unimi.it/retrieve/handle/2434/527940/947157/rank.pdf
• Steve Chien, Cynthia Dwork, Ravi Kumar, Daniel R. Simon, & D. Sivakumar (2004). Link evolution: Analysis and algorithms. Internet math., 1(3), 277–304 https://www.tandfonline.com/doi/abs/10.1080/15427951.2004.10129090

BIO
Paolo Boldi is full Professor at the Università degli Studi di Milano since 2015, where he is currently the co-ordinator of the PhD Program in Computer Science and of the Computer Science Degree. His main research topics are algorithms and data structures for big data, web crawling and indexing, graph compression, succinct and quasi-succinct data structures, distributed systems, anonymity and alternative models of computation. Recently, his works focused on problems related to complex networks (especially, the World-Wide Web, social networks and biological networks), a field where his research has also produced software tools used by many people working in the same area. He chaired many important conferences in this sector (e.g., WSDM, WWW, ACM WebScience), and published over one hundred papers; he was also recipient of three Yahoo! Faculty Awards and co-recipient of a Google Focused Award, and member of many EU research projects. He was keynote speaker at many conferences such as ECIR, SPIRE, MFCS, IIR and invited scholar at the Institut des Hautes Études Scientifiques.

• (2/4, 12:00pm): Pierre Bourhis (CNRS): How to extract subwords efficiently? An enumeration approach details)

TITLE:
How to extract subwords efficiently? An enumeration approach

ABSTRACT
The problem of extracting subwords through a query has been an important task in different fields: Information Extraction, Complex Event Processing, Bio-computer Sciences, NLP. It has also been studied formally in Theory of Computer Science through the notion of transducer. With a new formalism introduced in the Information Extraction setting: the spanners and its core formalized by a type of automata, VA automaton, there has been a new interest on extracting subwords, in particular on efficient evaluation of such formalism. In this context, the main question that has been studied is the problem of enumerating the solutions efficiently, i.e computing in a compact manner the solutions through a preprocessing and then finding the next solution through this structure.
In this talk, we present some recent results on enumeration for different formalisms where automata are non-deterministic [1] or when the answers are ranked via a cost function [2].

BIBLIOGRAPHY
[1] Antoine Amarilli, Pierre Bourhis, Stefan Mengel, Matthias Niewerth: Constant-Delay Enumeration for Nondeterministic Document Spanners. SIGMOD Rec. 49(1). https://arxiv.org/pdf/2003.02576
[2] Pierre Bourhis, Alejandro Grez, Louis Jachiet and Cristian Riveros. Ranked enumeration of MSO logic on words. ICDT 2021. https://arxiv.org/pdf/2010.08042

BIO
Pierre Bourhis is a tenured CNRS researcher at UMR 9189 CRIStAL at Lille and teaches Data Sciences at Ecole Polytechnique. He works at the team SPIRALS, a joined team of INRIA and the University of Lille. His main interests are data and knowledge management, reasoning on logical rules, query optimization, explanation in knowledge processing and security of data. He obtained his PhD in Computer Sciences in 2011 under the supervision of Serge Abiteboul. Before coming to CRIStAL in 2013, he did a post-doctorat at University of Oxford with Prof. Michael Benedikt.

• (2/18, 12:00pm): Aristotelis Leventidis (Northeastern University): DomainNet: Homograph Detection for Data Lake Disambiguation details)

TITLE
DomainNet: Homograph Detection for Data LakeDisambiguation

ABSTRACT
Modern data lakes are deeply heterogeneous in the vocabulary that is used to describe data. We study a problem of disambiguation in data lakes: how can we determine if a data value occurring more than once in the lake has different meanings and is therefore a homograph? While word and entity disambiguation have been well studied in computational linguistics, data management and data science, we show that data lakes provide a new opportunity for disambiguation of data values since they represent a massive network of interconnected values. We investigate to what extent this network can be used to disambiguate values. DomainNet uses network-centrality measures on a bipartite graph whose nodes represent values and attributes to determine, without supervision, if a value is a homograph. A thorough experimental evaluation demonstrates that state-of-the-art techniques in domain discovery cannot be re-purposed to compete with our method. Specifically, using a domain discovery method to identify homographs has a precision and a recall of 38% versus 69% with our method on a synthetic benchmark. By applying a network-centrality measure to our graph representation, DomainNet achieves a good separation between homographs and data values with a unique meaning. On a real data lake our top-200 precision is 89%.

RELATED PAPER
• DomainNet: Homograph Detection for Data LakeDisambiguation. Aristotelis Leventidis, Laura Di Rocco, Wolfgang Gatterbauer, Renée J. Miller, Mirek Riedewald. EDBT 2021. https://northeastern-datalab.github.io/table-as-query/download/EDBT21-DomainNet-Homograph-Detection.pdf
https://northeastern-datalab.github.io/table-as-query/

• (2/25, 12:00pm): Nikos Tziavelis (Northeastern): Beyond Equi-joins: Ranking, Enumeration and Factorization details)

TITLE
Beyond Equi-joins: Ranking, Enumeration and Factorization

ABSTRACT
How can we efficiently represent and enumerate the results of theta-join queries? Factorization techniques have found notable success as a compact representation scheme that allows for enumeration algorithms where the results are produced incrementally after a preprocessing phase. However, these factorized representations have mainly been limited so far to equi-joins. This talk will present factorization techniques for general theta-join queries where the join conditions between relations go beyond equality. These are particularly useful for the task of ranked enumeration, where the query results are returned in the order specified by a ranking function. Perhaps surprisingly, for acyclic queries with general DNFs of inequality predicates we give asymptotic guarantees that are within a polylogarithmic factor of the best-known guarantee for equi-joins. We also target cases that occur often in practice such as a single predicate or a band-join and further improve the succinctness of our representation.

PREPRINT
https://arxiv.org/abs/2101.12158
https://northeastern-datalab.github.io/anyk/

BIO
Nikolaos Tziavelis is a 3rd year PhD student at the Khoury College of Computer Sciences of Northeastern University. His research interests lie in algorithms for database query processing and distributed computing. He holds a MEng in Electrical and Computer Engineering from the National Technical University of Athens.

• (3/4, 12:00pm): Ariful Azad (Indiana University Bloomington): Computational Building Blocks for Machine Learning on Graphs details)

TITLE
Computational Building Blocks for Machine Learning on Graphs

ABSTRACT
A graph is a beautiful mathematical concept that can concisely model any interacting system such as protein interactions in organisms, chemical bonds in compounds, and friendships on social networks. Thus, machine learning on graphs for predicting edges, node features, and community structures plays an important role in computational biology, social science, neurology, and computational chemistry.
In this talk, I will discuss computational building blocks needed to design graph machine learning algorithms including graph embedding, graph neural networks (GNNs) and graph visualization. Computationally, major steps in these algorithms can be mapped to sampled dense-dense matrix multiplication (SDDMM) and sparse-dense matrix multiplication (SpMM). We can even fuse these two operations into a single kernel called FusedMM that captures almost all computational patterns needed by popular graph embedding and GNN approaches. I will then discuss parallel algorithms for these kernels. As the performances of these linear-algebraic operations are bound by the memory bandwidth, we develop algorithms that minimize data movements from the main memory and utilize cache and registers as much as possible. We also auto tune these operations so that the same code can perform equally well on Intel, AMD, IBM Power9 and ARM Processors. The kernel-based design and tuned parallel implementations speed up end-to-end graph embedding and GNN algorithms by up to 28x over existing approaches based on the message passing paradigm. The take home message of this talk is that developing graph machine learning algorithms using efficient building blocks provides a cleaner interface to application developers and boosts the performance of end-to-end graph learning applications.

RELATED WORK
* FusedMM: A Unified SDDMM-SpMM Kernel for Graph Embedding and Graph Neural Networks
https://arxiv.org/abs/2011.06391
* Force2Vec: Parallel force-directed graph embedding
https://arxiv.org/abs/2009.10035

BIO
Dr. Ariful Azad is an Assistant Professor of Intelligent Systems Engineering at Luddy School of Informatics, Computing, and Engineering in Indiana University (IU). He was a Research Scientist in the Computational Research Division at Lawrence Berkeley National Laboratory. Dr. Azad obtained Ph.D. from Purdue University and B.S. from Bangladesh University of Engineering and Technology. His research interests are in graph machine learning, sparse matrix algorithms, high-performance computing, and bioinformatics. His interdisciplinary research group strives to solve large-scale problems in genomics, earth science, and scientific computing.

• (3/11, 12:00pm): Bertram Ludaescher (University of Illinois Urbana-Champaign): From Provenance Polynomials to Provenance Patterns details)

TITLE
From Provenance Polynomials to Provenance Patterns (work-in-progress with Sahil Gupta)

ABSTRACT
It is well known that the commutative semiring of integer polynomials N[X] provides a unifying framework (so-called how-provenance) for many earlier provenance approaches in databases. A provenance-annotated answer A′ reveals how output tuples depend on input tuples and thus can be used to answer strictly more questions than the original answer A without provenance. We propose to use provenance patterns, a closely related notion implicit in earlier work, to support answering additional questions, not answerable by A′ alone. We show that the pattern-enhanced answer A* can be easily obtained as a by-product of computing A or A’. Using a detailed running example, we illustrate questions that can and cannot be answered using A, A′, and A*, respectively.

RELATED PAPERS
* Karvounarakis, G., & Green, T. J. (2012). Semiring-annotated data: queries and provenance. ACM SIGMOD Record, 41(3), 5-14. https://dl.acm.org/doi/10.1145/2380776.2380778
* Köhler, S., Ludäscher, B., & Smaragdakis, Y. (2012, September). Declarative Datalog Debugging for Mere Mortals. In International Datalog 2.0 Workshop (pp. 111-122). LNCS 7494, Springer, Berlin, Heidelberg. https://dl.acm.org/doi/abs/10.1007/978-3-642-32925-8_12

BIO
Bertram Ludäscher is a professor at the School of Information Sciences at the University of Illinois, Urbana-Champaign and a faculty affiliate with the National Center for Supercomputing Applications (NCSA) and the CS department at UIUC. Until 2014 he was a professor at the CS department at the University of California, Davis and at the UC Davis Genome Center. His research interests range from practical questions in scientific data and workflow management, to database theory, and knowledge representation & reasoning. Prior to his faculty appointments, he was a research scientist at the San Diego Supercomputer Center (SDSC) and an adjunct faculty at the CSE department at UCSD. He received his M.S. (Dipl.-Inform.) in CS from the University of Karlsruhe (K.I.T.), and his PhD (Dr. rer. nat.) from the University of Freiburg, both in Germany.

• (3/18, 12:00pm): Roee Shraga (Technion): Cross-Domain Schema Matching using Deep Similarity Matrix Adjustment and Evaluation details)

TITLE
Cross-Domain Schema Matching using Deep Similarity Matrix Adjustment and Evaluation

ABSTRACT

RELATED WORK
* Shraga, Gal, Roitman. ADnEV: cross-domain schema matching using deep similarity matrix adjustment and evaluation. PVLDB 2020. http://www.vldb.org/pvldb/vol13/p1401-shraga.pdf
* Gal, Roitman, Shraga. Learning to rerank schema matches. TKDE 2019. https://ieeexplore.ieee.org/abstract/document/8944172
* Gal, Shraga. Humans' role in-the-loop. ACM SIGMOD Blog. 2020. http://wp.sigmod.org/?p=3138

BIO
Roee Shraga is a Postdoctoral fellow at the Technion – Israel Institute of Technology, from which he received a PhD degree in 2020 in the area of Data Science. Roee has published more than a dozen papers in leading journals and conferences on the topics of data integration, human-in-the-loop, machine learning, process mining, and information retrieval. He is also a recipient of several PhD fellowships including the Leonard and Diane Sherman Interdisciplinary Fellowship (2017), the Daniel Excellence Scholarship (2019), and the Miriam and Aaron Gutwirth Memorial Fellowship (2020).

• (4/8, 12:00pm): Christoph Riedl (Northeastern University): Spite is contagious in dynamic networks details)

TITLE
Spite is contagious in dynamic networks

ABSTRACT
Spite, costly behavior that harms others, presents an evolutionary puzzle: given that both the actor and recipient do worse, how could it emerge? We show that dynamically evolving interaction networks provide a novel explanation for the evolution of costly harm. Previous work has shown that anti-correlated interaction (e.g., negative assortment or negative relatedness) among behavioral strategies in populations can lead to the evolution of costly harm. We show that these approaches are blind to important features of interaction brought about by a co-evolution of network and behavior and that these features enable the emergence of spite. We analyze a new model in which agents can inflict harm on others at a cost to themselves, and simultaneously learn how to behave and with whom to interact. We find spite emerges reliably under a wide range of conditions. Our model reveals that when interactions occur in dynamic networks the population can exhibit correlated and anti-correlated behavioral interactions simultaneously, something not possible in standard models. In dynamic networks spite evolves due to transient and partial anti-correlated interaction, even when other behaviors are positively correlated and average degree of correlated interaction in the population is low.

RELATED PAPERS
1) Spite is Contagious - https://www.nature.com/articles/s41467-020-20436-1
2) Avoiding the bullies: Resilience of cooperation among unequals - https://www.dropbox.com/s/3rto9cinnq3m3ia/FoleyEtAl2020-WP-Bullies-2020-09-07.pdf?dl=0
3) Conflict and Convention in Dynamic Networks - https://royalsocietypublishing.org/doi/10.1098/rsif.2017.0835

BIO
Christoph Riedl is associate professor for Information Systems and Network Science at the D’Amore-McKim School of Business at Northeastern University. He hold a joint appointment with the Khoury College of Computer Sciences and is a core faculty member at the Network Science Institute. He is a fellow at the Institute for Quantitative Social Science (IQSS) at Harvard and the Center for Collective Intelligence at MIT. His work has been funded by NSF, ARO, ONR, and DARPA, and has been published in leading journals including Science, Nature Communications, Organization Science, Management Science, Information Systems Research, and Academy of Management Discoveries.

• (4/15, 12:00pm): Arash Termehchy (Oregon State University): The Data Interaction Game details)

TITLE
The Data Interaction Game

ABSTRACT
As most users do not precisely know the structure and/or the content of databases, their queries do not exactly reflect their information needs. The database systems may interact with users and use their feedback on the returned results to learn the information needs behind their queries. Current query interfaces assume that users do not learn and modify the way they express their information needs during their interaction with the system. Using a real-world interaction workload, we show that users learn and modify how to express their information needs and their learning is accurately modeled by a well-known online learning scheme. We model the interaction between a user and data system as a game with identical interest between two rational agents whose goal is to establish a common language for representing information needs in form of queries. We propose an online query answering method that adapts to users' learning and prove that it improves the effectiveness of answering queries stochastically speaking. We provide two efficient implementations of this method over large relational databases and show that our algorithms are more effective than the state-of-the-art query answering method using extensive empirical studies.

RELATED WORK
* How Do Users and Data Systems Establish a Common Query Language? Ben McCamish, Vahid Ghadakchi, Arash Termehchy, Liang Huang and Behrouz Touri. SIGMOD Record on ACM SIGMOD Research Highlights 48 (1), ,2019
* The Data Interaction Game Ben McCamish, Vahid Ghadakchi, Arash Termehchy, Behrouz Touri and Liang Huang. The Proceedings of SIGMOD, June 2018.

BIO
Arash Termehchy is an Associate Professor at the School of Electrical Engineering & Computer Science in Oregon State University. He received his PhD from University Of Illinois at Urbana Champaign. His research interests are human-centric data systems, querying and learning on large & heterogeneous data, and using ML in complex systems. His work is recognized by the ACM SIGMOD Research Highlight Award, Best of Conference Selections of SIGMOD and ICDE, Best Student Paper Award of ICDE, and Yahoo! Key Scientific Challenges Award.

• (4/22, 12:00pm): Saravanan Thirumuruganathan (Qatar Computing Research): Graph Embeddings for Data Integration and Cardinality Estimation details)

TITLE
Graph Embeddings for Data Integration and Cardinality Estimation

ABSTRACT
Embeddings are a learned representation that characterizes an entity as a dense high-dimensional vector. There has been extensive work on learning embeddings for diverse types of entities such as words, documents, images, videos, graphs, sets, proteins, and many more. Despite the seeming diversity, most of the database applications have limited themselves to word embeddings.
In this talk, I will discuss two applications of graph embeddings in the disparate areas of data integration and cardinality estimation. The common idea is to construct an application-specific graph data structure to represent the underlying relational data and learn node embeddings using customized random walks. The learned embeddings have several appealing properties and can be used for various downstream tasks with excellent performance.

RELATED WORK
2. Suraj Shetiya, Saravanan Thirumuruganathan, Nick Koudas, Gautam Das. Astrid: Accurate Selectivity Estimation for String Predicates using Deep Learning. VLDB 2021. Link: http://www.vldb.org/pvldb/vol14/p471-shetiya.pdf

BIO
Dr. Saravanan (Sara) Thirumuruganathan is a Scientist at the Qatar Computing Research Institute (QCRI). Recently, he is working on applying deep learning techniques for problems in data integration, query optimization, and fake news. He received his Ph.D. in 2015 from the University of Texas at Arlington. His work has received several awards including Best demo at ICDE 2019, Best of VLDB 2018 and 2012, 2019 SIGMOD Research highlights and a CACM research highlight.

• (4/29, 12:00pm): Angela Bonifati (Université Lyon 1): Evaluating Top-k Queries with Inconsistency Degrees details)

TITLE
Evaluating Top-k Queries with Inconsistency Degrees

ABSTRACT
We study the problem of augmenting relational tuples with degrees of inconsistency, that reflect the satisfaction of denial constraints (DCs). We define a notion of inconsistent tuples with respect to a set of DCs and define two measures of inconsistency degrees, which consider single and multiple violations of constraints. In order to compute these measures, we leverage two models of provenance, namely why-provenance and provenance polynomials. We then investigate top-k queries that allow to rank the answer tuples by their inconsistency degrees. Since one of our measures is monotonic and the other non-monotonic, we design an integrated top-k algorithm to compute the top-k results of a query w.r.t. both inconsistency measures.
By means of an extensive experimental study, we gauge the effectiveness of inconsistency-aware query answering and the efficiency of our algorithm with respect to a baseline, where query results are fully computed and ranked afterwards

Paper (PVLDB 2020) http://www.vldb.org/pvldb/vol13/p2146-issa.pdf

BIO
Angela Bonifati is a Professor of Computer Science at Lyon 1 University and affiliated with the CNRS Liris research lab since 2015. She received her Ph.D. from Politecnico di Milano in 2002 and right after she was a postdoctoral researcher at INRIA Rocquencourt in Paris for one year. Her current research interests are on the interplay of relational and graph-oriented data paradigms, particularly on data integration and Big Data curation for life sciences, query processing and learning for structured and unstructured data models. She has co-authored several publications in first-rate venues of the data management field along with two books (edited by Springer in 2011 and Morgan&Claypool in 2018) and an invited paper in ACM Sigmod Record in 2018. She is the Program Chair of ACM Sigmod 2022 and was the Program Chair of EDBT 2020. She is Associate Editor in several database conferences including VLDB 2021, ICDE 2021 and ICDE 2018. She is also Associate Editor of the VLDB Journal, ACM TODS, Distributed and Parallel Databases and Frontiers in Big Data. She is the President of the EDBT Executive Committee and a member of the ICDT council. She holds many visiting scholar positions in foreign universities in both Europe and North America, the latest of which at the University of Waterloo (Canada) in 2019. She is currently Adjunct Professor in this university.

### Fall 2020 (Wolfgang)

• (9/10, 12:00pm): Bert Huang (Tufts University): Speeding up Markov Random Field Inference and Learning details)

TITLE
Speeding up Markov Random Field Inference and Learning

ABSTRACT
Markov random fields (MRFs) are powerful statistical models that can encode global distributions with local relationships over variables. Unfortunately, algorithms for learning and inference have historically suffered poor scalability and difficult software engineering. In this talk, I’ll cover two strategies we’ve developed to improve the scalability of MRF algorithms on modern hardware, both that build on aspects that have made deep learning successful. First, I’ll discuss an abstraction that enables multiple approximate inference algorithms to be computed via sparse matrix operations. Like the dense matrix backends that have made deep learning software highly portable, sparse matrix operations are low-level, general computations, they are often efficiently optimized on a variety of computing environments. This abstraction allows seamless implementation on many hardware backends, making MRF algorithms fast on current and future computers. Second, I’ll discuss provably convergent stochastic learning algorithms that exploit repeated structure in MRF models. Finally, I will discuss the big picture surrounding these modeling approaches and their role in the modern era.

RELATED WORK
* Sparse-Matrix Belief Propagation Reid Bixler, Bert Huang. Conference on Uncertainty in Artificial Intelligence, 2018.
http://berthuang.com/papers/bixler-uai18.pdf
* Block Belief Propagation for Parameter Learning in Markov Random Fields You Lu, Zhiyuan Liu, Bert Huang. AAAI Conference on Artificial Intelligence, 2019.
https://arxiv.org/abs/1811.04064

BIO
Bert Huang is an assistant professor in the Department of Computer Science and the Data Intensive Studies Center at Tufts University. He earned his Ph.D. from Columbia University in 2011, was a postdoctoral research associate at the University of Maryland, and previously was an assistant professor at Virginia Tech. His research addresses topics surrounding machine learning, including structured prediction, weakly supervised learning, and algorithmic fairness. His papers have been published at conferences including NeurIPS, ICML, UAI, and AISTATS, and he is an action editor for the Journal of Machine Learning Research.

• (9/16, 12:00pm): Andreas Loukas (EPFL Lausanne): Erdoes Goes Neural: an Unsupervised Learning Framework for Combinatorial Optimization on Graphs details)

TITLE
Erdoes Goes Neural: an Unsupervised Learning Framework for Combinatorial Optimization on Graphs

ABSTRACT
Combinatorial optimization (CO) problems are notoriously challenging for neural networks, especially in the absence of labeled instances. This work proposes an unsupervised learning framework for CO problems on graphs that can provide integral solutions of certiﬁed quality. Inspired by Erd˝os’ probabilistic method, we use a neural network to parametrize a probability distribution over sets. Crucially, we show that when the network is optimized w.r.t. a suitably chosen loss, the learned distribution contains, with controlled probability, a low-cost integral solution that obeys the constraints of the combinatorial problem. The probabilistic proof of existence is then derandomized to decode the desired solutions. We demonstrate the efﬁcacy of this approach to obtain valid solutions to the maximum clique problem and to perform local graph clustering. Our method achieves competitive results on both real datasets and synthetic hard instances.

RELATED WORK
https://arxiv.org/abs/2006.10643

BIO
I am a research scientist (Ambizione fellow) at the LTS2 lab in EPFL, Switzerland. Previously, I spent time at TU Berlin, Ben Gurion, and TU Delft. My research focuses on the foundations and applications of graph methods in machine learning and data science. I aim to find elegant explanations for phenomena associated with learning and to exploit them in order to design specialized learning machines. I am also interested in graph problems in signal processing and theoretical computer science, as well as in the analysis of neural networks.
https://andreasloukas.blog/

• (9/24, 12:00pm): Laurent Lessard (Northeastern): Automating the analysis and design of large-scale optimization algorithms details)

TITLE
Automating the analysis and design of large-scale optimization algorithms

ABSTRACT
Most complicated optimization problems, in particular those involving a large number of variables, are solved in practice using iterative algorithms. The problem of selecting a suitable algorithm is currently more of an art than a science; a great deal of expertise is required to know which algorithms to try and how to properly tune them. Moreover, there are seldom performance guarantees. In this talk, I will show how the problem of algorithm selection can be approached using tools from robust control theory. By solving simple semidefinite programs (that do not scale with problem size), we can derive robust bounds on convergence rates for popular algorithms such as the gradient method, proximal methods, fast/accelerated methods, and operator-splitting methods such as ADMM. The bounds derived in this manner either match or improve upon the best known bounds from the literature. The bounds also lead to a natural energy dissipation interpretation and an associated Lyapunov function. Finally, our framework can be used to search for algorithms that meet desired performance specifications, thus establishing a principled methodology for designing new algorithms. The talk will be broadly accessible; no prior control theory knowledge needed!

RELATED WORK
https://laurentlessard.com/public/siopt16_iqcopt.pdf

BIO
Laurent Lessard is a new Associate Professor of Mechanical and Industrial Engineering at Northeastern University (joined in Fall 2020), with courtesy appointments in ECE and Khoury. Before coming to Northeastern, Laurent was a Charles Ringrose Assistant Professor of Electrical and Computer Engineering at the UW–Madison and Faculty Member at the Wisconsin Institute for Discovery. He received the B.A.Sc. in Engineering Science from the University of Toronto, and received the M.S. and Ph.D. in Aeronautics and Astronautics at Stanford University. After completing his doctoral work, he was an LCCC Postdoc at Lund University in Sweden, and a postdoctoral researcher at the University of California, Berkeley. Dr. Lessard is a recipient of the Hugo Schuck best paper award and the NSF CAREER award. His research interests include: decentralized control, robust control, optimization, and machine learning. For more information, see https://laurentlessard.com/

• (9/30, 12:00pm): Xin Luna Dong (Amazon): Self-driving product understanding for thousands of categories details)

TITLE
Self-driving product understanding for thousands of categories

ABSTRACT
Knowledge graphs have been used to support a wide range of applications and enhance search results for multiple major search engines, such as Google and Bing. At Amazon we are building a Product Graph, an authoritative knowledge graph for all products in the world. The thousands of product verticals we need to model, the vast number of data sources we need to extract knowledge from, the huge volume of new products we need to handle every day, and the various applications in Search, Discovery, Personalization, Voice, that we wish to support, all present big challenges in constructing such a graph.
In this talk we describe our efforts for self-driving knowledge collection for products of thousands of types. The system includes a suite of novel techniques for taxonomy construction, product property identification, knowledge extraction, anomaly detection, and synonym discovery. Our system is a) automatic, requiring little human intervention, b) multi-scalable, scalable in multiple dimensions including many domains, products, and attributes, and c) integrative, exploiting rich customer behavior logs. We describe what we learned in building this product graph and applying it to support customer-facing applications.

PAPERS
- Xin Luna Dong, Xiang He, Andrey Kan, Xian Li, Yan Liang, Jun Ma, Yifan Ethan Xu, Chenwei Zhang, Tong Zhao, Gabriel Blanco Saldana, Saurabh Deshpande, Alexandre Michetti Manduca, Jay Ren, Surender Pal Singh, Fan Xiao, Haw-Shiuan Chang, Giannis Karamanolakis, Yuning Mao, Yaqing Wang, Christos Faloutsos, Andrew McCallum, Jiawei Han. AutoKnow: Self-Driving Knowledge Collection for Products of Thousands of Types. In KDD, 2020.
https://arxiv.org/abs/2006.13473
- Giannis Karamanolakis, Jun Ma, Xin Luna Dong. TXtract: Taxonomy-aware knowledge extraction for thousands of product categories. In ACL, 2020.
https://arxiv.org/abs/2004.13852
- Yuning Mao, Tong Zhao, Andrey Kan, Chenwei Zhang, Xin Luna Dong, Christos Faloutsos, Jiawei Han. Octet: Online catalog taxonomy enrichment with self-supervision. In SigKDD, 2020.
https://arxiv.org/abs/2006.10276

BIO
Xin Luna Dong is a Senior Principal Scientist at Amazon, leading the efforts of constructing Amazon Product Knowledge Graph. She was one of the major contributors to the Google Knowledge Vault project, and has led the Knowledge-based Trust project, which is called the “Google Truth Machine” by Washington’s Post. She has co-authored book “Big Data Integration”, was awarded ACM Distinguished Member, VLDB Early Career Research Contribution Award for “advancing the state of the art of knowledge fusion”, and Best Demo award in Sigmod 2005. She serves in VLDB endowment and PVLDB advisory committee, and is a PC co-chair for VLDB 2021, KDD'2020 ADS Invited Talk Series, ICDE Industry 2019, VLDB Tutorial 2019, and Sigmod 2018.​

• (12/2, 12:00pm): Yinjun Wu (UPenn): Rapid retraining of machine learning models details)

TITLE
Provenance-aided approaches to incrementally updating machine learning models

ABSTRACT
The ubiquitous use of machine learning algorithms brings new challenges to traditional database problems such as incremental view update on machine learning models. The problem of incrementally maintaining machine learning models as views occurs in many applications, e.g. better understanding the importance of training samples to the machine learning models, as well as identifying the effect of outliers in training datasets. However, this problem has not been thoroughly studied yet, especially when the updates of the models are triggered by the deletions of training samples. In this talk, first of all, I will present an efficient provenance-based approach, PrIU, for incrementally updating linear regression and logistic regression models. Plus, I will also introduce one extended solution for handling a more broader model class, i.e. strongly convex machine learning models. We prove the correctness and convergence of the incrementally updated model parameters by using those two approaches and validate this experimentally. Experimental results also show that significant speed-ups can be achieved by using those two approaches compared to simply retraining the model from scratch. More importantly, the incrementally updated models by using our approaches do not hurt the prediction performance on the test dataset.

RELATED WORK
* SIGMOD 2020: PrIU: A Provenance-Based Approach for Incrementally Updating Regression Models
https://dl.acm.org/doi/abs/10.1145/3318464.3380571
* ICML 2020: DeltaGrad: Rapid retraining of machine learning models
https://arxiv.org/abs/2006.14755

BIO
Yinjun Wu is a fifth-year PhD candidate at the department of computer and information science at University of Pennsylvania. His research interests lie at the intersection of data science and machine learning.

• (12/11, 12:00pm): Chris Jermaine (Rice University): The Tensor-Relational Algebra, and Other Ideas in Machine Learning System Design details)

TITLE
The Tensor-Relational Algebra, and Other Ideas in Machine Learning System Design

ABSTRACT
One area in which systems for machine learning are wanting (TensorFlow, PyTorch) is in their support for Big Models and Big Data. In contrast to modern relational systems which scale to large data sizes and multiple machines quite well out of the box, getting machine learning computations to work in a distributed setting or with large models is often very challenging. In this talk, I argue that the fundamental problem is lack of abstraction in these systems. I argue that it makes sense to re-design these systems from the ground up, applying many of the lessons from the heyday of relational database system design in the 1970's and 80's.

RELATED WORK
https://arxiv.org/abs/2009.00524

BIO
Chris Jermaine is a Professor of Computer Science at Rice University, and directs Rice's Data Science Initiative. He is the recipient of an Alfred P. Sloan Foundation Research Fellowship, a National Science Foundation CAREER award, and the George R. Brown School's Teaching & Research Excellence Award. He has received best paper/best paper runner-up awards from top journals/conferences in data mining and data management, including IEEE ICDE, ACM SIGMOD, ACM SIGKDD, and VLDB, as well as the IBM Pat Goldberg Award for the best IBM paper of 2008. He currently serves as the editor-in-chief of ACM Transactions on Database Systems, the ACM’s flagship journal for data management research.

• (12/16, 1:30pm): Sainyam Galhotra (UMass Amherst): Clustering with Applications to Data Integration details)

TITLE
Clustering with Applications to Data Integration

ABSTRACT
A growing number of data-based applications are used for decision-making that have far-reaching consequences and significant societal impact. Entity resolution, community detection and taxonomy construction are some of the building blocks of these applications and for these methods, clustering is the fundamental underlying concept. Therefore, the use of accurate, robust and fair methods for clustering cannot be overstated. We tackle the various facets of clustering with a multi-pronged approach described below.
(i) While identification of clusters that refer to different entities is challenging for automated strategies, it is relatively easy for humans. We study the robustness and scalability of these methods that leverage supervision through an oracle i.e an abstraction of crowdsourcing. We further design techniques to construct taxonomies capturing type-subtype relation over the identified entities.
(ii) In community detection applications, a common setback in evaluation of the quality of clustering techniques is the lack of ground truth data. We propose a generative model to capture interactions between records that belong to different clusters and devise techniques for efficient cluster recovery.
(iii) The manifestation of bias in data could arise due to discriminatory treatment of marginalized groups, sampling methods or even measurement errors in the data. We study the impact of this bias on generated clusters and develop techniques that guarantee fair representation from different groups. We prove the noise tolerance of our algorithms and back the theory by demonstrating the efficacy and efficiency on various real world datasets for these applications.

RELATED PAPERS
* Balancing the Tradeoff Between Clustering Value and Interpretability
https://doi.org/10.1145/3375627.3375843
* Fair Correlation Clustering
https://arxiv.org/abs/2002.03508

BIO
Sainyam Galhotra is a final year PhD student at University of Massachusetts Amherst. Before joining PhD, he was a researcher at Xerox Research and did undergraduate in computer science from IIT Delhi. His research is broadly in the area of data management with a specific focus on designing algorithms to not only be efficient but also transparent and equitable in their decision-making capabilities. He is a recipient of the Best Paper Award in FSE 2017, Most Reproducible Paper Award in SIGMOD 2017 and 2018. He is the first recipient of the Krithi Ramamritham Award at UMass for contribution to database research.

### Summer 2020 (Wolfgang)

• (5/8, 12:00pm): Panagiotis Mandros (Max Planck Institute): Functional dependencies and how to find them details)

TITLE
Functional Dependencies and how to find them

ABSTRACT
Functional dependencies come in many forms and shapes since they answer the fundamental question "what are the relevant features for a given target" that naturally occurs in many applications. These applications can range from database normalization, all the way to answering questions such as "does this bad habit cause this bad disease". In my talk, I will give a brief overview of these different FDs, and demonstrate how one can go from one end of the spectrum, i.e., database keys, to the other, i.e., causes and effects. By doing so, I hope to clarify a few things regarding which FD approach is appropriate for which application. At some point during the talk, I will present our solution to the FD discovery problem for exploratory data analysis. Our solution, DORA, finds the top FDs for a given target, is interpretable, i.e., facilitates the analysis of the results, is robust, i.e., the results can be trusted, and is very efficient in practice. Finally, I will end my talk by showing that DORA tackles the FD problem in a principled manner, contrary to a certain method that claims to have solved the FD problem once and for all.

RELATED PAPERS
* ICDM 2018: Discovering Reliable Dependencies from Data: Hardness and Improved Algorithms (best paper award)
Mandros, P, Boley, M, & Vreeken, J
http://eda.mmci.uni-saarland.de/pubs/2018/fedora-mandros,boley,vreeken.pdf
* KDD 2017: Discovering Reliable Approximate Functional Dependencies
Mandros, P, Boley, M, & Vreeken, J
http://eda.mmci.uni-saarland.de/pubs/2017/dora-mandros,boley,vreeken.pdf

BIO
Panagiotis Mandros is a PhD candidate at the Max Planck Institute for Informatics and the University of Saarland, Germany. He works on the intersection of Information Theory and Optimization, developing theory and algorithms for discovering associations in high-dimensional data. He is particularly interested in Exploratory Data Analysis, i.e., the principle of exploring data without assumptions on the underlying data generating process, and has contributed methods for finding "patterns" in both unsupervised and supervised scenarios (e.g., functional dependencies, Markov blankets). His work is published at KDD, ICDM, and is the recipient of the 2018 IEEE ICDM best paper award.

• (5/29, 12:30pm): Hung Ngo (Relational AI): Answering (Functional Aggregate) Queries via Tensor Decomposition details)

TITLE
Answering (Functional Aggregate) Queries via Tensor Decomposition

ABSTRACT
In this talk I will present a framework to think about and to design efficient algorithms for answering a generic class of queries called Functional Aggregate Queries (or FAQs). FAQs can be used to capture queries from a while range of application areas: databases, machine learning, CSPs, for instance. The tensor decomposition framework can be used to explain recent advances in CSP and database query processing such as the notions of fractional hypertree widths, submodular widths, worst-case optimal join algorithms, or the junction tree algorithm in graphical models.

The talk is based on several joint works with Mahmoud Abo Khamis, Ryan Curtin, Ben Moseley, Dan Olteanu, Atri Rudra, Maximilian Schleich, and Dan Suciu.

RELATED SLIDES
* FANDA: https://cse.buffalo.edu/~hungngo/papers/long-panda-slides.pdf
* FAQ: https://cse.buffalo.edu/~hungngo/papers/faq-insideout.pdf

BIO
Hung Ngo is currently VP of Algorithms at relationalAI. Before that, he was a professor at SUNY Buffalo, and a Computer Scientist with LogicBlox. His current research interests are in designing and implementing algorithms for solving computational problems defined declaratively.

• (6/5, 12:30pm): Nikos Tizavelis (Khoury): Optimal join algorithms meet top-k details)

TITLE
Optimal Join Algorithms Meet Top-k

ABSTRACT
Top-k queries have been studied intensively in the database community and they are an important means to reduce query cost when only the “best” or “most interesting” results are needed instead of the full output. While some optimality results exist, e.g., the famous Threshold Algorithm, they hold only in a fairly limited model of computation that does not account for the cost incurred by large intermediate results and hence is not aligned with typical database-optimizer cost models. On the other hand, the idea of avoiding large intermediate results is arguably the main goal of recent work on optimal join algorithms, which uses the standard RAM model of computation to determine algorithm complexity. This research has created a lot of excitement due to its promise of reducing the time complexity of join queries with cycles, but it has mostly focused on full-output computation. We argue that the two areas can and should be studied from a unified point of view in order to achieve optimality in the common model of computation for a very general class of top-k-style join queries. This tutorial has two main objectives. First, we will explore and contrast the main assumptions, concepts, and algorithmic achievements of the two research areas. Second, we will cover recent, as well as some older, approaches that emerged at the intersection to support efficient ranked enumeration of join-query results. These are related to classic work on k-shortest path algorithms and more general optimization problems, some of which dates back to the 1950s. We demonstrate that this line of research warrants renewed attention in the challenging context of ranked enumeration for general join queries.

RELATED WORK
Based on an upcoming SIGMOD 2020 tutorial and VLDB 2020 paper:
https://arxiv.org/pdf/2005.00448
https://arxiv.org/pdf/1911.05582
https://northeastern-datalab.github.io/anyk/
https://northeastern-datalab.github.io/topk-join-tutorial/

• (6/10, 12:00pm): Deepak Ajwani (UC Dublin): Average-case behavior of k-shortest path algorithms details)

TITLE
Average-case behavior of k-shortest path algorithms

ABSTRACT
The k-shortest path problem is a generalization of the fundamental shortest path problem, where the goal is to compute k simple paths from a given source to a target node, in non-decreasing order of their weight. With numerous applications modeling various optimization problems and as a feature in some learning systems, there is a need for efﬁcient algorithms for this problem. Unfortunately, despite many decades of research, the best directed graph algorithm still has a worst-case asymptotic complexity of ˜O(kn(n+m)). In contrast to the worst-case complexity, many algorithms have been shown to perform well on small diameter directed graphs in practice. In this paper, we prove that the average-case complexity of the popular Yen’s algorithm on directed random graphs with edge probability p = Ω(logn)/n in the unweighted and uniformly distributed weight setting is O(kmlogn), thus explaining the gap between the worst-case complexity and observed empirical performance. While we also provide a weaker bound of O(kmlog 4 n) for sparser graphs with p ≥ 4/n, we show empirical evidence that the stronger bound should also hold in the sparser setting. We then prove that Feng’s directed k-shortest path algorithm computes the second shortest path in expected O(m) time on random graphs with edge probability p = Ω(logn)/n. Empirical evidence suggests that the average-case result for the Feng’s algorithm holds even for k > 2 and sparser graphs.

RELATED PAPER
Complex networks 2018: https://researchrepository.ucd.ie/bitstream/10197/9887/2/ajwani_complex_networks18.pdf

BIO
Dr. Deepak Ajwani is an Assistant Professor at the School of Computer Science, University College Dublin. His research is at the confluence of diverse areas such as algorithms and data structures (with a focus on scalable graph algorithms), algorithm engineering, combinatorial optimization and machine learning.
He received his Ph.D. from Max Planck Institute for Informatics in 2008, for his work on I/O-efficient graph traversal algorithms. From 2008-2010, he worked as a Postdoctoral Researcher at MADALGO - Centre for Massive Data Algorithms - where he developed shared memory multicore algorithms for discrete optimization problems. From 2010-2012, he worked on an IRC funded project at University College Cork, in collaboration with IBM Research, for a project on designing graph partitioning and repartitioning techniques in the context of Exascale stream computing systems. From 2012-2018, he worked as a research scientist at Nokia Bell Labs, on the design and development of a cognitive computing tool for interpreting, organizing and navigating unstructured content.

• (6/26, 12:00pm): Pablo Barcelo (PUC Chile): The Logical Expressiveness of Graph Neural Networks details)

TITLE
The Logical Expressiveness of Graph Neural Networks

ABSTRACT
The ability of graph neural networks (GNNs) for distinguishing nodes in graphs has been recently characterized in terms of the Weisfeiler-Lehman (WL) test for checking graph isomorphism. This characterization, however, does not settle the issue of which Boolean node classifiers (i.e., functions classifying nodes in graphs as true or false) can be expressed by GNNs. We tackle this problem by focusing on Boolean classifiers expressible as formulas in the logic FOC2, a well-studied fragment of first order logic. FOC2 is tightly related to the WL test, and hence to GNNs. We start by studying a popular class of GNNs, which we call AC-GNNs, in which the features of each node in the graph are updated, in successive layers, only in terms of the features of its neighbors. We show that this class of GNNs is too weak to capture all FOC2 classifiers, and provide a syntactic characterization of the largest subclass of FOC2 classifiers that can be captured by AC-GNNs. This subclass coincides with a logic heavily used by the knowledge representation community. We then look at what needs to be added to AC-GNNs for capturing all FOC2 classifiers. We show that it suffices to add readout functions, which allow to update the features of a node not only in terms of its neighbors, but also in terms of a global attribute vector. We call GNNs of this kind ACR-GNNs.

RELEATED WORK
ICLR 2020 paper: https://openreview.net/forum?id=r1lZ7AEKvB
Previous work on the relationship between WL and GNNs:
a) Weisfeiler and Leman Go Neural: Higher-Order Graph Neural Networks https://aaai.org/ojs/index.php/AAAI/article/view/4384
b) How Powerful are Graph Neural Networks? https://openreview.net/forum?id=ryGs6iA5Km

BIO
Pablo Barcelo is Full Professor at Pontificia Universidad Católica de Chile, where he also acts as Director of the Institute for Mathematical and Computational Engineering, and Deputy Director of Millennium Institute for Foundational Research on Data. He is the author of more than 70 technical papers, has chaired ICDT 2019, will be chairing ACM PODS 2022, and is currently a member of the editorial committee of Logical Methods in Computer Science. From 2011 to 2014 he was the editor of the database theory column of SIGMOD Record. His areas of interest are database theory, logic in computer science, and the emerging relationship between these areas and machine learning.

• (7/1, 12:00pm): Jan Tönshoff (RWTH Aachen): Unsupervised Machine Learning for Constraint Satisfaction Problems details)

TITLE
Unsupervised Machine Learning for Constraint Satisfaction Problems

ABSTRACT
Constraint Satisfaction Problems (CSPs) are a powerful theoretical framework for modeling a wide range of combinatorial problems. I will present a novel neural network architecture for solving CSPs, which was developed by our research group. The architecture is generic and can be applied to any binary CSP. Training is unsupervised and it suffices to train on small random instances. The trained networks generalize well to much larger and structurally complex instances. Experiments on Max-2-Sat, Max-Cut and Max-IS demonstrate that our approach matches or surpasses conventional algorithms on several NP-hard problems, while being simple, generic and efficient.

RELATED PAPER
Graph Neural Networks for Maximum Constraint Satisfaction: https://arxiv.org/abs/1909.08387

BIO
I am a PhD student at RWTH Aachen University, where I work for Prof. Martin Grohe at the Chair of Logic and Theory of Discrete Systems. My research mainly focuses on applying machine learning to relational data structures, such as graphs, CSPs and databases.

• (7/3, 12:00pm): Leopoldo Bertossi (Adolfo Ibanez University, Santiago, Chile): Score-Based Explanations in Data and Machine Learning details)

TITLE
Score-Based Explanations in Data and Machine Learning

ABSTRACT
Explanations have become important in data management and machine learning. In this presentation we review some recent work on the use of counterfactual interventions, as they appear in actual causality, to provide scores that measure the contribution of database tuples to query answering, and that of feature values to results of classification models.

RELATED PAPERS
* Livshits, Bertossi, Kimelfeld, Sebag. The Shapley Value of Tuples in Query Answering. ICDT 2020.
http://arxiv.org/abs/1904.08679
* Bertossi, Li, Schleich, Suciu, Vagena. Causality-based Explanation of Classification Outcomes. DEEM@SIGMOD 2020.
http://arxiv.org/abs/2003.06868
* Bertossi. An ASP-Based Approach to Counterfactual Explanations for Classification. RuleML-RR 2020.
https://arxiv.org/abs/2004.13237

BIO
Leopoldo Bertossi is a Full Professor at the Faculty of Engineering and Sciences, "Universidad Adolfo Ibáñez" (Santiago, Chile), where he is the Director of the Graduate Program in Data Science. He is a Bachelor, Master and PhD in Mathematics by the "Pontificia Universidad Catolica de Chile" (PUC). His research has been concentrated in the areas of computational logic, knowledge representation, and data management. His research interests are related to Data Science and Artificial Intelligence in general. His is an Emeritus Professor of the School of Computer Science at Carleton University (Ottawa, Canada), Senior Computer Scientist at RelationalAI Inc., and Senior Researcher at the "Millenium Institute on Foundations of Data" (IMFD, Chile).

• (7/8, 12:00pm): Ji-Yong Shin (Yale / Northeastern): Isolation in cloud storage systems details)

TITLE
Isolation in cloud storage systems

ABSTRACT
Guaranteeing isolation in cloud storage systems is challenging due to heterogeneous user workloads and concurrent I/O requests. This talk focuses on improving two types of isolation support -- transactional isolation and fine-grained client-centric consistency control -- in cloud storage systems.
In the first part of the talk, the design and implementation of a block-level ACID transactional storage system called Isotope will be discussed. While transactional isolation has traditionally been implemented toward high levels of the storage stack, Isotope implements ACID transactions at the lowest common layer of the storage stack to fulfill the general need for concurrency control. Isotope demonstrates that the block-level transaction support trivializes the construction of transactional systems and even cross-application transactions.
The second part of the talk presents a new class of storage systems called StaleStore, which trades off consistency and performance within a storage server. StaleStore is designed based on the observation that modern servers are as powerful and parallel as distributed systems in the past and employs client-centric consistency (also known as distributed weak consistency) within a server. StaleStore accelerates server and distributed applications by returning an older version of data within a server if accessing the latest version is expected to be slower.

RELEATED PAPERS
- Isotope conference version: https://www.usenix.org/system/files/conference/fast16/fast16-papers-shin.pdf
- Isotope journal version: https://dl.acm.org/doi/pdf/10.1145/3032967
- StaleStore conference version: https://dl.acm.org/doi/pdf/10.1145/2987550.2987579

BIO
Ji-Yong Shin is an associate research scientist in the Department of Computer Science at Yale University and will be joining Khoury College of Computer Science in the coming fall. His research interests are in designing novel distributed systems and exploring practical formal verification methods that can be applied to system designs. He received his Ph.D. from Cornell University, where he designed cloud storage systems with enhanced isolation support and a completely wireless datacenter.

• (7/10, 12:00pm): Hongyang Zhang (Stanford / Wharton/ Northeastern): On the Generalization Effects of Linear Transformations in Data Augmentation details)

TITLE
On the Generalization Effects of Linear Transformations in Data Augmentation

ABSTRACT
Data augmentation is a powerful technique to improve performance in applications such as image and text classification tasks. Yet, there is little rigorous understanding of why and how various augmentations work. In this work, we consider a family of linear transformations and study their effects on the ridge estimator in an over-parametrized linear regression setting. First, we show that transformations which preserve the labels of the data can improve estimation by enlarging the span of the training data. Second, we show that transformations which mix data can improve estimation by playing a regularization effect. Finally, we validate our theoretical insights on MNIST. Based on the insights, we propose an augmentation scheme that searches over the space of transformations by how uncertain the model is about the transformed data. We validate our proposed scheme on image and text datasets. For example, our method outperforms RandAugment by 1.24% on CIFAR-100 using Wide-ResNet-28-10. Furthermore, we achieve comparable accuracy to the SoTA Adversarial AutoAugment on CIFAR datasets.

RELATED WORK
Based on joint work with Sen Wu, Greg Valiant, and Chris Re that appeared in ICML’20: https://arxiv.org/abs/2005.00695.

BIO
Hongyang Zhang is a postdoctoral researcher at the Statistics Department at Wharton School, University of Pennsylvania. He received a Ph.D. from Stanford University and a B.E. from Shanghai Jiao Tong University, both in computer science. His research interests lie in the intersection of machine learning, algorithms and statistics, including topics such as non-convex optimization, multi-task and transfer learning, weakly supervised learning, neural networks and its theory. He is a recipient of the best paper award at COLT 2018.

• (7/17, 12:00pm): Zach Ives (UPenn): Juneau: Managing and Guiding Data Science details)

TITLE
Juneau: Managing and Guiding Data Science

ABSTRACT
Many modern data science applications build on data lakes, schema-agnostic repositories of data files and data products that offer limited organization and management capabilities. There is a need to build data lake search capabilities into data science environments, so scientists and analysts can find tables, schemas, workflows, and datasets useful to their task at hand. We develop search and management solutions for the Jupyter Notebook data science platform, to enable scientists to augment training data, find potential features to extract, clean data, and find joinable or linkable tables. Our core methods also generalize to other settings where computational tasks involve execution of programs or scripts.

RELATED PAPERS
CIDR 2019: http://cidrdb.org/cidr2019/papers/p55-ives-cidr19.pdf
SIGMOD 2020: https://dl.acm.org/doi/pdf/10.1145/3318464.3389726

BIO
Zachary Ives is the Department Chair and Adani President's Distinguished Professor of Computer and Information Science at the University of Pennsylvania. He is a co-founder of Blackfynn, Inc., a company focused on enabling life sciences research and discovery through data integration. Zack's research interests include data integration and sharing, managing "big data," sensor networks, and data provenance and authoritativeness. He is a recipient of the NSF CAREER award, and an alumnus of the DARPA Computer Science Study Panel and Information Science and Technology advisory panel. He has also been awarded the Christian R. and Mary F. Lindback Foundation Award for Distinguished Teaching. He is a co-author of the textbook Principles of Data Integration, and has received an ICDE 2013 ten-year Most Influential Paper award as well as the 2017 SWSA Ten-Year Award at the International Semantic Web Conference. He has been an Associate Editor for Proceedings of the VLDB Endowment (2014) and a Program Co-Chair (2015) and Group Leader (2018) for the ACM SIGMOD conference.

• (7/22, 9:00am): Nofar Carmeli (Technion): Answering (Unions of) Conjunctive Queries using Random Access and Random-Order Enumeration details)

TITLE
Answering (Unions of) Conjunctive Queries using Random Access and Random-Order Enumeration

ABSTRACT
As data analytics becomes more crucial to digital systems, so grows the importance of characterizing the database queries that admit a more efficient evaluation. We consider the tractability yardstick of answer enumeration with a logarithmic delay after a lineartime preprocessing phase. Such an evaluation is obtained by constructing, in the preprocessing phase, a data structure that supports logarithmic-delay enumeration. In this talk, we seek a structure that supports the more demanding task of a “random permutation”: logarithmic-delay enumeration in truly random order. Enumeration of this kind is required if downstream applications assume that the intermediate results are representative of the whole result set in a statistically valuable manner. An even more demanding task is that of a “random access”: logarithmic-time retrieval of an answer whose position is given. We establish that the free-connex acyclic CQs are tractable in all three senses: enumeration, random-order enumeration, and random access; and in the absence of self-joins, it follows from past results that every other CQ is intractable by each of the three (under some fine-grained complexity assumptions). However, the three yardsticks are separated in the case of a union of CQs (UCQ): while a union of free-connex acyclic CQs has a tractable enumeration, it may (provably) admit no random access. For such UCQs we devise a random-order enumeration whose delay is logarithmic in expectation.

RELATED WORK
PODS 2020: https://dl.acm.org/doi/pdf/10.1145/3375395.3387662

BIO
Nofar Carmeli is a Ph.D. student in the Data and Knowledge group at Technion, Israel Institute of Technology, advised by Prof. Benny Kimelfeld. Her research focuses on query optimization with guarantees using enumeration techniques. Nofar completed her BSc in 2015 in the Lapidim excellence program of the Computer Science Department of Technion.

• (7/24, 12:00pm): Jorge Perez (Universidad de Chile): On the Turing Completeness of Modern Neural Network Architectures details)

TITLE
On the Turing Completeness of Modern Neural Network Architectures

ABSTRACT
Alternatives to recurrent neural networks, in particular, architectures based on attention or convolutions, have been gaining momentum for processing input sequences. In spite of their relevance, the computational properties of these alternatives have not yet been fully explored. We study the computational power of two of the most paradigmatic architectures exemplifying these mechanisms: the Transformer (Vaswani et al., 2017) and the Neural GPU (Kaiser & Sutskever, 2016). We show both models to be Turing complete exclusively based on their capacity to compute and access internal dense representations of the data. In particular, neither the Transformer nor the Neural GPU requires access to an external memory to become Turing complete. Our study also reveals some minimal sets of elements needed to obtain these completeness results.

RELATED WORK
ICLR 2019: https://openreview.net/forum?id=HyGBdo0qFm

BIO
Jorge Pérez is Associate Professor in the Department of Computer Science at Universidad de Chile, and Associate Researcher at the Millennium Institute for Foundational Research on Data. His research interests include data exchange and integration, Web data, and the theory of modern neural network architectures. He has received several awards for his research, including the best paper award in 5 international conferences, the Microsoft Research PhD Fellowship, and the SWSA Ten-Years Award for his work on query languages for Web data. His interests also include the analysis of social, medical and political text data, particularly in Spanish.

• (7/29, 12:00pm): Nima Dehmamy (Kellog @ Northwestern University): Understanding the representation power of graph neural networks in learning graph topology details)

TITLE
Understanding the representation power of graph neural networks in learning graph topology

ABSTRACT
To deepen our understanding of graph neural networks, we investigate the representation power of Graph Convolutional Networks (GCN) through the looking glass of graph moments, a key property of graph topology encoding path of various lengths. We find that GCNs are rather restrictive in learning graph moments. Without careful design, GCNs can fail miserably even with multiple layers and nonlinear activation functions. We analyze theoretically the expressiveness of GCNs, concluding a modular GCN design, using different propagation rules with residual connections could significantly improve the performance of GCN. We demonstrate that such modular designs are capable of distinguishing graphs from different graph generation models for surprisingly small graphs, a notoriously difficult problem in network science. Our investigation suggests that, depth is much more influential than width, with deeper GCNs being more capable of learning higher order graph moments. Additionally, combining GCN modules with different propagation rules is critical to the representation power of GCNs.

RELATED WORK
NeurIPS paper 2019: https://arxiv.org/pdf/1907.05008

BIO
Nima Dehmamy is a research assistant professor at Northwestern Institute for Complex Systems in Kellogg School of Management at Northwestern University, Evanston IL, USA. He is a physicist working on complex systems and theory of machine learning. His research involves AI in graph learning, using physics to understand optimization landscapes and neuroscience. He earned his PhD in physics with Eugene Stanley at Boston University in 2016 and he was a postdoctoral fellow at the Barabasi Lab at Northeastern University, Boston, MA.

• (7/31, 12:00pm): Donghui Zhang (Google): Spanner's SQL evolution details)

TITLE
Spanner's SQL Evolution

ABSTRACT
This talk introduces Google's distributed RDBMS called Spanner. It will cover Spanner's SQL interface, distributed query processing, and some lessons learned, as well as blockwise columnar storage.

Related publications
https://www.usenix.org/system/files/conference/osdi12/osdi12-final-16.pdf

BIO
Donghui Zhang has been working in the database field since he got a Ph.D. from University of California, Riverside in 2002. His first job was an Assistant Professor at Northeastern University. In 2009 he left academia and joined the Microsoft Jim Gray Systems Lab where he researched on the best usage of Flash SSDs inside Microsoft SQL Server, and on lock-free hash table. Between 2012 and 2017 he led the engineering in Paradigm4, a Stonebraker startup building SciDB, a distributed computational database system. After that, he worked in Facebook's data warehouse for two years before joining Google's Spanner team.

• (8/5, 12:00pm): Juan Sequeda (data.world): The Socio-Technical Phenomena of Data Integration details)

TITLE
The Socio-Technical Phenomena of Data Integration

ABSTRACT
Data Integration has been an active area of computer science research for over two decades. A modern manifestation is as Knowledge Graphs which integrates not just data but also knowledge at scale. Tasks such as Domain modeling and Schema/Ontology Matching are fundamental in the data integration process. Research focus has been on studying the data integration phenomena from a technical point of view (algorithms and systems) with the ultimate goal of automating this task.
In the process of applying scientific results to real world enterprise data integration scenarios to design and build Knowledge Graphs, we have experienced numerous obstacles. In this talk, I will share insights about these obstacles. I will argue that we need to think outside of a technical box and further study the phenomena of data integration with a human-centric lens: from a socio-technical point of view.

RELATED WORK
A Pay-as-you-go Methodology to Design and Build Enterprise Knowledge Graphs from Relational Databases, ISWC 2019: https://github.com/juansequeda/papers/raw/master/iswc2019.pdf

BIO
Juan F. Sequeda is the Principal Scientist at data.world. He joined through the acquisition of Capsenta, a company he founded as a spin-off from his research. He holds a PhD in Computer Science from The University of Texas at Austin. Juan is the recipient of the NSF Graduate Research Fellowship, received 2nd Place in the 2013 Semantic Web Challenge for his work on ConstituteProject.org, Best Student Research Paper at the 2014 International Semantic Web Conference and the 2015 Best Transfer and Innovation Project awarded by the Institute for Applied Informatics. Juan is on the Editorial Board of the Journal of Web Semantics, member of multiple program committees (ISWC, ESWC, WWW, AAAI, IJCAI). He was the General Chair of AMW2018, PC chair of ISWC 2017 In-Use track, co-creator of COLD workshop (7 years co-located at ISWC). He has served as a bridge between academia and industry as the current chair of the Property Graph Schema Working Group, member of the Graph Query Languages task force of the Linked Data Benchmark Council (LDBC) and past invited expert member and standards editor at the World Wide Web Consortium (W3C). Wearing his scientific hat, Juan's goal is to reliably create knowledge from inscrutable data. His research interests are on the intersection of Logic and Data for (ontology-based) data integration and semantic/graph data management, and what now is called Knowledge Graphs. Wearing his business hat, Juan is a product manager, does business development and strategy, technical sales and works with customers to understand their problems to translated back to R&D.

• (8/7, 12:00pm): Vikas Garg (MIT): Representation Limits and Generalization of Graph Neural Networks details)

TITLE
Representation Limits and Generalization of Graph Neural Networks

ABSTRACT
Graphs provide a natural abstraction to model relational and strategic data in domains as diverse as biology (e.g., molecules), multiagent settings (e.g., online vendors on ecommerce platforms), and distributed systems (e.g., Internet). Graphs also find much use as theoretical objects (e.g., probabilistic graphical models), and several important algorithms (e.g., max-flow for image segmentation) can be invoked when tasks are formulated in terms of graphs. Graph neural networks are naturally suited for making predictions based on graphs but they remain poorly understood in terms of what they can and cannot do. We analyze whether GNNs can distinguish graphs that differ in properties such as cycles, but have similar local structure. We also provide the first data dependent generalization bounds for message passing GNNs.

The talk will be based on recent work with Stefanie Jegelka and Tommi Jaakkola.

RELATED WORK
ICML 2020: https://people.csail.mit.edu/tommi/papers/GJJ_ICML2020.pdf

BIO
Vikas Garg is a PhD student in Computer Science at MIT. His research interests in machine learning include generative models, graphical models, theory of deep learning, and learning under uncertainty or resource constraints along with their intersections with optimization and game theory.

• (8/14, 12:00pm): Wang-Chiew Tan (Megagon Labs): Unleashing the Power of Subjective Data: Managing Experiences as First-Class Citizens details)

TITLE
Unleashing the Power of Subjective Data: Managing Experiences as First-Class Citizens

ABSTRACT
Subjective data refers to data that contains opinions and experiences. Such data is ubiquitous in product reviews, tweets, and discussion forums in social media. Consumers today spend considerable time sifting through subjective data to make informed decisions about purchases. At Megagon Labs, we are building technologies to synthesize knowledge from subjective data and to facilitate searching over them.
In this talk, I will overview some of the technologies we have developed in the directions of subjective data preparation and integration, experience discovery and search, subjective knowledge representation, and data analytics and exploration.

RELEATED PAPERS
* OpinionDigest: A Simple Framework for Opinion Summarization. https://arxiv.org/pdf/2005.01901.pdf. Yoshihiko Suhara, Xiaolan Wang, Stefanos Angelidis, Wang-Chiew Tan – ACL 2020
* ExtremeReader: An Interactive Explorer For Customizable And Explainable Review Summarization. https://megagon.ai/wp-content/uploads/2020/05/ExtremeReader.pdf. Xiaolan Wang, Yoshihiko Suhara, Natalie Nuno, Yuliang Li, Jinfeng Li, Nofar Carmeli, Stefanos Angelidis, Eser Kindogan, Wang-Chiew Tan – WWW 2020
* Deep Entity Matching with Pre-Trained Language Models. https://arxiv.org/pdf/2004.00584 Yuliang Li, Jinfeng Li, Yoshihiko Suhara, AnHai Doan, Wang-Chiew Tan

BIO
Wang-Chiew leads the research efforts at Megagon Labs with the goal of building advanced technologies to enhance search by experience. Her team conducts research on data integration, information extraction, text mining and summarization, knowledge base construction and commonsense reasoning, and data visualization. Prior to that, she was a Professor of Computer Science at University of California, Santa Cruz. She also spent two years at IBM Research - Almaden. She received her B.Sc. (First Class) in Computer Science from the National University of Singapore and her Ph.D. in Computer Science from the University of Pennsylvania.
https://wangchiew.github.io

• (8/21, 12:00pm): Predrag Radivojac (Khoury): Graph kernels details)

TITLE
Graph kernels

ABSTRACT
We will discuss basics of learning on graphs via graph kernels. We will first cover basics of kernel methods, as well as graph classification and node classification with graph kernels. We will then extend graph kernel methods to edge classification and link prediction problems in an attempt to unify these problems as vertex classification on hypergraphs, a problem we will solve by developing edit-distance hypergraphlet kernels. We will show results on both directed and undirected graphs from social sciences and computational biology, with primary focus on undirected graphs and biological applications.

BIO
Predrag Radivojac is a Professor of Computer Science at Northeastern University, where he recently moved from Indiana University. Prof. Radivojac received his Bachelor's and Master's degrees in Electrical Engineering from the University of Novi Sad and University of Belgrade, Serbia. His Ph.D. degree is in Computer Science from Temple University (2003) under the direction of Prof. Zoran Obradovic and co-direction of Prof. Keith Dunker. In 2004 he held a post-doctoral position in Keith Dunker's lab at Indiana University School of Medicine, after which he joined Indiana University Bloomington. Prof. Radivojac's research is in the areas of computational biology and machine learning with specific interests in protein function, MS/MS proteomics, genome interpretation, and precision health. He received the National Science Foundation (NSF) CAREER Award in 2007 and is an August-Wilhelm Scheer Visiting Professor at Technical University of Munich (TUM) as well as an honorary member of the Institute for Advanced Study at TUM. At Indiana University, he was Associate Chair of the Department of Computer Science and a co-Director of all of Informatics and Data Science for the multi-campus Precision Health Initiative. Prof. Radivojac's projects have been regularly supported by NSF and National Institutes of Health (NIH). He is currently an Editorial Board member for the journal Bioinformatics, Associate Editor for PLoS Computational Biology, and serves his third term (elected) on the Board of Directors of the International Society for Computational Biology (ISCB).

• (8/24, 12:00pm): Shantanu Jain (Khoury): Approaches and Algorithms for Positive Unlabeled Learning details)

TITLE
Approaches and Algorithms for Positive Unlabeled Learning

ABSTRACT
Positive Unlabeled (PU) learning is an extreme case of class imbalance in binary classification, where labeled data from the negative class is not available. Typically, unlabeled examples are used as a substitute for the negatives to train a non-traditional classifier that separates them from the positives. Surprisingly, the score function of a non-traditional classifier provides an optimal ranking of the inputs. However, for many other tasks such as 1) thresholding the score function to predict the hard class labels, 2) estimating the classifiers performance, and 3) obtaining well calibrated scores, treating the unlabeled examples as negatives, leads to significant bias. The bias can be corrected by the knowledge of the class proportion, making class proportion estimation the central task in PU learning. I will describe the theoretical issue of non-uniqueness (unidentifiability) in the estimation of the class proportion and identify key assumptions that make the estimation feasible. I will then present a nonparametric algorithm AlphaMax for practical estimation. Next, I’ll present an empirical risk minimization based criteria for PU classification. Finally, I will have an open ended discussion on deriving label propagation type algorithms for PU learning and extensions of PU learning to DAG structured multilabel classification.

RELATED PAPERS
- Nonparametric semi-supervised learning of class proportions: https://arxiv.org/abs/1601.01944
- Estimating the class prior and posterior from noisy positives and unlabeled data, NIPS 2016: https://papers.nips.cc/paper/6168-estimating-the-class-prior-and-posterior-from-noisy-positives-and-unlabeled-data.pdf
- Recovering True Classifier Performance in Positive-Unlabeled Learning, AAAI 2017: https://aaai.org/ocs/index.php/AAAI/AAAI17/paper/view/14990

BIO
Shantanu Jain is an associate research scientist in the Khoury College of Computer Sciences at Northeastern University. He is interested in the field of statistical modeling and machine learning. His research focuses on developing semi-supervised methods under data constraints for which standard approaches lead to biased estimates. His recent work has addressed issues in binary classification and its evaluation that arise due to the absence of labeled examples from one of the classes (positive-unlabeled learning), incorrectly labeled examples and bias in the labeled examples. His research has been applied to many bioinformatics problems and mass spectrometry data.

### Spring 2020 (Wolfgang)

• (1/24, 12:00pm): Fatemeh Nargesian (University of Rochester): Organizing Data Lakes for Navigation details)

TITLE

ABSTRACT
In this talk, I discuss data-driven discovery in data lakes and introduce the data lake organization problem, where the goal is to find a navigation structure that allows a user to most effectively navigate a data lake. An organization is defined as a graph that contains nodes representing sets of attributes within a data lake and edges indicating subset relationships between nodes. I present a new probabilistic model of how users interact with an organization and an approximate algorithm for efficiently constructing a data lake organization. I discuss quantitative results on both real data lakes containing data from open data portals and on benchmarks that emulate the observed characteristics of real data lakes. I also present the results of a user study showing that 42% of users preferred the use of navigation over keyword search, suggesting these are complementary and both useful modalities for data discovery in data lakes.

BIO
Fatemeh Nargesian is an assistant professor of computer science at the University of Rochester. Her primary research interests are in data discovery and data integration. She received her PhD from the University of Toronto.

• (2/6, 1:35pm): BHS 030: Stratis Ioannidis (Northeastern): Learning from Comparisons details)

TITLE
Learning from Comparisons

ABSTRACT
We consider learning from comparisons, i.e., learning from a dataset containing pairs of samples and labels indicating their relative order. For example, a medical expert presented with a pair of patient files can order the pair w.r.t. the relative severity of a disease. Comparisons are often less noisy than class labels: human labelers disagreeing when generating class judgments often exhibit reduced variability when asked to compare pairs of samples instead. Comparisons are also more informative, as they capture both inter and intra-class relationships; the latter are not revealed via class labels alone. We discuss inference algorithms in this setting, as well as means to identify which comparison labels to collect.

BIO
Stratis Ioannidis is an assistant professor in the Electrical and Computer Engineering Department of Northeastern University, in Boston, MA, where he also holds a courtesy appointment with the College of Computer and Information Science. He received his B.Sc. (2002) in Electrical and Computer Engineering from the National Technical University of Athens, Greece, and his M.Sc. (2004) and Ph.D. (2009) in Computer Science from the University of Toronto, Canada. Prior to joining Northeastern, he was a research scientist at the Technicolor research centers in Paris, France, and Palo Alto, CA, as well as at Yahoo Labs in Sunnyvale, CA. He is the recipient of an NSF CAREER Award, a Google Faculty Research Award, and a best paper award at ACM ICN 2017. His research interests span machine learning, distributed systems, networking, optimization, and privacy.

RELATED PAPERS:
- A Severity Score for Retinopathy of Prematurity Peng Tian, Yuan Guo, Jayashree Kalpathy-Cramer, Susan Ostmo, J. Peter Campbell, Michael F. Chiang, Jennifer Dy, Deniz Erdogmus, and Stratis Ioannidis. *Knowledge Discovery and Data Mining (KDD) *, Anchorage, AK, 2019.
- Classification and Comparison via Neural Networks Peng Tian, Jennifer Dy, Deniz Erdogmus, James Brown, Jayashree Kalpathy-Cramer, Susan Ostmo, J. Peter Campbell, Michael F. Chiang, and Stratis Ioannidis. *Elsevier Journal of Neural Networks *, 2019.
- Experimental Design Under the Bradley-Terry Model Yuan Guo, Peng Tian, Jayashree Kalpathy-Cramer, Susan Ostmo, J. Peter Campbell, Michael F. Chiang, Deniz Erdogmus, Jennifer Dy, and Stratis Ioannidis. *International Joint Conference on Artificial Intelligence (IJCAI) *, Stockholm, Sweden, 2018.
- Accelerated Experimental Design for Pairwise Comparisons Yuan Guo, Jennifer Dy, Deniz Erdogmus, Jayashree Kalpathy-Cramer, Susan Ostmo, J. Peter Campbell, Michael F. Chiang, and Stratis Ioannidis. *SIAM International Conference on Data Mining (SDM) *, Calgary, Alberta, Canada, 2019.

• (2/12, 1:35pm): BHS 030: Alina Oprea (Northeastern): Security analytics: the application of machine learning and artificial intelligence in cyber security details)

• (2/19, 1:35pm): BHS 030: Cristina Nita-Rotaru (Northeastern): Designing (byzantine-)resilient distributed systems details)

Title: Designing (byzantine-)resilient distributed systems

Speaker: Cristina Nita-Rotaru

Abstract: What makes bitcoin work? How do you build a highly-resilient infrastructure for the power-grid? How could a 43 seconds network partition have cascaded into an over a day of service unavailability for github?

In this talk, I will describe several fundamental concepts from distributed systems and show their applications to blockchains, resilient infrastructure, and cloud services. I will highlight the main challenges when building such systems and describe several ongoing research projects in my group.

Bio: Cristina Nita-Rotaru is a Professor of Computer Science in the Khoury College of Computer Sciences at Northeastern University. Prior to joining Northeastern she was a faculty in the Department of Computer Science at Purdue University from 2003 to 2015. Her research lies at the intersection of security, distributed systems, and computer networks. The overarching goal of her work is designing and building secure and resilient distributed systems and network protocols, with assurance that their deployed implementations provide their security, resilience, and performance goals. Her work received several best paper awards in IEEE SafeThings 2019, NDSS 2018, ISSRE 2017, DSN 2015 as well as two IETF/IRTF Applied Networking Research Prize in 2018 and 2016 for her work on QUIC. Cristina Nita-Rotaru is a recipient of the NSF Career Award in 2006.

Network and Distributed Systems Security Laboratory: https://nds2.ccs.neu.edu/

• (2/26, 1:35pm): BHS 030: Predrag Radivojac (Northeastern): Deciphering molecular mechanisms of disease upon mutation via semi-suprvised learning details)

Title: Deciphering molecular mechanisms of disease upon mutation via semi-suprvised learning

Abstract: A major goal in computational biology is the development of algorithms, analysis techniques, and tools towards deep mechanistic understanding of life at a molecular level. In the process, computational biology must take advantage of the new developments in artificial intelligence and machine learning, and then extend beyond pattern analysis to provide testable hypotheses for experimental scientists. This talk will focus on our contributions to this process and relevant related work. We will first discuss the development of machine learning techniques for partially observable domains such as molecular biology; in particular, methods for accurate estimation of frequency of occurrence of hard-to-measure and rare events. We will show some identifiability results in parametric and nonparametric situations as well as how such frequencies can be used to correct estimated model accuracies. We will then show how these methods play key roles in inferring protein cellular roles and phenotypic effects of genomic mutations, with an emphasis on understanding the molecular mechanisms of human genetic disease. We further assessed the value of these methods in the wet lab where we tested the molecular mechanisms behind selected de novo mutations in a cohort of individuals with neurodevelopmental disorders. Finally, we will discuss implications on future research in machine learning, genome interpretation, and precision health.

Bio: Predrag Radivojac joined Northeastern University as a Professor in the Khoury College of Computer Sciences. Prior to joining Northeastern he was a Professor of Computer Science at Indiana University Bloomington and Associate Chair in the Department of Computer Science.

Prof. Radivojac’s primary research interests include computational biology and machine learning. He is motivated to improve our understanding of life at a molecular level and how molecular events affect higher level phenotypes. His group addresses such questions through the development of algorithms and analysis techniques related to the function of biological macromolecules, mass spectrometry proteomics, genome interpretation, and precision health; e.g., he is interested in elucidating the molecular mechanisms of disease consequent to genetic variation. In the area of machine learning, Prof. Radivojac’s research addresses foundational and applied problems in semi-supervised learning, structured-output learning, and active learning, and investigates topics such as kernels and distance functions (e.g., metrics) across data types and analysis techniques. He is also interested in performance evaluation of machine learning algorithms, especially in the hierarchical structured-output domains and cases of selection bias that often arise in the open world setting.

Prof. Radivojac received the National Science Foundation CAREER Award in 2007 and is an August-Wilhelm Scheer Visiting Professor at Technical University of Munich (TUM) as well as an honorary member of the Institute for Advanced Study at TUM. Prof. Radivojac co-directed all of data sciences and informatics within the multi-campus Precision Health Initiative of Indiana University. He is currently an Editorial Board member for the journal Bioinformatics, Associate Editor for PLoS Computational Biology, and serves his third (elected) term on the Board of Directors of the International Society for Computational Biology (ISCB).

• (2/26, 2:50pm): WVH 362: Evimaria Terzi (Boston University): Simple models for optimizing driver earnings in ride-sharing platforms details)

TITLE
Simple models for optimizing driver earnings in ride-sharing platforms

SPEAKER
Evimaria Terzi

ABSTRACT
On-demand ride-hailing platforms like Uber and Lyft are helping reshape urban transportation, by enabling car owners to become drivers for hire with minimal overhead. Such platforms are a multi-sided market and offer a rich space for  studies with socio-economic implications.

In this talk I am going to address two questions:

1. In the absence of coordination, what is the best course of action for a self-interested driver that wants to optimize his earnings?

2. In the presence of coordination, is it possible to maximize social welfare objectives in an environment where the objectives of the participants (drivers, customers and the platform) are (often) misaligned?

We will discuss the computational problems behind these problems and describe simple algorithmic solutions that work extremely well in practice. We will demonstrate the practical strength of our approaches with well-designed experiments on novel datasets we collected from such platforms.

BIO
Evimaria Terzi is an Associate Professor at the Computer Science Department of Boston University.  She joined the department in 2009 after spending two years as a Research Staff Member at the IBM Almaden Research Center. Evimaria got her PhD from the University of Helsinki in 2007.  Her research interests span a big range of data-centered problems. Currently she is interested in problems related to team formation, ride-sharing platforms as well as recommender systems.  He research is funded by multiple NSF grants and she was a recipient of the NSF CAREER award (2013) and Microsoft Faculty Fellowship (2010).

• (2/27, 1:35pm): BHC 030: Jon Ullman (Northeastern): (Preventing) Overfitting in Adaptive Data Analysis details)

Title: (Preventing) Overfitting in Adaptive Data Analysis

Speaker: Jon Ullman

Abstract: How can we use a random sample to draw meaningful conclusions about populations, without overfitting to the sample? This problem remains vexing despite decades of study by statisticians. One cause of overfitting is so-called adaptive data analysis---the common practice where datasets are re-used for multiple analyses, and the outcome of one analysis can influence the choice of future analyses. Adaptivity can arise in a range of ways, including data shared across research groups, multi-stage statistical methods, and data science competitions with shared validation data. Unfortunately, adaptivity invalidates traditional methods for preventing overfitting because it breaks a fundamental assumption about the independence of the data and the method.

In this talk I will introduce a relatively recent approach to understanding and preventing false discovery in adaptive data analysis based on the concept of algorithmic stability. Specifically, I will introduce and motivate the problem of adaptive data analysis, describe a model for studying this problem, and demonstrate how overfitting can occur and, to some extent, be mitigated in this model.

Related blog:  http://blog.mrtz.org/2015/03/09/competition.html

• (3/11, 3:45pm): 362 WVH: Stavros Sintos (Duke University): Efficient algorithms for querying uncertain data details)

TITLE
Efficient Algorithms for Querying Uncertain Data

ABSTRACT
Motivated by applications in sensor network, data cleaning, and data integration, there has been a growing interest in managing uncertain data. Uncertainty is typically captured using stochastic data models, and querying data requires either statistics about the probabilistic behavior of the underlying data, or data cleaning to reduce the uncertainty of the query answer. In the first part of the talk I will present efficient indexes for answering range-max queries on uncertain data. Given a collection of uncertain points in the d-dimensional space, where each point is associated with a value, the goal is to preprocess the points to an index such that given a query rectangle R, compute the expected maximum or the most likely maximum of the values of the points that lie in R. Next, I will present data cleaning problems with applications in Fact-Checking. Given a claim, made by a politician, over uncertain data our goal is to ascertain the quality of the claim, which is measured by a function f. We propose algorithms for choosing a subset of objects to clean with cost at most C to i) minimize the variance of f or ii) maximize the surprise of f. I will conclude by discussing current research and future research directions.

BIO
Stavros Sintos is a Ph.D. candidate in the Department of Computer Science at Duke University under supervision of Prof. Pankaj K. Agarwal. His main research interest is in the design of efficient algorithms for problems in databases, data mining, and computational geometry. In particular, he works on designing practical algorithms and indexes with theoretical guarantees for problems in data summarization and data uncertainty. An important aspect of his research has been on combining geometric optimization with query processing. He is also a recipient of the James B. Duke Fellowship. Before joining Duke he obtained his BS in the Department of Computer Science at University of Ioannina in Greece. For more details please see http://www.cs.duke.edu/~ssintos.

RELATED PAPERS
* Range-Max Queries on Uncertain Data (PODS 2016)
https://dl.acm.org/doi/10.1145/2902251.2902281
*Selecting Data to Clean for Fact Checking: Minimizing Uncertainty vs. Maximizing Surprise (PVLDB 2019)
http://www.vldb.org/pvldb/vol12/p2408-sintos.pdf

• (3/12, 1:35pm): Zoom: Tina Eliassi-Rad (Northeastern): Network science details)

Title: Geometric and Topological Graph Analysis for Machine Learning

Abstract:This talk has two parts: (1) geometric analysis for graph embedding and (2) topological analysis for graph distances. First, graph embedding seeks to build an accurate low-dimensional representation of a graph. This low-dimensional representation is then used for various downstream tasks such as link prediction. One popular approach is Laplacian Eigenmaps, which constructs a graph embedding based on the spectral properties of the Laplacian matrix of a graph. The intuition behind it, and many other embedding techniques, is that the embedding of a graph must respect node similarity: similar nodes must have embeddings that are close to one another. We dispose of this distance-minimization assumption. In its place, we use the Laplacian matrix to find an embedding with geometric properties (instead of spectral ones) by leveraging the simplex geometry of the graph. We introduce Geometric Laplacian Eigenmap Embedding (or GLEE for short) and demonstrate that it outperforms various other techniques (including Laplacian Eigenmaps) in the tasks of graph reconstruction and link prediction. This work is joint with Leo Torres and Kevin Chan, and is forthcoming in the Journal of Complex Networks. Second, measuring graph distance is a fundamental task in graph mining. For graph distance, determining the structural dissimilarity between networks is an ill-defined problem, as there is no canonical way to compare two networks. Indeed, many of the existing approaches for network comparison differ in their heuristics, efficiency, interpretability, and theoretical soundness. Thus, having a notion of distance that is built on theoretically robust first principles and that is interpretable with respect to features ubiquitous in complex networks would allow for a meaningful comparison between different networks. We rely on the theory of the length spectrum function from algebraic topology, and its relationship to the non-backtracking cycles of a graph, in order to introduce the Non-Backtracking Spectral Distance (NBD) for measuring the distance between undirected, unweighted graphs. NBD is interpretable in terms of features of complex networks such as presence of hubs and triangles. We showcase the ability of NBD to discriminate between networks in both real and synthetic data sets. This work is joint with Leo Torres and Suarez-Serrato, and was published in the Journal of Applied Network Science in June 2019.

Bio: Tina Eliassi-Rad is an Associate Professor of Computer Science at Northeastern University in Boston, MA. She is also a core faculty member at Northeastern University's Network Science Institute. Prior to joining Northeastern, Tina was an Associate Professor of Computer Science at Rutgers University; and before that she was a Member of Technical Staff and Principal Investigator at Lawrence Livermore National Laboratory. Tina earned her Ph.D. in Computer Sciences (with a minor in Mathematical Statistics) at the University of Wisconsin-Madison. Her research is rooted in data mining and machine learning; and spans theory, algorithms, and applications of big data from networked representations of physical and social phenomena. She has over 100 peer-reviewed publications (including a few best paper and best paper runner-up awardees); and has given over 200 invited talks and 14 tutorials. Tina's work has been applied to personalized search on the World-Wide Web, statistical indices of large-scale scientific simulation data, fraud detection, mobile ad targeting, cyber situational awareness, and ethics in machine learning. Her algorithms have been incorporated into systems used by the government and industry (e.g., IBM System G Graph Analytics) as well as open-source software (e.g., Stanford Network Analysis Project). In 2017, Tina served as the program co-chair for the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (a.k.a. KDD, which is the premier conference on data mining) and as the program co-chair for the International Conference on Network Science (a.k.a. NetSci, which is the premier conference on network science). In 2020, she is serving as the program co-chair for the International Conference on Computational Social Science (a.k.a. IC2S2, which is the premier conference on computational social science). Tina received an Outstanding Mentor Award from the Office of Science at the US Department of Energy in 2010; and became a Fellow of the ISI Foundation in Turin Italy in 2019.

• (4/1, 1:35pm): Zoom: David Lazer: Fake news on Twitter during the 2016 U.S. presidential election details)

Title: Fake news on Twitter during the 2016 U.S. presidential election

Speaker:  David Lazer

Abstract:The spread of fake news on social media became a public concern in the United States after the 2016 presidential election. We examined exposure to and sharing of fake news by registered voters on Twitter and found that engagement with fake news sources was extremely concentrated. Only 1% of individuals accounted for 80% of fake news source exposures, and 0.1% accounted for nearly 80% of fake news sources shared. Individuals most likely to engage with fake news sources were conservative leaning, older, and highly engaged with political news. A cluster of fake news sources shared overlapping audiences on the extreme right, but for people across the political spectrum, most political news exposure still came from mainstream media outlets.

• (4/3, 12:00pm): Zoom: Yanlei Diao (UMass Amherst): AIDEme -- An active learning based system for interactive exploration of large datasets details)

TITLE
AIDEme: An active learning based system for interactive exploration of large datasets

ABSTRACT
There is an increasing gap between fast growth of data and limited human ability to comprehend data. Consequently, there has been a growing demand for analytics tools that can bridge this gap and help the user retrieve high-value content from data. In this talk, I introduce AIDEme, a scalable interactive data exploration system for efficiently learning a user interest pattern over a large dataset. The system is cast in a principled active learning (AL) framework, which iteratively presents strategically selected records for user labeling, thereby building an increasingly-more-accurate model of the user interest. However, existing AL techniques experience slow convergence when learning the user interest on large datasets. To overcome the problem, AIDEme explores properties of the user labeling process and the class distribution of observed data to design new AL algorithms, which come with provable results on model accuracy and approximation, and have evaluation results showing much improved convergence over existing AL methods while maintaining interactive speed.

BIO
Yanlei Diao is Professor of Computer Science at the University of Massachusetts Amherst, USA, and Ecole Polytechnique, France. Her research interests lie in big data analytics and scalable intelligent information systems, with a focus on interactive data exploration, explainable anomaly detection, optimization in cloud analytics, data streams, and uncertain data management. She received her PhD in Computer Science from the University of California, Berkeley in 2005.

Prof. Diao was a recipient of the 2016 ERC Consolidator Award, 2013 CRA-W Borg Early Career Award (one female computer scientist selected each year for outstanding contributions), IBM Scalable Innovation Faculty Award, and NSF Career Award. She spoke at the Distinguished Faculty Lecture Series at the University of Texas at Austin and Technische Universitaet Darmstadt. She has served as Editor-in-Chief of the ACM SIGMOD Record, Associate Editor of ACM TODS, Chair of the ACM SIGMOD Research Highlight Award Committee, and member of the SIGMOD and PVLDB Executive Committees. She was PC Co-Chair of IEEE ICDE 2017 and ACM SoCC 2016, and served on the organizing committees of SIGMOD, PVLDB, and CIDR, as well as on the program committees of many international conferences and workshops.

RELEVANT PUBLICATIONS
* A Factorized Version Space Algorithm for" Human-In-the-Loop" Data Exploration, ICDM 2019
* AIDEme: An active learning based system for interactive exploration of large datasets, NeurIPS 2019
* Optimization for Active Learning-based Interactive Database Exploration, VLDB 2018

### Summer/Fall 2019 (Wolfgang)

• (5/10, 12:00pm): Predrag Radivojac (Khoury / Northeastern): A new class of metrics for learning on real-valued and structured data details)

TITLE
A new class of metrics for learning on real-valued and structured data

ABSTRACT
We propose a new class of metrics on sets, vectors, and functions that can be used in various stages of data mining, including exploratory data analysis, learning, and result interpretation. These new distance functions unify and generalize some of the popular metrics, such as the Jaccard and bag distances on sets, Manhattan distance on vector spaces, and Marczewski-Steinhaus distance on integrable functions. We prove that the new metrics are complete and show useful relationships with $f$-divergences for probability distributions. To further extend our approach to structured objects such as ontologies, we introduce information-theoretic metrics on directed acyclic graphs drawn according to a fixed probability distribution. We conduct empirical investigation to demonstrate the effectiveness on real-valued, high-dimensional, and structured data. Overall, the new metrics compare favorably to multiple similarity and dissimilarity functions traditionally used in data mining, including the Minkowski ($L^p$) family, the fractional $L^p$ family, two $f$-divergences, cosine distance, and two correlation coefficients. We provide evidence that they are particularly appropriate for rapid processing of high-dimensional and structured data in distance-based learning.

WEB PAGE

RELATED PAPER
https://doi.org/10.1007/s10618-019-00622-6

• (5/15, 12:00pm): Leilani Battle (University of Maryland): Behavior-Driven Optimizations for Big Data Exploration details)

TITLE
Behavior-Driven Optimizations for Big Data Exploration

ABSTRACT
With massive amounts of data coming from field sensors, sequencers, mobile devices, etc., data-driven decision making has become increasingly important in industry, government, and the sciences. One of the key issues for analysts and scientists who work with large datasets is efficient visualization of their data to extract patterns, observe anomalies, and debug their analysis workflows. Though a variety of visualization tools exist to help people make sense of their data, these tools often rely on database management systems (or DBMSs) for data processing and storage; and unfortunately, DBMSs fail to process the data fast enough to support a fluid, interactive visualization experience.

My work blends optimization techniques from databases and methodology from HCI and visualization in order to support interactive exploration of large datasets. In this talk, I will first discuss Sculpin, a visual exploration system that learns user exploration patterns automatically, and exploits these patterns to pre-fetch data ahead of users as they explore. I will then discuss ongoing work to better understand how analysts explore data in more sophisticated analysis systems, such as Tableau Desktop. Finally, I will report on ongoing efforts to standardize the way we evaluate visual data analysis systems in general.

RELATED PAPERS
SIGMOD 2016: http://www.cs.umd.edu/~leilani/static/papers/forecache_cr_sigmod2016.pdf
HILDA 2018: http://www.cs.umd.edu/~leilani/static/papers/battle_visualization-centered-evaluation_hilda_2018.pdf

BIO
Leilani Battle is an Assistant Professor at the University of Maryland, College Park, with a joint appointment in the University of Maryland Institute for Advanced Computer Studies (UMIACS). She is also affiliated with the UMD Human-Computer Interaction Laboratory (HCIL). Her research interests focus on developing interactive data-intensive systems that can aid analysts in performing complex data exploration and analysis. Her current research is anchored in the field of databases, but utilizes research methodology and techniques from HCI and visualization to integrate data processing (databases) with interactive interfaces (HCI, visualization). She is an NSF Graduate Research Fellowship Recipient (2012), and her research is currently supported by an NSF CISE CRII Award (2019). In 2017, she completed a postdoc in the UW Interactive Data Lab. She holds an MS (2013) and PhD (2017) in Computer Science from MIT, where she was a member of the MIT Database Group, and a BS in Computer Engineering from UW (2011), where she was a member of the UW database group.

• (5/31, 12:00pm): Raymond Chi-Wing Wong (The Hong Kong University of Science and Technology): FindYourFavorite: An Interactive System for Finding the User's Favorite Tuple in the Database details)

TITLE
FindYourFavorite: An Interactive System for Finding the User's Favorite Tuple in the Database

ABSTRACT
When faced with a database containing millions of tuples, an end user might be only interested in finding his/her favorite tuple in the database. In this talk, we study how to help an end user to find such a favorite tuple with a few user interations. In each interaction, a user is presented with a small number of tuples (which can be artificial tuples outside the database or true tuples inside the database) and s/he is asked to indicate the tuple s/he favors the most among them. Different from the previous work which displays artificial tuples to users during the interaction and requires heavy user interactions, we achieve a stronger result. Specifically, we use a concept, called the utility hyperplane, to model the user preference and an effective pruning strategy to locate the favorite tuple for a user in the whole database. Based on these techniques, we developed an interactive system, called FindYourFavorite, and demonstrate that the system could identify the favorite tuple for a user with a few user interactions by always displaying true tuples in the database.

RELATED PAPER:
SIGMOD 2018: http://home.cse.ust.hk/~raywong/paper/sigmod18-regret.pdf

BIOGRAPHY
Raymond Chi-Wing Wong is an Associate Professor in Computer Science and Engineering (CSE) of The Hong Kong University of Science and Technology (HKUST). He is currently the director of the Risk Management and Business Intelligence (RMBI) program. He was the director of the Computer Engineering (CPEG) program from 2014 to 2016 and was the associate director of the Computer Engineering (CPEG) program from 2012 to 2014. He received the BSc, MPhil and PhD degrees in Computer Science and Engineering in the Chinese University of Hong Kong (CUHK) in 2002, 2004 and 2008, respectively. In 2004-2005, he worked as a research and development assistant under an R&D project funded by ITF and a local industrial company called Lifewood. From May 2006 to Aug 2006, he was a visiting student of Prof. Jian Pei and Prof. Ke Wang, at Simon Fraser University in Canada. From Aug 2007 to Sept 2007, he visited IBM T.J. Watson Research Center as a summer intern under the supervision of Prof. Philip S. Yu. From Jun 2008 to Jul 2008, he visited Prof. Tamer Ozsu at University of Waterloo as a visiting scholar. Some of his collaborators are Prof. Ada Fu (2003-2013), Prof. Ke Wang (2003-2013), Prof. Philip S. Yu (2009-2013), Prof. Jian Pei (2006-2011), Prof. Tamer Ozsu (2009-2011), Prof. Jiuyong Li (2006-2011), Prof. Yufei Tao (2007-2009), Prof. Ihab Ilyas (2009), Prof. Jeffrey Yu (2009),Prof. Lei Chen (2008) and Prof. Eamonn Keogh (2008). He received 28 awards. He published 69 conference papers (e.g., SIGMOD, SIGKDD, VLDB, ICDE and ICDM), 31 journal/chapter papers (e.g., TODS, DAMI, TKDE, VLDB journal and TKDD) and 1 book. He reviewed papers from conferences and journals related to data mining and database, including VLDB conference, SIGMOD, TODS, VLDB Journal, TKDE, TKDD, ICDE, SIGKDD, ICDM, DAMI, DaWaK, PAKDD, EDBT and IJDWM. He is a program committee member of conferences, including SIGMOD, VLDB, ICDE, KDD, ICDM and SDM, and a referee of journals, including TODS, VLDBJ, TKDE, TKDD, DAMI and KAIS.

• (6/21, 12:00pm): Wolfgang Gatterbauer (Khoury): Anytime Approximation in Probabilistic Databases via Scaled Dissociations details)

TITLE
Anytime Approximation in Probabilistic Databases via Scaled Dissociations

ABSTRACT
Speeding up probabilistic inference remains a key challenge in probabilistic databases (PDBs) and the related area of statistical relational learning (SRL). Since computing probabilities for query answers is #P-hard, even for fairly simple conjunctive queries, both the PDB and SRL communities have proposed a number of approximation techniques over the years. The two prevalent techniques are either (i) MCMCstyle sampling or (ii) branch-and-bound (B&B) algorithms that iteratively improve model-based bounds using a combination of variable substitution and elimination.
We propose a new anytime B&B approximation scheme that encompasses all prior model-based approximation schemes proposed in the PDB and SRL literature. Our approach relies on the novel idea of “scaled dissociation” which can improve both the upper and lower bounds of existing model-based algorithms. We apply our approach to the well-studied problem of evaluating self-join-free conjunctive queries over tuple-independent PDBs, and show a consistent reduction in approximation error in our experiments on TPC-H, Yago3, and a synthetic benchmark setting.

Based on joint work with Maarten Van den Heuvel, Peter Ivanov, Floris Geerts, and Martin Theobald.

RELATED PAPER:

• (9/20, 12:00pm): Oktie Hassanzadeh (IBM): Knowledge Graphs for Data Curation & Data Science details)

TITLE
Knowledge Graphs for Data Curation & Data Science

ABSTRACT
In this informal talk, I will go over a number of projects at IBM Research that leverage knowledge graphs for large-scale data exploration, data curation, and data science. In particular, I will describe an application of knowledge graphs for large-scale data exploration and curation in the context of the IBM Research Helix project, where large-scale similarity analysis and automated linkage discovery enable the discovery and construction of data sets from a large number of data sets (or a data lake) for a given analytics task. I will then describe our ongoing project and recent progress on using knowledge graphs for automating data science. I will discuss a number of ways knowledge graphs can assist with feature engineering and the challenges we faced in implementing a prototype solution. If time permits, I will briefly go over some of our recent work on knowledge extraction from text.

RELATED WORK
* Automated Feature Enhancement for Predictive Modeling using External Knowledge, ICDM 2019 (Demo Track, To Appear)
* Exploring Big Data with Helix: Finding Needles in a Big Haystack. SIGMOD Record 43(4): 43-54 (2014) https://doi.org/10.1145/2737817.2737829
* Instance-Based Matching of Large Ontologies Using Locality-Sensitive Hashing. International Semantic Web Conference (1) 2012: 49-64 https://doi.org/10.1007/978-3-642-35176-1_4
* Inducing Implicit Relations from Text Using Distantly Supervised Deep Nets. International Semantic Web Conference (1) 2018: 38-55 https://doi.org/10.1007/978-3-030-00671-6_3 (ISWC 2017 Semantic Web Challenge Winning Solution, see blog post here: https://www.ibm.com/blogs/research/2017/11/knowledge-base-construction-iswc-2017/)
* Answering Binary Causal Questions Through Large-Scale Text Mining: An Evaluation Using Cause-Effect Pairs from Human Experts. IJCAI 2019: 5003-5009 https://doi.org/10.24963/ijcai.2019/695

BIOGRAPHY

• (9/27, 12:00pm): Boris Glavic (Illinois Institute of Technology): Relevance-based Data Management details)

TITLE
Relevance-based Data Management

ABSTRACT
In this talk, I give an overview of my group's work on relevance-based data management, a novel data management paradigm centered around the idea of analyzing what data is needed to answer queries. For the most part, the talk will focus on one particular approach which we refer to as provenance-based data skipping. Since the inception of relational databases, index structures and other physical design techniques have been used to speed-up access to data. In spite of the wide variety of mature techniques that are available, there are important classes of selective queries, e.g., top-k queries, which cannot benefit from these techniques. This issue arises because database systems statically analyze incoming queries to determine which data is needed for answering which query. However, for many queries it is impossible to determine up-front what data is relevant. Provenance-based data skipping overcomes this limitation by generating provenance sketches which are concise encodings of what data is relevant for a query. Similar to query answering with views, a provenance sketch captured for one query is used to speed up the subsequent evaluation of other queries. In contrast to materialized views, provenance sketches are guaranteed to be small (100s of bytes) and can benefit from physical database design artifacts such as indexes, horizontal partitioning of tables, and zone maps. Furthermore, I will highlight open problems in provenance-based data skipping and discuss additional applications of relevance-based data management techniques in general.

RELATED PAPER:
IEEE Data Engineering Bulletin: GProM - A Swiss Army Knife for Your Provenance Needs IEEE Data Engineering Bulletin http://sites.computer.org/debull/A18mar/p51.pdf

BIOGRAPHY
Boris Glavic is an Associate Professor of Computer Science at the Illinois Institute of Technology where he leads the IIT DBGroup (http://www.cs.iit.edu/~dbgroup/ ). Before coming to IIT, Boris spend to two years as a PostDoc at the University of Toronto working with Renée J. Miller. He received his PhD from the University of Zurich in Switzerland being advised by Michael Böhlen and Gustavo Alonso. Boris is a database researcher building comprehensive systems based on solid theoretical foundations. His main research interests are provenance, data integration and cleaning, and query processing and optimization.

• (10/11, 12:00pm): Owen O'Malley (Cloudera): Big Data's journey to ACID details)

TITLE
Big Data's journey to ACID

ABSTRACT

BIOGRAPHY
Owen O’Malley was a founder and technical fellow at Hortonworks. Hortonworks was founded in 2011 with 25 employees spinning out from Yahoo, went public in 2014, and grew to 1,400 employees before merging with Cloudera in 2019. Cloudera’s platform includes Hadoop and the large ecosystem of big data open source tools that enterprises need for their data analytics. Owen has been working on Hadoop since the beginning of 2006, was the first committer added to the project, and used Hadoop to set the Gray sort benchmark in 2008 and 2009. Over the years, he was the technical lead for MapReduce, Hadoop Security, ORC file, and adding ACID transactions to Hive.

• (10/11, 3:00pm): Nan Tang (Qatar Computing Research Institute, QCRI): Data Preparation meets Data Cleaning details)

TITLE
Data Preparation meets Data Cleaning

ABSTRACT
In this talk, I will first provide a brief overview of my research in data preparation, including data discovery, data cleaning and integration, and some systems (such as Data Civilizer, a project with MIT, and Columbus, a project with Wisconsin) to put different pieces together for data preparation. Afterwards, I will discuss why data visualization is important for data preparation, and what interesting research topics can happen when putting these two together, based on which I will present three works: (1) DeepEye: a visualization recommendation system, which was motivated by how to help users easily understand the discovered datasets in data preparation; (2) Interactive data cleaning for progressive visualization, which is to cheaply solve the problem (by 10-15 user interactions) that data visualization is bad if the data is very dirty; and (3) Cymphony: a generic system for collaborative task management, motivated by the practical scenarios that for many tasks (such as manual data cleaning, manually checking fake news), a team of users need to work collaboratively.

RELATED PAPERS:
ICDE 2018: http://da.qcri.org/ntang/pubs/icde2018deepeye.pdf
EDBT 2018 vision paper: http://da.qcri.org/ntang/pubs/edbt2018.pdf
SIGMOD 2018 demo: http://da.qcri.org/ntang/pubs/deepeyesigmod2018demo.pdf
ICDE 2020 paper “Interactive cleaning for progressive visualization through composite questions”
http://da.qcri.org/ntang/publications.html

BIOGRAPHY
Dr. Nan Tang is a senior scientist at Qatar Computing Research Institute (QCRI), HBKU, Doha, Qatar. Prior to joining QCRI, he has worked at Univeristy of Edinburgh (2010-2011) and CWI (2008-2010). He got his PhD. degree from The Chinese University of Hong Kong (2007). His research focuses on data preparation including data cleaning, entity matching, and building systems to support these tasks, and data visualization, especially on visualization recommendation and cleaning. He serves the community in many PCs, such as SIGMOD, PVLDB, ICDE, KDD. You can find more about me from my homepage: http://da.qcri.org/ntang/index.html

• (12/4, 12:00pm): Wolfgang Gatterbauer (Khoury): Scalable Compatibility Estimation from Extremely Sparse Graphs details)

TITLE
Scalable Compatibility Estimation from Extremely Sparse Graphs

ABSTRACT
Node classification is an important problem of graph management and Belief Propagation (BP) methods, along with its linearization, have been widely used for graphs with arbitrary compatibilities between classes. These methods rely on a prior specified compatibility matrix between classes, which is commonly given by domain experts or heuristics. Thus an important unresolved questions is: Can we do better in deriving compatibilities from the actual data in a principled and scalable way? We address this problem by indeed estimating the compatibilities on graphs with very sparse labels (e.g., 1 in 10,000 nodes is labeled). We propose methods to perform this estimation in a fraction of time it later takes to label the remaining nodes, which makes it a cheap preprocessing step for any existing label propagation method. Our approach consists of two steps: first, we create multiple consistent and compact graph statistics; and second, perform estimation on these far smaller representations. We show that the labeling accuracy of our proposed estimator is comparable to using the "correct" compatibilities and that our estimator is orders of magnitude faster than standard approaches based on train-test sets.

• (12/9, 12:00pm): Oliver Kennedy (U of Buffalo): Safe, Reusable Heuristic Data Transformation through Caveats details)

TITLE
Safe, Reusable Heuristic Data Transformation through Caveats

ABSTRACT
Through extensive surveys of data scientists in the field, we've identified numerous data transformation challenges, including time-series alignment, outlier management, entity resolution, parsing, and more. In each use case, a correct, fully general solution would have ben worthy of a PhD Thesis. By contrast, each of these use cases admitted a simple heuristic tailored to the specific dataset and analysis being performed, which could be scripted or applied manually in hours or even minutes. Although full generality is clearly not worth the price, dataset- and analysis- specific heuristics make it difficult to re-use the transformed dataset or transformation workflow.

In this talk, I introduce a form of lightweight data annotation called Caveats that make it easier to re-use both heuristically transformed datasets and heuristic transformation workflows. Caveats combine "where" provenance with techniques from incomplete data management to mark specific data values and records as dependent on a data transformation heuristic --- marked data elements are subject to change if the heuristic is altered. I will show how caveats can be seamlessly integrated into heuristic transformation workflows, introduce our lightweight scheme for propagating caveats through SQL queries, and demonstrate our prototype notebook, Vizier, built around Caveats.

RELATED PAPERS
https://odin.cse.buffalo.edu/papers/2015/VLDB-lenses-final.pdf
https://odin.cse.buffalo.edu/papers/2019/submitted/CIDR-CrumbyNotebooks.pdf (to appear)
https://vizierdb.info/ (try it out!)

BIOGRAPHY
Oliver Kennedy is an associate professor at the University at Buffalo. He earned his PhD from Cornell University in 2011 and now leads the Online Data Interactions (ODIn) lab, which operates at the intersection of databases and programming languages. Oliver is the recipient of an NSF CAREER award, UB's Exceptional Scholar Award, and the UB SEAS Early Career Teacher of the Year Award. Several of his papers have been invited to "Best of" compilations from SIGMOD and VLDB. The ODIn lab is currently exploring uncertain data management, just-in-time data structure design, and "small data" management, as well as implementing VizierDB, a reproducibility- and reusability-focused notebook.

### Spring 2019 (Wolfgang)

• (1/9, 12:00pm): Kathleen Fisher (Tufts): Using a Declarative Description Language to Tame Ad Hoc Data: An Overview of the PADS project details)

TITLE
Using a Declarative Description Language to Tame Ad Hoc Data: An Overview of the PADS project

SPEAKER
Kathleen Fisher (Tufts University)

ABSTRACT
The goal of the PADS project (http://www.padsproj.org) is to make it easier for data analysts to extract useful information from ad hoc data files. This talk gives an overview of the project and how it helps bridge the gap between the unmanaged world of ad hoc data and the managed world of typed programming languages and databases. In particular, the paper reviews the design of PADS data description languages, describes the generated parsing tools and discusses the importance of meta-data. It also sketches the formal semantics, discusses useful tools and how can they can be generated automatically from PADS descriptions, and describes an inferencing system that can learn useful PADS descriptions from positive examples of the data format.

BIO
Kathleen Fisher is a Professor in and the Chair of the Computer Science Department at Tufts. Previously, she was a program manager at DARPA where she started and managed the HACMS and PPAML programs, a Consulting Faculty Member in the Computer Science Department at Stanford University, and a Principal Member of the Technical Staff at AT&T Labs Research. Kathleen is an ACM Fellow. She has served as Program Chair for PLDI, OOPSLA, ICFP, CUFP, and FOOL, and as General Chair for ICFP 2015. She is a former Associate Editor for TOPLAS and a former editor of the Journal of Functional Programming. Kathleen is a past Chair of the ACM Special Interest Group in Programming Languages (SIGPLAN) and past Co-Chair of CRA's Committee on the Status of Women (CRA-W). Kathleen is a recipient of the SIGPLAN Distinguished Service Award. She is Vice Chair of DARPA's ISAT Study Group and a member of the Board of Trustees of Harvey Mudd College.

RELATED PAPERS
Kathleen Fisher, David Walker: The PADS project: an overview. ICDT 2011: 11-17
https://dl.acm.org/citation.cfm?doid=1938551.1938556

• (1/11, 12:00pm): Xiaoyu Liu (MIT): Kyrix: Interactive Visual Data Exploration at Scale details)

TITLE
Kyrix: Interactive Visual Data Exploration at Scale

ABSTRACT
Scalable interactive visual data exploration is crucial in many domains due to increasingly large datasets generated at rapid rates. Details-on-demand provides a useful interaction paradigm for exploring large datasets, where the user starts at an overview, finds regions of interest, zooms in to see detailed views, zooms out and then repeats. This paradigm is the primary user interaction mode of widely used systems such as Google Maps, Aperture Tiles and ForeCache. These earlier systems, however, are highly customized with hardcoded visual representations and optimizations. A more general framework is needed to facilitate the development of visual data exploration systems at scale. We present Kyrix, an end-to-end system for developing scalable details-on-demand data exploration applications. Kyrix provides the developer with a declarative model for easy specification of general visualizations. Behind the scenes, Kyrix utilizes a suite of performance optimization techniques to achieve a response time within 500 ms for various user interactions. We also report results from a performance study which shows that a novel dynamic fetching scheme adopted by Kyrix outperforms tile-based fetching used in traditional systems.

BIO
Xiaoyu is a research intern at MIT data system group. Before that, she completed her master at Purdue, ECE department. She is interested in building scalable and reliable data-intensive systems.

RELATED WORK
http://cidrdb.org/cidr2019/papers/p70-tao-cidr19.pdf
http://dsail.csail.mit.edu/index.php/kyrix/

• (1/23, 12:00pm): Panayiotis Tsaparas (University of Ioannina): Finding Patterns in Temporal and Flow Networks details)

TITLE
Finding Patterns in Temporal and Flow Networks

ABSTRACT
Networks are natural models for complex systems consisting of multiple interconnected entities. Many of these systems are dynamic, with connections appearing and disappearing over time. In the resulting networks, edges are annotated with time stamps, resulting in a graph history consisting of multiple network snapshots over time. We refer to such networks as temporal networks (graphs).

In this talk we consider two knowledge extraction problems on temporal networks. First, we consider the problem of finding lasting dense subgraphs. We provide formal definitions of graph density for a temporal graph, and we formulate the BFF problem that seeks a subset of nodes that are densely connected over time. Furthermore, we also consider the problem of finding a set of graph snapshots in which there are dense subsets of nodes. We study the complexity of the problems and propose exact, approximation and heuristic algorithms. Experiments indicate that our approach can find interesting patterns in collaboration and word co-occurrence networks.

In the second problem we assume that edges are also associated with a value, indicating a flow between the two nodes at a specific time stamp. Such networks appear naturally in practice in the case of money exchange networks (e.g., bitcoin), or traffic networks. We are interested in finding motifs in these networks, that is, small subgraphs that appear more often than random. Our motifs are restricted temporally and with respect to flow: all interactions must happen within a given time-window, and they must involve a minimum amount of flow. We propose efficient algorithms for finding such motifs and experiment with them on real networks.

RELATED PAPERS
Konstantinos Semertzidis, Evaggelia Pitoura, Evimaria Terzi, Panayiotis Tsaparas. Finding lasting dense subgraphs. ECML/PKDD Journal Track (DAMI), 2019 (to appear)
Chrysanthi Kosyfaki, Nikos Mamoulis, Evaggelia Pitoura, Panayiotis Tsaparas. Flow motifs in interaction networks. EDBT 2019 (to appear).
https://arxiv.org/abs/1810.08408

BIO
Panayiotis Tsaparas completed his undergraduate studies at the Department of Computer Science at University of Crete, Greece in 1995. He continued his graduate studies at University of Toronto, where he received his M.Sc., and Ph.D. degree, under the supervision of Allan Borodin. After graduation, he worked as a post-doctoral fellow at University of Rome, “La Sapienza”, and at University of Helsinki, and as a researcher at Microsoft Research. Since 2011 he joined the Department of Computer Science and Engineering at University of Ioannina, where he is now an Associate Professor. His research interests include Social Network Analysis, Algorithmic Data Mining, Web Mining and Information Retrieval.
Web: http://www.cs.uoi.gr/~tsap/

• (1/24, 12:00am): Northeast Database Day 2019 details)

• (1/25, 1:15pm): Fatemeh Nargesian (University of Toronto): Table Union and Navigation details)

TITLE

ABSTRACT
Preparing data for advanced analytics is prohibitively time-consuming and expensive for all but the best-trained and best-funded engineers. Nevertheless, the success of trained systems often depends on data and generated features more than powerful statistical algorithms. Among the data management challenges that need to be addressed is data enrichment which is the discovery and integration of meaningful data in data lakes for a data science task. However, there are many challenges to overcome: (1) the sheer size of data, (2) the unique distributions and characteristics of data lakes, and (3) the probabilistic and human-in-the-loop nature of data discovery. In this talk, I discuss two prevalent data discovery scenarios. In the first scenario, the query is a dataset and the data scientist is interested in interactively finding datasets that can be integrated (e.g unioned) with the query. I will introduce a probabilistic framework for finding and aligning unionable tables with a query table and discuss the need for distribution-aware techniques for data discovery. In the second scenario, search does not start with a query, instead, it is data-driven. I will talk about data lake organization problem where the goal is to find an organization (a directed acyclic graph) that allows a user to most efficiently navigate data lakes. I will present a probabilistic navigation model of how users interact with an organization and introduce a scalable structure learning algorithm for optimizing data lake organizations.

RELATED PAPER
Nargesian, Zhu, Pu, Miller: Table Union Search on Open Data, PVLDB 2018.
http://www.vldb.org/pvldb/vol11/p813-nargesian.pdf

BIO
Fatemeh Nargesian is a PhD student in the Data Curation Group of the Department of Computer at University of Toronto. Her research focuses on optimizing and automating data preparation for end-to-end data science, encompassing dataset discovery in data lakes and enriching datasets with new features. While at University of Toronto, Fatemeh was a joint Research intern at IBM Research-NY. Prior to University of Toronto, Fatemeh worked at the Clinical Informatics research group at McGill University on clinical data management, and received M.Sc. degrees in computer science and artificial intelligence at University of Toronto and Sharif University of Technology.

• (1/28, 12:00pm): Stefano Ceri (Politecnico di Milano): WVH 366: Data-Driven Genomic Computing: Making Sense of the Signals from the Genome details)

TITLE
Data-Driven Genomic Computing: Making Sense of the Signals from the Genome

ABSTRACT
Genomic computing is a new science focused on understanding the functioning of the genome, as a premise to fundamental discoveries in biology and medicine. Next Generation Sequencing (NGS) allows the production of the entire human genome sequence at a cost of about 1000 US \$; many algorithms exist for the extraction of genome features, or "signals", including peaks (enriched regions), variants, or gene expression (intensity of transcription activity). The missing gap is a system supporting data integration and exploration, giving a “biological meaning” to all the available information; such a system can be used, e.g., for better understanding cancer or how environment influences cancer development.

The GeCo Project (Data-Driven Genomic Computing, ERC Advanced Grant, 2016-2021) has the objective or revisiting genomic computing through the lens of basic data management, through models, languages, and instruments, focusing on genomic data integration. Starting from an abstract model, we developed a system that can be used to query processed data produced by several large Genomic Consortia, including Encode and TCGA; the system employs internally the Spark engine, and prototypes can already be accessed from Polimi, from Cineca (Italian supercomputing center) and from the Broad Institute in Cambridge. During the five-years of the ERC project, the system will be enriched with data analysis tools and environments and will be made increasingly efficient. Among the objectives of the project, the creation of an “open source” repository of public data, available to biological and clinical research through queries, web services and search interfaces.

BIO
Stefano Ceri is professor of Database Systems at the Dipartimento di Elettronica, Informazione e Bioingegneria (DEIB) of Politecnico di Milano. His research work covers four decades (1978-2018) and has been generally concerned with extending database technologies in order to incorporate new features: distribution, object-orientation, rules, streaming data; with the advent of the Web, his research has been targeted towards the engineering of Web-based applications and to search systems. More recently he turned to genomic computing. He authored over 350 publications (H-index 75) and authored or edited 15 books in English. He is the recipient of two ERC Advanced Grants: "Search Computing (SeCo)" (2008-2013), focused upon the rank-aware integration of search engines in order to support multi-domain queries and “Data-Centered Genomic Computing (GeCo)” (2016-2021), focused upon new abstractions for querying and integrating genomic datasets. He is the recipient of the ACM-SIGMOD "Edward T. Codd Innovation Award" (New York, June 26, 2013), an ACM Fellow and a member of Academia Europaea.

RELATED PUBLICATION
http://www.bioinformatics.deib.polimi.it/geco/publications/Bioinformatics_2018.pdf

• (1/28, 12:30pm): Marco Brambilla (Politecnico di Milano): Extraction of Evolving Knowledge from Social Media details)

TITLE
Extraction of Evolving Knowledge from Social Media

ABSTRACT
Knowledge in the world continuously evolves. Ontologies that aim at formalizing this knowledge are largely incomplete, especially regarding data belonging to the so-called long tail. On the other side, informal sources such has social media are typically very up to date with respect to facts, events and relations between real-world entities. We propose a method for discovering emerging knowledge by extracting it from social content. Once initialized by domain experts, the method is capable of finding relevant entities by means of a mixed syntactic-semantic method. The method uses seeds, i.e. prototypes of emerging entities provided by experts, for generating candidates; then, it associates candidates to feature vectors built by using terms occurring in their social content and ranks the candidates by using their distance from the centroid of seeds, returning the top candidates. Our method can run iteratively, using the results as new seeds. The talk will describe the different extraction techniques, the advantages obtained by combining them, and the results of the experiments performed with the different methods.

RELATED PAPERS
Extracting Emerging Knowledge from Social Media. WWW 2017 https://dl.acm.org/citation.cfm?id=3052697
Iterative Knowledge Extraction from Social Networks. WWW Comp. 2018 https://dl.acm.org/citation.cfm?id=3191578

BIO
Marco Brambilla is associate professor at Politecnico di Milano. His research interests include data science, domain specific modeling languages and design patterns, crowdsourcing, social media monitoring, and big data analysis. He has been visiting researcher at CISCO, San Josè, and University of California, San Diego. He has been visiting professor at Dauphine University, Paris. He is co-founder of the startups Fluxedo, focusing on social media analysis and Social engagement, and WebRatio, devoted to software modeling tools for Web, Mobile and Business Process based software applications. He is author of various international books and research articles in journals and conferences, with over 200 papers. He was awarded various best paper prizes and gave keynotes and speeches at many conferences and organisations. He runs research projects on data science and industrial projects on data-driven innovation and big data. He is the main author of the OMG standard IFML.

• (1/30, 12:00pm): Evimaria Terzi (BU): Active Matrix Completion Problems details)

TITLE
Active Matrix Completion Problems

ABSTRACT
In many applications, e.g., recommender systems and traffic monitoring, the data comes in the form of a matrix that is only partially observed. A fundamental data-analysis task for these datasets is matrix completion, where the goal is to accurately infer the entries missing from the matrix. In this talk we will consider matrices that satisfy one of the two — very common — assumptions: low rank and positive definite. In both cases we will consider the active version of the matrix-completion problem that says: given access to an oracle that can obtain some (small) number of entries of the original matrix which entries shall we query so that we achieve the least completion error. We will demonstrate how this question can be transformed into elegant combinatorial problems and we will discuss algorithms for solving them. We will also argue that our combinatorial formulations are very different from existing work in the matrix-completion literature.

TALK BASED ON TWO PAPERS:
Charalampos Mavroforakis, Dóra Erdös, Mark Crovella, Evimaria Terzi: Active Positive-Definite Matrix Completion. SDM 2017.
Natali Ruchansky, Mark Crovella, Evimaria Terzi: Matrix Completion with Queries. KDD 2015.

BIO
Evimaria Terzi is an associate professor at the Computer Science Department at Boston University, where she also serves as an Associate Chair. Before joining BU in 2009, she was a research scientist at IBM Almaden Research Center. Evimaria has received her Ph.D. from University of Helsinki, Finland and her MSc from Purdue University. Evimaria is a recipient of the Microsoft Faculty Fellowship (2010) and NSF CAREER award and multiple NSF awards. Her research interests span a wide range of data-mining topics including algorithmic problems arising in recommendation systems, online social networks and social media.

• (2/1, 12:00pm): Huy Ngyuen (Northeastern): Submodular Maximization with Nearly-optimal Approximation and Adaptivity details)

TITLE
Submodular Maximization with Nearly-optimal Approximation and Adaptivity

ABSTRACT
In this talk, we present recent progress on understanding the tradeoff between the approximation guarantee and adaptivity for submodular maximization. The adaptivity of an algorithm is the number of sequential rounds of queries it makes to the evaluation oracle of the function, where in every round the algorithm is allowed to make polynomially-many parallel queries. Adaptivity is an important consideration in settings where the objective function is estimated using samples and in applications where adaptivity is the main running time bottleneck. We present nearly-optimal algorithms for submodular maximization subject to a variety of constraints.

RELATED PAPERS
https://arxiv.org/abs/1804.05379
https://arxiv.org/abs/1808.09987
https://arxiv.org/abs/1812.01591

BIO
Huy Lê Nguyen is an Assistant Professor of Computer Science in the College of Computer and Information Science at Northeastern University. Prior to joining Northeastern, he was a Research Assistant Professor at the Toyota Technological Institute in Chicago and before that, a Google Research Fellow at the Simons Institute at University of California, Berkeley. He received his PhD in Computer Science from Princeton University. Professor Nguyen is broadly interested in the design and analysis of algorithms, with an emphasis on algorithmic techniques for massive data sets and machine learning.

• (2/13, 12:00pm): Sarah Ostadabbas (Northeastern): Physics-based Simulation to Bootstrap Learning in Small Data Domains details)

TITLE
Physics-based Simulation to Bootstrap Learning in Small Data Domains

ABSTRACT
Deep learning (DL) regularly obliterate records for regression and classification tasks that have previously seen only incremental accuracy improvements. Furthermore, the training process is much more automated than classical techniques, no longer requiring a huge investment in feature selection and dataset pruning. However, this performance comes at a large data cost, frequently requiring upwards of 10^9 data/label pairs. Our digital economy has provided many problems for which such data exists or can be obtained cheaply relative to the benefits. There are many other fields that would get significant benefit from DL, but where data collection or labelling is expensive. This is very common for medical and military applications where data collection and/or labeling is expensive, individualized, and protected by very strong privacy or classification laws. Many applications will benefit from a learning framework with deep structure that works with limited labeled training samples, integrates domain-knowledge into the model, and maximizes the generalization of learning across domains. In this talk, I introduce our proposed 3-step framework that enables training of accurate and robust learning models under data limitation constraints based the use of simulation as its generative models. This framework includes: (i) employment of physics-based computational models referred to as simulation; (ii) design and analysis of unsupervised domain adaption techniques to close the gap between the simulated and real-world data distributions through a low-dimensional subspace transformation; (iii) development of learning techniques in the projected subspace to train an initial weak labeler; (iv) combined use of the weak labeler and a generative-adversarial framework to refine the simulated datasets by employing on a set of unlabeled real-world dataset in order to train a strong labeler; and (v) development and analysis of active learning techniques to select the most informative datasets to refine and adapt the strong labeler into a novel case with small data in the target application.

RELATED PAPERS
Inner Space Preserving Generative Pose Machine
https://arxiv.org/abs/1808.02104
Background Subtraction via Fast Robust Matrix Completion
https://arxiv.org/abs/1711.01218

BIO
Professor Ostadabbas is an assistant professor in the Electrical and Computer Engineering Department of Northeastern University (NEU), Boston, Massachusetts, USA. Professor Ostadabbas joined NEU in 2016 from Georgia Tech, where she was a post-doctoral researcher following completion of her PhD at the University of Texas at Dallas in 2014. At NEU, Professor Ostadabbas is the director of the Augmented Cognition Laboratory (ACLab) with the goal of enhancing human information-processing capabilities through the design of adaptive interfaces via physical, physiological, and cognitive state estimation. These interfaces are based on rigorous models adaptively parameterized using machine learning and computer vision algorithms. In particular, she has been integrating domain knowledge with machine learning by using physics-based simulation as generative models for bootstrapping deep learning recognizers. Professor Ostadabbas is the co-author of more than 50 peer-reviewed journal and conference articles and her research has been awarded by the National Science Foundation (NSF), Mathworks, Amazon AWS, and NVIDIA. ​She is the co-organizer of the Multimodal Data Fusion (MMDF2018) workshop, an NSF PI mini-workshop on Deep Learning in Small Data, and will be the program chair of the upcoming Machine Learning in Signal Processing (MLSP2019) conference. Prof. Ostadabbas is an associate editor of the IEEE Transactions on Biomedical Circuits and Systems, on the Editorial Board of the IEEE Sensors Letters and Digital Biomarkers Journal, and has been serving in several signal processing and machine learning conferences as a technical chair or session chair.

• (2/15, 12:00pm): Olga Papaemmanouil (Brandeis University): Deep Learning meets Query Optimization details)

TITLE
Deep Learning meets Query Optimization

ABSTRACT
Query optimization remains one of the most important and well studied problems in database systems. However, traditional query optimizers are complex, heuristically-driven systems that do not to learn from past experiences: they plan the execution of a query, but are ignorant of the actual performance of the picked plan. Because of the lack of feedback, a query optimizer may select the same bad query plan repeatedly, never learning from its previous good or bad choices.

In this talk, I will argue that a new type of query optimizer, one that integrates deep learning with query optimization, can drastically improve on the state-of-the-art. Towards this direction, I will discuss ReJOIN, a proof-of-concept join enumerator that relies on deep reinforcement learning. ReJOIN leverages prior experience, and learns how to optimize future queries more effectively (i.e., discovers better query plans) and efficiently (i.e., spending less time on optimization) compared with traditional optimizers. I will discuss potential challenges for future research and describe deep learning approaches that can lead the way to end-to-end learning-based query optimizers.

RELATED PAPERS
http://www.cs.brandeis.edu/%7Eolga/publications/ReJOIN_aiDM18.pdf
https://arxiv.org/pdf/1809.10212

BIO
Olga Papaemmanouil is an Associate Professor in the Department of Computer Science at Brandeis University. She received her undergraduate degree in Computer Science and Informatics at the University of Patras, Greece in 1999. In 2001, she received her Sc.M. in Information Systems at the University of Economics and Business, Athens, Greece. She then joined the Computer Science Department at Brown University, where she completed her Ph.D in Computer Science at Brown University in 2008. Her research interests are in databases and distributed data management. She is the recipient of an NSF Career Award (2013) and a Paris Kanellakis Fellowship from Brown University (2002).

• (3/5, 3:30pm): Laura Di Rocco (Northeastern): The role of geographical knowledge in subcity level geolocation algorithms details)

ABSTRACT
Geolocation of microblog messages has been largely investigated in the literature. Many solutions exist, most of whichare data-driven (i.e., they rely on a training phase) and achieve good results at city-level. The development of algorithmsfor geolocation at sub-city level is still an open problem. In this paper, we investigate the role that external geographic knowledge can play in achieving reasonably accuratesub-city level geolocation. We propose a knowledge-base method, called Sherloc, to accurately geolocate messages at sub-city level, by exploiting the presence of toponyms in the message. Sherloc extracts the semantics of toponyms from geographic gazetteers and embeds them into a metric space that captures the semantic distance among them. This way we can identify the semantically closest toponyms to a microblog message and, at the same time, cluster them withrespect to their spatial locations. In contrast to the state of the art methods, Sherloc requires no prior training, it is not limited to geolocating on a fixed spatial grid, and is able to infer the location at sub-city level with higher accuracy.

• (3/29, 11:00am): Yongjoo Park (University of Michigan): Bringing Statistical Tradeoffs to Data Systems details)

TITLE
Bringing Statistical Tradeoffs to Data Systems

ABSTRACT
Despite advances in computing power, the cost of large-scale analytics and machine learning remains daunting to small and large enterprises alike. This has created a pressing demand for reducing infrastructure costs and query latencies. To meet these goals, data analysts and applications are in many cases willing to tolerate a slight—but controlled—degradation of accuracy in exchange for substantial gains in cost and performance, which we refer to as statistical tradeoffs. This is particularly true in the early stages of data exploration and is in stark contrast to traditional tradeoffs where the infrastructure costs must increase for higher performance.

BIO
My research builds large-scale data systems that can make these statistical tradeoffs in a principled manner. In this talk, I will focus on two specific directions. First, I will present VerdictDB, a system that enables quality-guaranteed, statistical tradeoffs without any changes to backend infrastructure; thus, it offers a universal solution for off-the-shelf query engines. Second, I will introduce Database Learning, a new query execution paradigm that allows existing query engines to constantly learn from their past executions and become “smarter” over time without any user intervention. I will conclude by briefly discussing other promising directions with emerging workloads beyond SQL, including visualization and machine learning.

• (4/2, 10:00am): WVH 155/168: Volker Markl & members of his research group details)

Four students from Prof Volker Markl's group in TU Berlin will present their work within 15min each (after general introductions by Renee and Volker).

Bio: Volker Markl is a Full Professor and Chair of the Database Systems and Information Management (DIMA) Group at the Technische Universität Berlin (TU Berlin). At the German Research Center for Artificial Intelligence (DFKI), he is both a Chief Scientist and Head of the Intelligent Analytics for Massive Data Research Group. In addition, he is Director of the Berlin Big Data Center (BBDC) and Co- Director of the Berlin Machine Learning Center (BzMl). Earlier in his career, he was a Research Staff Member and Project Leader at the IBM Almaden Research Center in San Jose, California, USA and a Research Group Leader at FORWISS, the Bavarian Research Center for Knowledge-based Systems located in Munich, Germany. Dr. Markl has published numerous research papers on indexing, query optimization, lightweight information integration, and scalable data processing. He holds 18 patents, has transferred technology into several commercial products, and advises several companies and startups. He has been both the Speaker and Principal Investigator for the Stratosphere Project, which resulted in a Humboldt Innovation Award as well as Apache Flink, the open-source big data analytics system. He serves as the President-Elect of the VLDB Endowment and was elected as one of Germany's leading Digital Minds (Digitale Köpfe) by the German Informatics (GI) Society. Most recently, Volker and his team earned an ACM SIGMOD Research Highlight Award 2016 for their work on “Implicit Parallelism Through Deep Language Embedding.” Volker Markl and his team earned an ACM SIGMOD Research Highlight Award 2016 for their work on implicit parallelism through deep language embedding.

LINKS: Prof Markl's group website: http://www.dima.tu-berlin.de

Speaker: Sebastian Breß

Title: Efficient Data Processing on Modern Hardware

Abstract: Processor manufacturers build increasingly specialized processors to mitigate the effects of the power wall in order to deliver improved performance. Currently, database engines have to be manually optimized for each processor which is a costly and error prone process. In this talk, we provide an overview of our research on automatic performance tuning in Hawk, a hardware-tailored code generator. Our key idea is to create processor-specific code variants and to learn a well- performing code variant for each processor. These code variants leverage various parallelization strategies and apply both generic and processor-specific code transformations. We observe that performance of different code variants may diverge up to two orders of magnitude. Thus, we need to generate custom code for each processor for peak performance. To this end, Hawk automatically finds efficient code variants for CPUs, GPUs, and MICs.

Bio: Sebastian Breß received his PhD (Dr.-Ing.) from University of Magdeburg, Germany in 2015, under the supervision of Gunter Saake (University of Magdeburg) and Jens Teubner (TU Dortmund). He is the the initiator and system architect of the research database system CoGaDB and the Hawk Code Generator. Currently, Sebastian is a Senior Researcher at German Research Center for Artificial Intelligence (DFKI) and a PostDoc at Technische Universität Berlin, working with Prof. Dr. Volker Markl and Prof. Dr. Tilmann Rabl. Sebastian‘s research interests include data management on modern hardware, query compilation, stream processing, and optimizing data management systems for heterogeneous processors. Sebastian has been selected as a Distinguished Program Committee Member at SIGMOD 2018.

Speaker: Jonas Traub

Title: On-Demand Data Stream Gathering in the Internet of Things

Abstract: In the Internet of Things (IoT), billions of sensor nodes form a sensor cloud and offer data streams to analysis systems. However, it is impossible to transfer all available data with maximal frequencies to all applications. In this talk, we show how we gather streaming data efficiently from a huge number of sensor nodes and how we make it available to a huge number of applications. Our key-idea is to gather data streams based on the data demand of streaming queries. The data demand of a query is the minimum number of data points which allows for answering the query with the desired precision. We present a solution which shares sensor nodes among many concurrent streaming queries by multiplexing their data demands. Our technique shares sensor reads and corresponding network traffic among all queries to save costs. Our experiments with real-world sensor data show that our approach saves up to 87% in data transmissions.

Bio: Jonas is a Research Associate at Technische Universität Berlin and the German Research Center for Artificial Intelligence (DFKI). His research interests include data stream processing, sensor data analysis, and data acquisition from sensor nodes. Jonas authored several publications related to data stream gathering, processing and transmission in the Internet of Things and will complete his PhD in March 2019 under the supervision of Volker Markl. Before he started his PhD, Jonas wrote his master thesis at the Royal Institute of Technology (KTH) and the Swedish Institute of Computer Science (SICS) / RISE in Stockholm under supervision of Seif Haridi and Volker Markl and advised by Paris Carbone and Asterios Katsifodimos. Prior to that, he received his B.Sc. degree at Baden-Württemberg Cooperative State University (DHBW Stuttgart) and worked several years at IBM in Germany and the USA. Jonas is an alumnus of "Software Campus", "Studienstiftung des deutschen Volkes" and "Deutschlandstipendium". All publication and supplementary material are available on http://www.user.tu-berlin.de/powibol/.

Speaker: Andreas Kunft

Title: Efficient Matrix Partitioning Through Joins

Abstract: End-to-end machine learning pipelines often consist of (i) relational operators to join the input data, (ii) user defined functions for feature extraction and vectorization, and (iii) linear algebra operators for model training and cross-validation. Often, these pipelines need to scale out to large datasets. In this case, these pipelines are usually implemented on top of dataflow engines like Hadoop, Spark, or Flink. Dataflow engines implement relational operators on row-partitioned datasets. However, efficient linear algebra operators use block-partitioned matrices. As a result, pipelines combining both kinds of operators require rather expensive changes to the physical representation, in particular re-partitioning steps. In this talk, I investigate the potential of reducing shuffling costs by fusing relational and linear algebra operations into specialized physical operators. I present BlockJoin, a distributed join algorithm which directly produces block-partitioned results. To minimize shuffling costs, BlockJoin applies database techniques known from columnar processing, such as index-joins and late materialization, in the context of parallel dataflow engines.

Bio: Andreas Kunft is a PhD student/research associate at Technische Universität Berlin in the Database Systems and Information Management Group (DIMA) led by Volker Markl. His research interests include massive parallel processing frameworks, databases, and compilers, with focus on the integration of database and compiler techniques for holistic optimization of analytics pipelines.

Speaker: Martin Kiefer

Title: Approximate Data Analysis using Modern Hardware

Abstract: Data summaries are an effective tool to balance efficiency and accuracy in data analysis tasks: Most relational query optimizers rely heavily on selectivity estimates computed from histograms or samples. Sketches have become an increasingly popular technique to perform high-bandwidth stream analysis tasks such as change or heavy-hitter detection. Using modern hardware such as GPUs and FPGAs reduces the cost of approximate

• (4/5, 12:00pm): Tingjian Ge (UMass Lowell): Dense regions: a different type of event in graph streams details)

TITLE
Dense regions: a different type of event in graph streams

ABSTRACT
A graph stream, as a special data stream, consists of a sequence of edge insertions and deletions over a dynamic graph. Analogous to general data streams, complex events have been studied on graph streams which incorporate both time order and structural relationship over basic edge events. In this talk, I will propose a new type of complex event – the occurrence of a dense region that also lasts for a long time. This turns out to be very useful in applications of various domains including telecommunication hotspot detection, road traffic control, spam network filtering, and dynamic community detection.

To efficiently discover dense regions, we propose an algorithmic framework called the Expectation-Maximization with a Utility function (EMU), a novel stochastic approach that nontrivially extends the conventional EM. We validate our EMU approach by showing that it converges to the optimum—by proving that it is a specification of the general Minorization-Maximization (MM) framework with convergence guarantees. We then devise EMU algorithms for the densest lasting subgraph problem. Using real-world graph streams, we experimentally verify the effectiveness and efficiency of our techniques.

RELATED PAPER
http://www.cs.uml.edu/~ge/pdf/dense_subgraph_ICDE_19.pdf

BIO
Tingjian Ge is an associate professor in Computer Science at the University of Massachusetts, Lowell. He received a Ph.D. from Brown University in 2009. Prior to that, he got his Bachelor’s and Master’s degrees in Computer Science from Tsinghua University and UC Davis, respectively, and worked at Informix and IBM for six years. His research areas are in data management and analytics, with a recent focus on applying machine learning, AI, and algorithmic techniques in data management and mining. He is a recipient of the NSF CAREER Award in 2012, and a Teaching Excellence Award at UMass Lowell in 2014.

### Fall 2018 (Wolfgang)

• (9/13, 10:40am): WVH 366: Anshumali Shrivastava (Rice University): Hashing Algorithms for Extreme Scale Machine Learning details)

TITLE
Hashing Algorithms for Extreme Scale Machine Learning.

ABSTRACT
In this talk, I will discuss some of my recent and surprising findings on the use of hashing algorithms for large-scale estimations. Locality Sensitive Hashing (LSH) is a hugely popular algorithm for sub-linear near neighbor search. However, it turns out that fundamentally LSH is a constant time (amortized) adaptive sampler from which efficient near-neighbor search is one of the many possibilities. Our observation adds another feather in the cap for LSH. LSH offers a unique capability to do smart sampling and statistical estimations at the cost of few hash lookups. Our observation bridges data structures (probabilistic hash tables) with efficient unbiased statistical estimations. I will demonstrate how this dynamic and efficient sampling beak the computational barriers in adaptive estimations where, for the first time, it is possible that we pay roughly the cost of uniform sampling but get the benefits of adaptive sampling. We will demonstrate the power of one simple idea for three favorite problems 1) Partition function estimation for large NLP models such as word2vec, 2) Adaptive Gradient Estimations for efficient SGD and 3) Sub-Linear Deep Learning with Huge Parameter Space.

In the end, if time permits, we will switch to memory cost show a simple hashing algorithm that can shrink memory requirements associated with classification problems exponentially! Using our algorithms, we can train 100,000 classes with 400,000 features, on a single Titan X while only needing 5% or less memory required to store all the weights. Running a simple logistic regression on this data, the model size of 320GB is unavoidable.

BIO
Anshumali Shrivastava is an assistant professor in the computer science department at Rice University. His broad research interests include randomized algorithms for large-scale machine learning. He is a recipient of National Science Foundation (NSF) CAREER Award, a Young Investigator Award from Air Force Office of Scientific Research (AFOSR), and machine learning research award from Amazon. His research on hashing inner products has won Best Paper Award at NIPS 2014 while his work on representing graphs got the Best Paper Award at IEEE/ACM ASONAM 2014. Anshumali got his PhD in 2015 from Cornell University.

• (9/26, 12:00pm): Spyros Blanas (Ohio State University): Scaling database systems to high-performance computers details)

TITLE
Scaling database systems to high-performance computers

ABSTRACT
We are witnessing the increasing use of warehouse-scale computers to analyze massive datasets quickly. This poses two challenges for database systems. The first challenge is interoperability with established analytics libraries and tools. Massive datasets often consist of images (arrays) in file formats like FITS and HDF5. To analyze such datasets users turn to domain-specific libraries and deep learning frameworks, and thus write code that directly manipulates files. We will first present ArrayBridge, an open-source I/O library that allows SciDB, TensorFlow and HDF5-based programs to co-exist in a pipeline without converting between file formats. With ArrayBridge, users benefit from the optimizations of a database system without sacrificing the ability to directly manipulate data through the existing HDF5 API when they want to.

The second challenge is scalability, as warehouse-scale computers expose communication bottlenecks in foundational data processing operations. This talk will focus on data shuffling and parallel aggregation. We will first present an RDMA-aware data shuffling algorithm that transmits data up to 4X faster than MPI. This is achieved by switching to a connectionless, datagram-based network transport layer that scales better but requires flow control in software. We will then present a parallel aggregation algorithm for high-cardinality aggregation that carefully schedules data transmissions to avoid unscaleable all-to-all communication. The algorithm leverages similarity to transmit less data over congested network links. We will conclude by highlighting additional challenges that need to be overcome to scale database systems to massive computers.

BIO
Spyros Blanas is an assistant professor in the Department of Computer Science and Engineering at The Ohio State University. His research interest is high performance database systems, and his current goal is to build a database system for high-end computing facilities. He has received the IEEE TCDE Rising Star award and a Google Research Faculty award. He completed his Ph.D. at the University of Wisconsin–Madison where part of his Ph.D. dissertation was commercialized in Microsoft SQL Server as the Hekaton in-memory transaction processing engine.

EUROSYS 2017:

ICDE 2018:

• (9/28, 12:00pm): Stratos Idreos (Harvard): The periodic table of data structures details)

TITLE
The periodic table of data structures.

ABSTRACT
Data structures are critical in any data-driven scenario, and they define the behavior of modern data systems and data-driven algorithms. However, they are notoriously hard to design due to a massive design space and the dependence of performance on workload and hardware which evolve continuously.

What if we knew how many and which data structures are possible to design? What if we could compute the expected performance of a data structure design on a given workload and hardware without having to implement it and without even having access to the target machine? We will discuss our quest for 1) the first principles of data structures, 2) design continuums that make it possible to automate design, and 3) self-designing systems that can morph between what we now consider fundamentally different structures. We will draw examples from the NoSQL key-value store design space and discuss how to accelerate them and balance space-time tradeoffs.

BIO
Stratos Idreos is an assistant professor of Computer Science at Harvard University where he leads DASlab, the Data Systems Laboratory. Stratos was awarded the 2011 ACM SIGMOD Jim Gray Doctoral Dissertation award and the 2011 ERCIM Cor Baayen award. He is also a recipient of an IBM zEnterpise System Recognition Award, a VLDB Challenges and Visions best paper award and an NSF Career award. In 2015 he was awarded the IEEE TCDE Rising Star Award from the IEEE Technical Committee on Data Engineering for his work on adaptive data systems.

https://stratos.seas.harvard.edu/files/stratos/files/datacalculator.pdf
https://stratos.seas.harvard.edu/files/stratos/files/dostoyevski.pdf
https://stratos.seas.harvard.edu/files/stratos/files/monkeykeyvaluestore.pdf

Prior recorded talk at UW:
http://db.cs.washington.edu/nwds/nwds.html

Further pointer:
https://www.zdnet.com/article/zen-and-the-art-of-data-structures-from-self-tuning-to-self-designing-data-systems/

• (10/4, 4:30pm): Angelika Kimmig (Cardiff University): A Collective, Probabilistic Approach to Schema Mapping details)

TITLE
A Collective, Probabilistic Approach to Schema Mapping

ABSTRACT
We propose a probabilistic approach to the problem of schema mapping. Our approach is declarative, scalable, and extensible. It builds upon recent results in both schema mapping and probabilistic reasoning and contributes novel techniques in both ﬁelds. We introduce the problem of mapping selection, that is, choosing the best mapping from a space of potential mappings, given both metadata constraints and a data example. As selection has to reason holistically about the inputs and the dependencies between the chosen mappings, we deﬁne a new schema mapping optimization problem which captures interactions between mappings. We then introduce Collective Mapping Discovery (CMD), our solution to this problem using state-of-the-art probabilistic reasoning techniques, which allows for inconsistencies and incompleteness. Using hundreds of realistic integration scenarios, we demonstrate that the accuracy of CMD is more than 33% above that of metadata-only approaches already for small data examples, and that CMD routinely ﬁnds perfect mappings even if a quarter of the data is inconsistent.

BIO
Angelika Kimmig is a Lecturer at Cardiff University, UK. She obtained her Ph.D. from KU Leuven, Belgium, and was a postdoctoral fellow at KU Leuven and the University of Maryland, College Park. Her research interests include symbolic AI, reasoning under uncertainty, machine learning, logic programming, and especially combinations thereof such as probabilistic programming and statistical relational learning. She is a key contributor to the probabilistic logic programming language ProbLog.

PAPER from ICDE 2017:

• (10/26, 12:00pm): Mania Abdi (Northeastern): D3N: A multi-layer cache for improving big-data applications’ performance in data centers with imbalanced networks details)

TITLE
D3N: A multi-layer cache for improving big-data applications’ performance in data centers with imbalanced networks

ABSTRACT
Caching methods for improving the performance of datalakes for throughput-bound big-data jobs assume unlimited bandwidth across the datacenter. However, most enterprise and academic datacenters grow organically and have heavily oversubscribed networks between different computer clusters. This paper describes D3N, an architecture that caches data on the access side of potential network bottlenecks and uses a multilayer approach. In the case of a two layer cache, reuse distances are dynamically computed along with cache miss costs to determine the partitioning of the cache into L1 (used for local accesses) and L2 (used for remote access, but within same access side) layers. We have implemented a prototype of D3N by modifying Ceph’s RADOS gateway. Micro and macro evaluations of the prototype demonstrate that the implementation is performant enough to saturate the (40,Gbit/s) NICs and (5 GB/s read) SSD of our caching server and deliver substantial storage bandwidth improvements for real workloads. Numerical models show that a multi-level, dynamic cache can have substantial advantages over today’s single-level caches when bandwidth is constrained.

BIO
Mania Abdi is a PhD student at the Northeastern University Solid State Storage research group. Prior to that, she was a software engineer. She received her MSc. in Computer Engineering at the Sharif University of Technology and B.Eng. in Computer Engineering at the Amirkabir University of Technology. Mania is a computer systems researcher with a storage focus and has worked on a broad set of topics, including distribted storage, caching, data center debuging, and end-to-end tracing.

• (11/7, 12:00pm): Laurel Orr (University of Washington): Probabilistic Database Summarization for Interactive Data Exploration details)

TITLE:
Probabilistic Database Summarization for Interactive Data Exploration

ABSTRACT
A fundamental assumption of traditional DBMSs is that the database contains all information necessary to answer a query; i.e., the database contains the entire universe of data. Many data scientists, however, do not have access to the universe of data and instead rely on samples to answer queries. These data scientists need to use tools outside of the database or alter their queries to correctly reweight and debias their samples to get a more accurate query answer. This talk presents preliminary research into building the first open world database system (OWDB) that inherently assumes relations are samples from some universe, even if the sampling mechanism is unknown. We will discuss two different approaches for building an OWDB: probabilistic modeling of the universe and sample reweighting. We then present EntropyDB, a system that takes the former approach to summarize a database, and we lastly discuss ongoing research into the later approach using Bayesian Networks. We conclude with discussing the open challenges in developing an OWDB.

BIO
Laurel Orr is a 6th year PhD student in the Database Group at the Paul G Allen School for Computer Science and Engineering. Her research interests are data summarization, approximate query processing, data exploration, and the general development of tools and techniques to aid data scientists in their data investigation and analysis pipeline. Her current research goal is to develop a prototype open world database system that inherently treats relations a samples drawn from some larger universe of data, even when the sampling mechanism is unknown. She was a 2018 NCWIT Collegiate Award Honorable Mention and was awarded a NSF Graduate Research Fellowship in 2015.

RELATED PAPER:
http://db.cs.washington.edu/projects/entropydb/ljorr-vldb2017.pdf

PROJECT PAGE:
http://db.cs.washington.edu/projects/entropydb/

• (11/9, 12:00pm): Raul Castor Fernandez (MIT): Aurum: A Data Discovery System details)

TITLE
Aurum: A Data Discovery System

ABSTRACT
Organizations store data in hundreds of different data sources, including relational databases, files, and large data lake repositories. These data sources contain valuable information and insights that can be beneficial to multiple aspects of modern data-driven organizations. However, as more data is produced, our ability to use it reduces dramatically, as no single person knows about all the existent data sources. One big challenge is to discover the data sources that are relevant to answer a particular question. Aurum is a data discovery system to answer "discovery queries" on large volumes of data. In this talk, I'll motivate the data discovery problem with use cases from different industries. I will describe Aurum's design and I will talk a bit about a new research project that aims to discover data beyond tables.

BIO
In my research, I build high-performance and scalable systems to discover, prepare and process data. I'm a postdoctoral researcher at MIT, working with Sam Madden and Mike Stonebraker. Before, I completed my PhD at Imperial College London with Peter Pietzuch.

RELATED PAPERS
http://raulcastrofernandez.com/papers/icde18-aurum.pdf
http://raulcastrofernandez.com/papers/wip_termite.pdf

• (11/14, 12:00pm): Lawson Wong (Northeastern): Abstraction in robotics details)

TITLE
Abstraction in robotics

ABSTRACT
Robotics is a big data problem. To make sense of the physical world, perform tasks well, and generalize across environments, robots need to represent and understand the world at the "correct" level of abstraction. What "correct" should mean remains to be seen.
In this talk, I will describe two lines of work that attempt to answer this question from very different perspectives. I will first discuss work on grounding natural language instructions to robot behavior, where we demonstrate that having the right representations can enable human-robot communication. This is an important problem for robotics, since we envision users using natural language to instruct robots to perform a wide variety of tasks. In the second half, I will discuss recent preliminary work on the theoretical foundations of state abstraction in reinforcement learning, a common framework used in robot learning problems. In particular, we view state abstraction as data compression, and apply results in information theory (rate-distortion theory) to the reinforcement learning setting.
Time permitting, I will describe some extensions to the above work, as well as other abstraction-related problems, that I envision my group will pursue at Northeastern.

BIO
Lawson L.S. Wong is an assistant professor in the College of Computer and Information Science at Northeastern University. His research focuses on learning, representing, and estimating knowledge about the world that an autonomous robot may find useful. Prior to Northeastern, Lawson was a postdoctoral fellow at Brown University. He completed his PhD at the Massachusetts Institute of Technology. He has received a Siebel Scholarship, AAAI Robotics Student Fellowship, and Croucher Foundation Fellowship for Postdoctoral Research.
https://www.ccis.northeastern.edu/people/lawson-wong/

RELATED PAPERS
Sequence-to-Sequence Language Grounding of Non-Markovian Task Specifications
Nakul Gopalan, Dilip Arumugam, Lawson L.S. Wong, Stefanie Tellex
Robotics: Science and Systems (2018)

Grounding Natural Language Instructions to Semantic Goal Representations for Abstraction and Generalization
Dilip Arumugam, Siddharth Karamcheti, Nakul Gopalan, Edward C. Williams, Mina Rhee, Lawson L.S. Wong, Stefanie Tellex
http://cs.brown.edu/~lwong5/papers/2018-languagegrounding.pdf
Autonomous Robots (in press; 2018)
Extended journal version of Robotics: Science and Systems (2017) paper

State Abstraction as Compression in Apprenticeship Learning
David Abel, Dilip Arumugam, Kavosh Asadi, Yuu Jinnai, Michael L. Littman, Lawson L.S. Wong
Preprint available on request (by e-mail)
To appear in AAAI Conference on Artificial Intelligence (2019)

• (11/16, 12:00pm): Alexandra Meliou (UMass Amherst): Creating a Higher-Quality Data World details)

TITLE
Diagnoses and Explanations: Creating a Higher-Quality Data World

ABSTRACT
The correctness and proper function of data-driven systems and applications relies heavily on the correctness of their data. Low quality data can be costly and disruptive, leading to revenue loss, incorrect conclusions, and misguided policy decisions. Improving data quality is far more than purging datasets of errors; it is critical to improve the processes that produce the data, to collect good data sources for generating the data, and to address the root causes of problems.

Our work is grounded on an important insight: While existing data cleaning techniques can be effective at purging datasets of errors, they disregard the fact that a lot of errors are systemic, inherent to the process that produces the data, and thus will keep occurring unless the problem is corrected at its source. In contrast to traditional data cleaning, we focus on data diagnosis: explaining where and how the errors happen in a data generative process. I will describe our work on Data X-Ray and QFix, two diagnostic frameworks for large-scale extraction systems and relational data systems. I will also provide a brief overview of new results on knowledge augmentation and explanations for dataset differences, building towards a vision for toolsets that assist the exploration of information in a varied, diverse, and highly non-integrated data world.

BIO
Alexandra Meliou is an Assistant Professor in the College of Information and Computer Sciences, at the University of Massachusetts, Amherst. Prior to that, she was a Post-Doctoral Research Associate at the University of Washington. Alexandra received her PhD degree from the Electrical Engineering and Computer Sciences Department at the University of California, Berkeley. She has received recognitions for research and teaching, including a CACM Research Highlight, an ACM SIGMOD Research Highlight Award, an ACM SIGSOFT Distinguished Paper Award, an NSF CAREER Award, a Google Faculty Research Award, and a Lilly Fellowship for Teaching Excellence. Her research focuses on data provenance, causality, explanations, data quality, and algorithmic fairness.
https://people.cs.umass.edu/~ameli/

RELATED PAPERS
http://sites.computer.org/debull/A18june/p47.pdf
http://people.cs.umass.edu/ameli/projects/queryProvenance/papers/WangMW2017.pdf
https://people.cs.umass.edu/~ameli/projects/packageBuilder/papers/scalable-paql.pdf

• (12/7, 11:30am): Fei Chiang (McMaster University): Contextual and Spatio-temporal Data Cleaning details)

TITLE
Contextual and Spatio-temporal Data Cleaning

ABSTRACT
It is becoming increasingly difficult for organizations to reap value from their data due to poor data quality. This is motivated by the observation that real data is rarely error free, containing incomplete, inconsistent, and stale values. This leads to inaccurate, and out-of-date data analysis downstream. Addressing data inconsistency requires not only reconciling differing syntactic references to an entity, but it is often necessary to include domain expertise to correctly interpret the data. For example, understanding that a reference to ‘jaguar’ may be interpreted as an animal or as a vehicle. Secondly, having up-to-date (or current) data is important for timely data analysis. Cleaning stale values goes beyond just relying on timestamps, especially when timestamps may be missing, inaccurate or incomplete.

In this talk, I will present our work towards achieving consistent and up-to-date data. First, I will discuss contextual data cleaning that uses a new class of data integrity constraints that tightly integrate domain semantics from an ontology. Second, we argue that data currency is a relative notion based on individual spatio-temporal update patterns, and these patterns can be learned and predicted. I will present our framework to achieve these two objectives, and provide a brief overview of recent extensions with applications to knowledge fusion.

BIO
Fei Chiang is an Assistant Professor in the Department of Computing and Software at McMaster University. She is a Faculty Fellow at the IBM Centre for Advanced Studies, and served as an inaugural Associate Director of the McMaster MacData Institute. She received her M. Math from the University of Waterloo, and B.Sc and PhD degrees from the University of Toronto, all in Computer Science. Her research interests are in data quality, data cleaning, data privacy and text mining. She holds four patents for her work in self-managing database systems. Her work has been featured in the Southern Ontario Smart Computing Impact Report. She is a recipient of the Dean’s Teaching Honour Roll, and a 2018 Ontario Early Researcher Award.

• (12/12, 12:00pm): Erkang Zhu (University of Toronto): Get Your Data Together! Algorithms for Managing Data Lakes details)

TITLE
Get Your Data Together! Algorithms for Managing Data Lakes

ABSTRACT
Data lakes (e.g., enterprise data catalogs and Open Data portals) are data dumps if users cannot find and utilize the data in them. In this talk, I present two problems in massive, dynamic data lakes: 1) searching for joinable tables without a precomputed join graph, and 2) joining tables from different sources through auto-generated syntactic transformation on join values. I will also present two algorithmic solutions that can be used for data lakes that are large both in the number of tables (millions) and table sizes. The presented work has been published in SIGMOD and VLDB.

BIO
Erkang (Eric) Zhu is a 5th year computer science PhD candidate at University of Toronto. His supervisor is Prof. Renée J. Miller. His research focuses on data discovery, large-scale similarity search, and randomized algorithms (data sketches).

RELATED PAPERS
Erkang Zhu, Fatemeh Nargesian, Ken Q. Pu, Renée J. Miller: LSH Ensemble: Internet-Scale Domain Search. PVLDB 9(12): 1185-1196 (2016)
Erkang Zhu, Ken Q. Pu, Fatemeh Nargesian, Renée J. Miller: Interactive Navigation of Open Data Linkages. PVLDB 10(12): 1837-1840 (2017)

### Fall 2017 (Wolfgang)

This semester the data lab seminar is combined with in our special topics class (CS 7290: Special topics: Foundations in scalable data management) and takes place every Tuesday 11:45am-1:25pm and Thursday 2:50-4:30pm in Ryder Hall 126 (map).