Fall 2021

  • (9/2, 12:00pm): Robin Walters (Northeastern University): Equivariant Neural Networks for Learning Spatiotemporal Dynamics

    TITLE
    Equivariant Neural Networks for Learning Spatiotemporal Dynamics

    ABSTRACT
    Applications such as climate science and transportation require learning complex dynamics from large-scale spatiotemporal data. Existing machine learning frameworks are still insufficient to learn spatiotemporal dynamics as they often fail to exploit the underlying physics principles. Representation theory can be used to describe and exploit the symmetry of the dynamical system. We will show how to design neural networks that are equivariant to various symmetries for learning spatiotemporal dynamics. Our methods demonstrate significant improvement in prediction accuracy, generalization, and sample efficiency in forecasting turbulent flows and predicting real-world trajectories. This is joint work with Rose Yu, Rui Wang, and Jinxi Li.
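
    To make the symmetry idea concrete, here is a minimal sketch (ours, not the speakers' models) of the simplest case the talk generalizes: a circular convolution is equivariant to cyclic shifts, i.e. shifting the input then convolving equals convolving first and shifting the output.

```python
import numpy as np

def circular_conv(x, kernel):
    """1-D circular convolution: a translation-equivariant linear map."""
    n = len(x)
    return np.array([
        sum(kernel[j] * x[(i - j) % n] for j in range(len(kernel)))
        for i in range(n)
    ])

x = np.array([1.0, 2.0, 3.0, 4.0])
k = np.array([0.5, 0.25, 0.25])

# Equivariance check: conv(shift(x)) == shift(conv(x)).
assert np.allclose(circular_conv(np.roll(x, 1), k),
                   np.roll(circular_conv(x, k), 1))
```

    Equivariant networks for rotations, reflections, or scaling are built from analogous commuting layers.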

    RELATED PAPERS
    * Walters, R.*, Wang, R.*, Yu, R. (2021). Incorporating Symmetry into Deep Dynamics Models for Improved Generalization. International Conference on Learning Representations (ICLR). https://arxiv.org/abs/2002.03061
    * Walters, R.*, Li, J.*, Yu R. (2021). Trajectory Prediction using Equivariant Continuous Convolution. International Conference on Learning Representations (ICLR). https://arxiv.org/abs/2010.11344

    BIO
    Robin Walters is a postdoctoral research fellow in the Khoury College of Computer Sciences. He joined Khoury in July 2020 through the Experiential AI program. Formerly, Robin was a Zelevinsky Research Instructor in the Mathematics Department at Northeastern. His research studies the connections between representation theory and differential equations both theoretically and practically using equivariant neural networks.

  • (9/9, 12:00pm): Wolfgang Gatterbauer (Northeastern University): Structure-preserving diagrams for relational queries and first-order logic

    TITLE
    Structure-preserving diagrams for relational queries and first-order logic

    ABSTRACT
    The relative expressiveness of relational query languages has been studied extensively: the logical expressiveness of particular fragments of relational algebra, relational calculus, Datalog with negation, and restricted SQL under set semantics is well known to be equivalent. Yet what does it take to represent "logical patterns" and compare languages by their abilities to represent particular structures of reasoning that are common across various syntactic conventions?
    We describe a complete diagrammatic representation system that preserves the logical structure for non-disjunctive safe relational calculus. It also solves a problem that has vexed the logical community for over 100 years: finding an unambiguous and complete representation system for first-order logic sentences.
    Topics to discuss: 1. Why are disjunctions inherently more difficult to visualize than conjunctions? 2. Why do visual query languages based on relational algebra fall short in their ability to represent certain structural patterns? 3. A discussion of three common "abuses of the line" that have created conceptual difficulties in past related work (with a particular focus on Peirce's influential and widely studied existential beta graphs). 4. Ideas on how to solve the representation problem for disjunctions.

    RELATED WORK
    https://queryvis.com/

    SHORT BIO
    Wolfgang Gatterbauer is an Associate Professor in the Khoury College of Computer Sciences at Northeastern University. A major focus of his research is to extend the capabilities of modern data management systems in generic ways and to allow them to support novel functionalities that seem hard at first.
    https://gatterbauer.name

  • (9/16, 12:00pm): Amir Ilkhechi (Brown University): DeepSqueeze: Deep Semantic Compression for Tabular Data

    TITLE
    DeepSqueeze: Deep Semantic Compression for Tabular Data

    ABSTRACT
    With the rapid proliferation of large datasets, efficient data compression has become more important than ever. Columnar compression techniques (e.g., dictionary encoding, run-length encoding, delta encoding) have proved highly effective for tabular data, but they typically compress individual columns without considering potential relationships among columns, such as functional dependencies and correlations. Semantic compression techniques, on the other hand, are designed to leverage such relationships to store only a subset of the columns necessary to infer the others, but existing approaches cannot effectively identify complex relationships across more than a few columns at a time.
    In this talk, I will describe DeepSqueeze, a novel semantic compression framework that can efficiently capture these complex relationships within tabular data by using autoencoders to map tuples to a lower-dimensional representation. DeepSqueeze also supports guaranteed error bounds for lossy compression of numerical data and works in conjunction with common columnar compression formats. Our experimental evaluation uses real-world datasets to demonstrate that DeepSqueeze can achieve over a 4x size reduction compared to state-of-the-art alternatives.
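
    A rough sketch of the semantic-compression recipe described above (our illustration, with PCA standing in for the autoencoder and made-up correlated columns): map tuples to a lower-dimensional code, then store explicit corrections wherever the reconstruction error would exceed the user's bound, so decompression is guaranteed within that bound.

```python
import numpy as np

rng = np.random.default_rng(0)
base = rng.normal(size=(1000, 1))
# Three strongly correlated columns (hypothetical data).
table = np.hstack([base,
                   2 * base + 0.01 * rng.normal(size=(1000, 1)),
                   -base + 0.01 * rng.normal(size=(1000, 1))])

# "Encoder"/"decoder": the top principal component, standing in for an autoencoder.
mean = table.mean(axis=0)
_, _, vt = np.linalg.svd(table - mean, full_matrices=False)
codes = (table - mean) @ vt[:1].T        # one stored value per tuple instead of three
recon = codes @ vt[:1] + mean

# Guaranteed error bound for lossy numeric compression: store explicit
# corrections wherever the reconstruction is off by more than the bound.
error_bound = 0.05
corrections = np.where(np.abs(table - recon) > error_bound, table - recon, 0.0)
decoded = recon + corrections

assert np.max(np.abs(table - decoded)) <= error_bound
```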

    PAPER and recorded video from SIGMOD 2020:
    http://cs.brown.edu/people/acrotty/pubs/3318464.3389734.pdf
    https://dl.acm.org/doi/abs/10.1145/3318464.3389734

    BIO
    Amir Ilkhechi is a Ph.D. student in the Computer Science Department at Brown University, where he is a member of the Database Group advised by Professor Ugur Cetintemel. His research explores applications of deep learning to fundamental data management problems, most recently focusing on novel approaches to compression.

  • (9/23, 12:00pm): Roee Shraga (Northeastern University): PoWareMatch: a Quality-aware Deep Learning Approach to Improve Human Schema Matching

    TITLE
    PoWareMatch: a Quality-aware Deep Learning Approach to Improve Human Schema Matching

    ABSTRACT
    Schema matching is a core task of any data integration process. Despite years of investigation in the fields of databases, AI, the Semantic Web, and data mining, the main challenge remains the ability to generate quality matches among data concepts (e.g., database attributes). In this work, we examine a novel angle on the behavior of humans as matchers, studying match creation as a process. We analyze the dynamics of common evaluation measures (precision, recall, and f-measure) with respect to this angle and highlight the need for unbiased matching to support this analysis. Unbiased matching, a newly defined concept capturing the common assumption that human decisions represent reliable assessments of schemata correspondences, is, however, not an inherent property of human matchers. We therefore design PoWareMatch, which uses a deep learning mechanism to calibrate and filter human matching decisions according to their expected quality, and then combines them with algorithmic matching to generate better match results. We provide empirical evidence, based on an experiment with more than 200 human matchers over common benchmarks, that PoWareMatch accurately predicts the benefit of extending a match with an additional correspondence and generates high-quality matches. In addition, PoWareMatch outperforms state-of-the-art matching algorithms.
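
    The calibrate-then-combine pipeline can be caricatured in a few lines (a toy sketch with invented data, not PoWareMatch itself): filter human decisions by an estimated reliability, then union the survivors with an algorithmic matcher's confident correspondences.

```python
# (attribute pair, human says "match", estimated reliability) -- invented data
human_decisions = [
    (("cust_name", "client"), True, 0.9),
    (("cust_id", "client"), True, 0.3),    # low-reliability decision, filtered out
]
# Correspondences proposed by an algorithmic matcher, with similarity scores.
algorithmic = {("cust_id", "client_id"): 0.8}

threshold = 0.5
calibrated = {pair for pair, is_match, rel in human_decisions
              if is_match and rel >= threshold}
final_match = calibrated | {pair for pair, score in algorithmic.items()
                            if score >= threshold}

assert ("cust_name", "client") in final_match
assert ("cust_id", "client") not in final_match
```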

    RELATED WORK
    https://arxiv.org/abs/2109.07321

    BIO
    Roee Shraga is a Postdoctoral fellow at the Khoury College of Computer Sciences at Northeastern University. He received his PhD degree in 2020 from the Technion – Israel Institute of Technology in the area of Data Science. Roee has published more than a dozen papers in leading journals and conferences on the topics of data integration, human-in-the-loop, machine learning, process mining, and information retrieval. He is also a recipient of several PhD fellowships including the Leonard and Diane Sherman Interdisciplinary Fellowship (2017), the Daniel Excellence Scholarship (2019), and the Miriam and Aaron Gutwirth Memorial Fellowship (2020).

  • (9/30, 12:00pm): Can Qin (Northeastern University): Neural Pruning via Growing Regularization

    TITLE
    Neural Pruning via Growing Regularization

    ABSTRACT
    Regularization has long been utilized to learn sparsity in deep neural network pruning. However, its role has mainly been explored in the small-penalty-strength regime. In this work, we extend its application to a new scenario where the regularization grows large gradually to tackle two central problems of pruning: the pruning schedule and weight importance scoring. (1) The former topic is newly brought up in this work; we find it critical to pruning performance, yet it has received little research attention. Specifically, we propose an L_2 regularization variant with rising penalty factors and show it can bring significant accuracy gains compared with its one-shot counterpart, even when the same weights are removed. (2) The growing penalty scheme also gives us a way to exploit Hessian information for more accurate pruning without knowing its specific values, thus avoiding the common Hessian approximation problems. Empirically, the proposed algorithms are easy to implement and scalable to large datasets and networks in both structured and unstructured pruning. Their effectiveness is demonstrated with modern deep neural networks on the CIFAR and ImageNet datasets, achieving competitive results compared to many state-of-the-art algorithms.
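
    A minimal sketch of the growing-regularization idea (ours; the paper's schedules and importance scoring are more involved): ramp up an L_2 penalty on the weights slated for pruning so they are driven toward zero before they are actually removed.

```python
import numpy as np

rng = np.random.default_rng(1)
w = rng.normal(size=16)
# Pruning candidates: the smaller half of the weights (toy criterion).
prune_mask = np.abs(w) < np.median(np.abs(w))

penalty = 1e-3
for _ in range(200):
    grad = np.zeros_like(w)                 # task-loss gradient omitted in this sketch
    grad[prune_mask] += penalty * w[prune_mask]  # L2 term only on candidates
    w -= 0.1 * grad
    penalty = min(penalty * 1.05, 5.0)      # the "growing" penalty schedule

# Candidates have been driven essentially to zero, so removal is near-lossless.
assert np.max(np.abs(w[prune_mask])) < 1e-6
w[prune_mask] = 0.0
```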

    RELATED WORK
    ICLR 2021: https://openreview.net/pdf?id=o966_Is_nPA

    BIO
    Can Qin received his B.E. degree from Xidian University (XDU), China, in 2018. He is currently a Ph.D. candidate in the Department of Electrical and Computer Engineering at Northeastern University, under the supervision of Prof. Yun Raymond Fu. He received the Best Paper Award at the ICCV 2019 Workshop on Real-World Recognition from Low-Quality Images and Videos, and has published papers at top-tier conferences including NeurIPS, AAAI, ECCV, and ICLR. His research interests broadly include transfer learning and deep learning.

  • (10/7, 12:00pm): Senjuti Basu Roy (New Jersey Institute of Technology): Optimization Opportunities in Human-in-the-loop Systems

    TITLE
    Optimization Opportunities in Human-in-the-loop Systems

    ABSTRACT
    An emerging trend is to leverage an under-explored and richly heterogeneous pool of human knowledge inside machine algorithms, a practice popularly termed human-in-the-loop (HIL) processes. A wide variety of applications, ranging from sentiment analysis to image recognition, query processing to text translation, or even feature engineering, stand to benefit from such synergistic man-machine collaboration. This talk will explore optimization opportunities inside such HIL systems, considering the roles and responsibilities of three key stakeholders - humans (workers), machines (algorithms), and platforms (online infrastructure where the work takes place). Optimization inside such HIL systems investigates judicious involvement of workers inside machine algorithms, as well as the desired functionality of the platforms to satisfy a variety of goals pertinent to the aforementioned stakeholders. Following that, the talk will specifically discuss both modeling and algorithmic challenges in task design and deployment inside large-scale HIL systems.

    RELATED WORK
    Multi-Session Diversity to Improve User Satisfaction in Web Applications. WWW 2021.
    https://dl.acm.org/doi/10.1145/3442381.3450046
    Recommending Deployment Strategies for Collaborative Tasks. SIGMOD 2020.
    https://dl.acm.org/doi/10.1145/3318464.3389719
    Making AI Machines Work for Humans in FoW. SIGMOD Record 2020.
    https://dl.acm.org/doi/10.1145/3442322.3442327

    BIO
    Senjuti Basu Roy is the Panasonic Chair in Sustainability and an Associate Professor in the Department of Computer Science at the New Jersey Institute of Technology. Her broader research interests lie in the area of large-scale data management with a focus on designing principled algorithms for "human-in-the-loop" systems. She is the tutorial co-chair of The Web Conference 2022, and has served as the Mentorship co-chair of SIGMOD 2018, PhD workshop co-chair of VLDB 2018, co-chair of the SEADATA Workshop 2021 (colocated with VLDB 2021), and co-chair of the HMData Workshops 2017-2021 (colocated with the IEEE BigData conference). She is a recipient of the NSF CAREER Award, and one of the 100 invited early career engineers to attend the National Academy of Engineering’s 2021 US Frontiers of Engineering Symposium.

  • (10/14, 12:00pm): Steven Holtzen (Northeastern University): Exploiting Symmetry for Scaling Discrete Factor Graph Inference

    TITLE
    Exploiting Symmetry for Scaling Discrete Factor Graph Inference

    ABSTRACT
    A key goal in the design of probabilistic inference algorithms is identifying and exploiting properties of the distribution that make inference tractable. One such property is symmetry, which is characterized by points in the distribution that are guaranteed to have the same probability. In this talk I will describe two inference algorithms for discrete factor graphs that scale in the degree of symmetry of the distribution. The first inference algorithm, called orbit generation, is the first exact inference algorithm for factor graphs that scales in the degree of symmetry of the distribution. The second inference algorithm is a Markov-Chain Monte-Carlo algorithm that mixes rapidly in the degree of symmetry.
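
    The payoff of symmetry can be seen in a toy example (our illustration, not the paper's algorithms): for a distribution over n binary variables that is invariant under all permutations, each orbit of states is determined by its Hamming weight, so the partition function needs only n+1 terms instead of 2^n.

```python
import math
from itertools import product

n = 10

def weight(k):
    """Unnormalized potential depending only on the Hamming weight k
    (hence invariant under all permutations of the variables)."""
    return math.exp(0.3 * k * (n - k))

# Brute force: sum over all 2^n states.
Z_brute = sum(weight(sum(s)) for s in product([0, 1], repeat=n))

# Orbit-based: group states by Hamming weight; the orbit size is C(n, k).
Z_orbit = sum(math.comb(n, k) * weight(k) for k in range(n + 1))

assert math.isclose(Z_brute, Z_orbit)
```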

    RELATED WORK
    Generating and Sampling Orbits for Lifted Probabilistic Inference. Steven Holtzen, Todd Millstein, and Guy Van den Broeck. In Uncertainty in Artificial Intelligence (UAI), 2019.
    https://arxiv.org/abs/1903.04672

    BIO
    Steven Holtzen is an assistant professor at Northeastern University. His research focuses on programming languages, artificial intelligence, and machine learning. In particular he is interested in probabilistic programming languages, foundations of probabilistic inference, tractable probabilistic modeling, automated reasoning, and probabilistic verification. His work has been recognized by an ACM SIGPLAN distinguished paper award.

  • (10/21, 12:00pm): Nesime Tatbul (MIT): Towards Explainable Time Series Anomaly Detection

    TITLE
    Towards Explainable Time Series Anomaly Detection

    ABSTRACT
    Time series is a ubiquitous data type with a growing range of applications from telemetry to finance. A key capability that lies at the core of managing time series data is the identification of unusual patterns of interest called anomalies. Detecting and explaining anomalies not only finds use in many mission-critical domains, but also empowers systems and users with the ability to handle large data volumes by guiding attention and resources to information that matters the most. Despite years of effort, diversity of time series applications, often noisy nature of datasets, and contextual variations in anomaly types and instances challenge the creation of robust and generalizable solutions. In this talk, I will present a collection of novel data science tools and techniques to help overcome such challenges in practice, including Exathlon -- the first public benchmark for explainable anomaly detection over high-dimensional time series.
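
    For readers new to the area, here is the simplest possible baseline detector (purely illustrative; the talk's benchmark targets far richer methods): flag points that deviate from a trailing window's mean by more than k standard deviations.

```python
import statistics

series = [10, 11, 10, 12, 11, 10, 50, 11, 10, 12]   # made-up data with one spike
window, k = 5, 3.0

anomalies = []
for i in range(window, len(series)):
    hist = series[i - window:i]
    mu, sigma = statistics.mean(hist), statistics.pstdev(hist)
    # Flag a point whose deviation from the trailing mean exceeds k sigmas.
    if sigma > 0 and abs(series[i] - mu) > k * sigma:
        anomalies.append(i)

assert anomalies == [6]     # the spike at index 6 is the only flagged point
```

    Real detectors must also cope with the contextual variation the abstract mentions, which is exactly what makes robust, explainable solutions hard.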

    RELATED LINKS
    Exathlon: http://vldb.org/pvldb/vol14/p2613-tatbul.pdf, https://github.com/exathlonbenchmark/exathlon
    TSAD-Evaluator: https://arxiv.org/pdf/1803.03639.pdf, https://github.com/IntelLabs/TSAD-Evaluator
    Metro-Viz: https://dl.acm.org/doi/pdf/10.1145/3299869.3320247

    SHORT BIO
    Nesime Tatbul is a senior research scientist at Intel Labs and MIT, currently serving as an industry co-PI for MIT's Data Systems and AI Lab jointly funded by Intel, Google, and Microsoft. Previously, she served on the computer science faculty of ETH Zurich after receiving a Ph.D. degree from Brown University. Her research interests are broadly in large-scale data management systems and modern data-intensive applications, with a current focus on time series analytics and learned data systems. She has been an active member of the database research community for 20+ years, serving in various roles for the VLDB Endowment, ACM SIGMOD, and other organizations.

  • (10/28, 12:00pm): Azza Abouzied (NYU): Scalable Prescriptive Analytics in Database Systems

    TITLE
    Scalable Prescriptive Analytics in Database Systems

    ABSTRACT
    Currently, database systems do not natively support the many data processing needs of data-driven decision making, leaving experts to develop their own custom, ad hoc application-level solutions that are difficult to scale and may produce sub-optimal results. While many systems provide support for scalable descriptive analytics (like statistics and summaries of the raw data) and even some predictive analytics (such as forecasts), there is little support for prescriptive analytics, which searches for the best course of action given the available data. As we move from "what is the data?" to "what to do with it?", we need to augment database systems with efficient computational problem-solving capabilities that take into consideration the inherent uncertainty of data and models. In this talk, I will explore some of the systems we built to integrate state-of-the-art solvers within the DBMS to scalably solve stochastic constrained optimization problems. I will also describe our work in building a scalable system to support sequential decision-making, and how this system is being used to help public health policy makers construct cost-effective policies that curb epidemics like COVID-19.
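
    A toy example of a prescriptive query (our sketch, with invented data): rather than describing or predicting, it searches for the best feasible course of action, here the subset of candidate actions maximizing expected benefit under a budget constraint.

```python
from itertools import combinations

# Candidate actions from a hypothetical table: (name, cost, expected_benefit).
actions = [("a", 4, 10), ("b", 3, 7), ("c", 2, 4), ("d", 5, 11)]
budget = 7

# Brute-force "package" selection; real systems push this into a solver.
best = max(
    (subset
     for r in range(len(actions) + 1)
     for subset in combinations(actions, r)
     if sum(a[1] for a in subset) <= budget),
    key=lambda subset: sum(a[2] for a in subset),
)

assert sorted(a[0] for a in best) == ["a", "b"]   # cost 7, benefit 17
```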

    PAPER LINKS
    https://dl.acm.org/doi/10.1145/3472749.3474794
    https://dl.acm.org/doi/10.1145/3318464.3389765
    http://packagebuilder.cs.umass.edu/papers/p576-brucato.pdf

    BIO
    Azza Abouzied is an associate professor of computer science at New York University Abu Dhabi. Her research focuses on designing intuitive data querying tools and combines techniques from various research fields such as HCI, machine learning, and database systems. In 2019, she won a VLDB test of time award for her work on HadoopDB. Her work on integrating decision-making support in database systems received a Best of VLDB recognition, a SIGMOD Research Highlight, a CACM research highlight and a Best VLDB demo award. She earned her doctoral degree from Yale in 2013. She spent a year as a visiting scholar at UC Berkeley.

  • (11/4, 12:00pm): Brian Hentschel (Harvard): Cerebral Data Structures

    TITLE
    Cerebral Data Structures

    ABSTRACT
    Data structures form the basis of all data-driven software applications, and operations on data structures form a critical part of the overall cost of systems. For instance, a single data structure, hash tables, in just a single language, C++, accounts for 2% of total CPU usage and 5% of total RAM usage across all of Google.
    We make the case for cerebral data structures, which use machine learning and statistical modeling at a high level to redesign base data structures in computer science. This redesign keeps the core structure of classical data structures; by keeping this core, the operations remain simple and therefore efficient as well as robust to shifting workloads. Additionally, keeping this core allows for theoretical arguments about data structure performance. At the same time, machine learning and statistics are used to transfer properties of the workload and data into the data structure through a redesign, achieving better expected performance than classic designs.
    As well as discussing the general approach, we present in detail two applications of this approach. First, we present Stacked Filters, which uses workload skew to produce 100X lower false positive rates for filter data structures. Second, we present Entropy-Learned Hashing, which separates hashing speed from input size, producing 10X faster hash function evaluation and 4X faster hash tables than state-of-the-art approaches from Facebook and Google.
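
    The intuition behind separating hashing speed from input size can be sketched as follows (our toy example, not the actual Entropy-Learned Hashing method): if keys share long low-entropy regions, hashing only the positions that vary preserves the ability to distinguish keys while reading far fewer bytes.

```python
# Hypothetical keys with a long shared prefix (low entropy) and a short
# varying suffix (high entropy).
keys = [f"https://example.com/user/{i:06d}" for i in range(1000)]

# "Learn" which byte positions actually vary across the keys.
varying = [p for p in range(len(keys[0])) if len({k[p] for k in keys}) > 1]

def cheap_hash(key):
    # Hash only the high-entropy positions (3 chars here instead of ~30).
    return hash("".join(key[p] for p in varying))

# The short projection already distinguishes every key in this workload.
projected = {"".join(k[p] for p in varying) for k in keys}
assert len(varying) == 3 and len(projected) == len(keys)
```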

    RELATED WORK
    https://stratos.seas.harvard.edu/files/stratos/files/stackedfilters_vldb2021_extended_version.pdf

    SHORT BIO
    Brian Hentschel is a Ph.D. candidate at Harvard University advised by Stratos Idreos. He is interested in blending statistics and machine learning with classical techniques to improve computer systems. His research has been awarded the SIGMOD research highlight award as well as a best paper award from EDBT. He earned his BA in mathematics and computer science at Pomona College and has previously spent time at Microsoft, Amazon, LinkedIn, and IBM.

  • (11/5, 12:00pm): Kuldeep Meel (National University of Singapore): Statistical Learning and Symbolic Reasoning: Better Together for Software 2.0

    TITLE
    Statistical Learning and Symbolic Reasoning: Better Together for Software 2.0

    ABSTRACT
    The advent of personal computing has made manual tabulation of data obsolete. What about a future where software engineers would balk at the prospect of manually designing heuristics? We think such a future is on the horizon: in this talk, I will discuss our ambitious project, CrystalBall, which seeks to automate the design of heuristics in modern SAT solvers. The past two decades have witnessed how SAT went from the dreaded Non-deterministic Polytime (NP)-complete problem to Not a Problem (NP) for the formal methods and AI communities. Such progress was fueled by experts' careful design of heuristics, and every year's SAT competition sees experts continue to spend hundreds of hours tuning them. I will discuss how such heuristics can be learned automatically with statistical techniques, and offer glimpses of a future where experts focus only on high-level ideas.

    RELEVANT PAPER
    https://www.comp.nus.edu.sg/~meel/Papers/sat19skm.pdf

    BLOG
    https://www.msoos.org/2019/06/crystalball-sat-solving-data-gathering-and-machine-learning/

    BIO
    Kuldeep Meel holds the NUS Presidential Young Professorship in the School of Computing at the National University of Singapore. His research interests lie at the intersection of Formal Methods and Artificial Intelligence. He is a recipient of the 2019 NRF Fellowship for AI (accompanied by S$2.5 million in funding) and was named one of AI’s 10 to Watch by IEEE Intelligent Systems in 2020. His work received the 2020 Amazon Research Award, the 2018 Ralph Budd Award for Best PhD Thesis in Engineering, the 2014 Outstanding Masters Thesis Award from the Vienna Center for Logic and Algorithms, and the Best Student Paper Award at CP 2015. His CP-18 paper received an invitation to the IJCAI-19 Sister Conferences Best Paper Award track, his PODS-21 paper a "Best of PODS-21" invitation from ACM TODS, and his CAV-20 paper a "Best Papers of CAV-20" invitation from the FMSD journal.

  • (11/11, 12:00pm): Xiao Hu (Duke): Enumeration Algorithms for Conjunctive Queries with Projection

    TITLE
    Enumeration Algorithms for Conjunctive Queries with Projection

    ABSTRACT
    We investigate the enumeration of query results for an important subset of CQs with projections, namely star and path queries. The task is to design data structures and algorithms that allow for efficient enumeration with delay guarantees after a preprocessing phase. Our main contribution is a series of results based on the idea of interleaving precomputed output with further join processing to maintain delay guarantees, which may be of independent interest. In particular, we design combinatorial algorithms that provide instance-specific delay guarantees in nearly linear preprocessing time. These algorithms improve upon the currently best known results. Further, we show how existing results can be improved upon by using fast matrix multiplication. We also present new results involving a tradeoff between preprocessing time and delay guarantees for the enumeration of path queries that contain projections. A CQ with projection in which the join attribute is projected away is equivalent to Boolean matrix multiplication; our results can therefore also be interpreted as sparse, output-sensitive matrix multiplication with delay guarantees.
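
    The setting can be illustrated with a tiny example (ours; it shows the preprocessing/enumeration interface, not the paper's delay-guaranteeing algorithms): for the path query pi_{A,C}(R(A,B) join S(B,C)), a preprocessing phase builds an index, after which results are emitted one at a time; projecting away the join attribute is what makes duplicates, and hence the delay between outputs, nontrivial.

```python
from collections import defaultdict

R = [(1, "x"), (2, "x"), (2, "y")]        # R(A, B), invented data
S = [("x", 10), ("y", 20)]                # S(B, C)

# Preprocessing phase: index S on the join attribute B.
S_index = defaultdict(list)
for b, c in S:
    S_index[b].append(c)

def enumerate_results():
    """Yield pi_{A,C}(R join S) one tuple at a time."""
    seen = set()
    for a, b in R:
        for c in S_index[b]:
            if (a, c) not in seen:        # projecting away B can create duplicates
                seen.add((a, c))
                yield (a, c)

assert sorted(enumerate_results()) == [(1, 10), (2, 10), (2, 20)]
```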

    RELATED WORK
    [1] Enumeration Algorithms for Conjunctive Queries with Projection, Shaleen Deep, Xiao Hu, Paraschos Koutris, ICDT 2021. https://arxiv.org/abs/2101.03712
    [2] Trade-offs in Static and Dynamic Evaluation of Hierarchical Queries, Ahmet Kara, Milos Nikolic, Dan Olteanu, Haozhe Zhang, PODS 2020. https://arxiv.org/abs/1907.01988
    [3] On acyclic conjunctive queries and constant delay enumeration, Guillaume Bagan, Arnaud Durand, and Etienne Grandjean, CSL 2007. https://grandjean.users.greyc.fr/Recherche/PublisGrandjean/EnumAcyclicCSL07.pdf
    [4] Structural tractability of enumerating csp solutions, G. Greco and F. Scarcello, Constraints 2013. https://arxiv.org/abs/1005.1567

    SHORT BIO
    Xiao Hu is a postdoctoral associate in the Department of Computer Science at Duke University, co-supervised by Prof. Pankaj Agarwal and Prof. Jun Yang. Prior to that, she received her Ph.D. in Computer Science and Engineering from HKUST, and a BE degree in Computer Software from Tsinghua University. Her research has focused on studying fundamental problems in database theory and their implications to practical systems. Her work on massively parallel join algorithms has been invited to ACM Transactions on Database Systems as a research paper, as well as a feature article in the Database Principles Column in SIGMOD Record.

  • (11/18, 12:00pm): Stijn Vansummeren (Hasselt University): General Dynamic Yannakakis: Conjunctive Queries with Theta Joins Under Updates

    TITLE
    General Dynamic Yannakakis: Conjunctive Queries with Theta Joins Under Updates

    ABSTRACT
    The ability to efficiently analyze changing data is a key requirement of many real-time analytics applications. In database terms, such analysis is closely related to the problem of Incremental View Maintenance (IVM) where we are asked to maintain the result Q(db) of a query Q on a database db under updates.
    In this talk I will summarize selected algorithmic ideas of the IVM literature and illustrate that they inherently rely on either (1) recomputation of query subresults or (2) materialization of subresults. Both have their drawbacks: recomputation is detrimental to update latency while materialization may waste both memory and be detrimental to latency.
    By moving to the framework of query evaluation with small delay, introduced by the seminal paper of Bagan, Durand, and Grandjean in 2007, we are able to circumvent both drawbacks. In particular, for the so-called q-hierarchical queries, it is possible to (1) represent query results and subresults succinctly, in space at most linear in the database; (2) enumerate the results from this representation efficiently, with constant delay; and (3) maintain the representation efficiently, in constant time for updates of constant size.
    I will illustrate how we can obtain these features by modifying Yannakakis' seminal algorithm for evaluating Acyclic Conjunctive Queries. The resulting algorithm, which is called Dynamic Yannakakis, can be generalized to not only apply to Acyclic Conjunctive Queries (which only contain equality joins by definition), but also to such queries endowed with theta-joins (in particular: inequality predicates like A < B).
    This talk summarizes our results published in SIGMOD 2017, VLDB 2018, and VLDB Journal 2020, which received the SIGMOD Research Highlights Award 2018.
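
    The flavor of constant-time maintenance can be conveyed by the simplest q-hierarchical example (our sketch, far simpler than Dynamic Yannakakis): maintaining COUNT(R(A) join S(A)) under single-tuple inserts with per-key multiplicities costs O(1) per update.

```python
from collections import Counter

class JoinCounter:
    """Maintain COUNT(R(A) join S(A)) under single-tuple inserts, O(1) each."""
    def __init__(self):
        self.r, self.s, self.size = Counter(), Counter(), 0

    def insert_r(self, a):
        self.r[a] += 1
        self.size += self.s[a]   # new R-tuple joins with every matching S-tuple

    def insert_s(self, a):
        self.s[a] += 1
        self.size += self.r[a]

jc = JoinCounter()
for a in [1, 1, 2]:
    jc.insert_r(a)
for a in [1, 2, 3]:
    jc.insert_s(a)

assert jc.size == 3              # key 1 contributes 2 pairs, key 2 contributes 1
```

    Deletions decrement the same counters symmetrically; the hard part the talk addresses is doing this for whole query results, not just counts.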

    RELATED WORK
    * On acyclic conjunctive queries and constant delay enumeration, Guillaume Bagan, Arnaud Durand, and Etienne Grandjean, CSL 2007.
    https://grandjean.users.greyc.fr/Recherche/PublisGrandjean/EnumAcyclicCSL07.pdf
    * General dynamic Yannakakis: conjunctive queries with theta joins under updates. Muhammad Idris, Martín Ugarte, Stijn Vansummeren, Hannes Voigt, Wolfgang Lehner. VLDB J. 29(2-3): 619-653 (2020).
    https://link.springer.com/article/10.1007/s00778-019-00590-9
    * Efficient Query Processing for Dynamically Changing Datasets. Muhammad Idris, Martín Ugarte, Stijn Vansummeren, Hannes Voigt, Wolfgang Lehner. SIGMOD Rec. 48(1): 33-40 (2019).
    https://sigmodrecord.org/?smd_process_download=1&download_id=3073
    * Conjunctive Queries with Inequalities Under Updates. Muhammad Idris, Martín Ugarte, Stijn Vansummeren, Hannes Voigt, Wolfgang Lehner. Proc. VLDB Endow. 11(7): 733-745 (2018).
    http://www.vldb.org/pvldb/vol11/p733-idris.pdf
    * The Dynamic Yannakakis Algorithm: Compact and Efficient Query Processing Under Updates. Muhammad Idris, Martín Ugarte, Stijn Vansummeren. SIGMOD Conference 2017: 1259-1274. https://martinugarte.com/media/pdfs/main_pDxeVno.pdf

    SHORT BIO
    Stijn Vansummeren is research professor of Data Management and Data Wrangling at the Data Science Institute of Hasselt University, Belgium. His research focuses on large scale data management, with as overarching theme the study, development, and application of formal approaches to wrangling, querying, and analyzing data at scale. Stijn Vansummeren obtained his PhD in 2005 from Hasselt University, where he was a PhD fellow of the Research Foundation Flanders (FWO) in the Databases and Theoretical Computer Science group, advised by Jan Van den Bussche. From 2005-2009, he was a postdoctoral fellow of the FWO in the same group. In 2009, he joined the Université Libre de Bruxelles (ULB), Belgium where he was associate professor of Computer Science at the Engineering faculty, until September 2020.
    His research has been awarded with the ACM SIGMOD Research Highlights Award (2018), a best paper award at WebDB (2016), and a best paper nomination award at the World Wide Web (WWW) conference (2008).

  • (12/2, 12:00pm): Asterios Katsifodimos (TU Delft): Valentine: Evaluating Matching Techniques for Dataset Discovery

    TITLE
    Valentine: Evaluating Matching Techniques for Dataset Discovery

    ABSTRACT
    Data scientists today search large data lakes to discover and integrate datasets. In order to bring together disparate data sources, dataset discovery methods rely on some form of schema matching: the process of establishing correspondences between datasets. After 20 years of research in schema matching, we are still missing a benchmark for it, as well as proper datasets and proper evaluation metrics! In this talk I will present Valentine, an extensible open-source experiment suite to execute and organize large-scale automated matching experiments on tabular data. Valentine now includes implementations of 7 seminal schema matching methods that we either implemented from scratch (due to the absence of open-source code) or imported from open repositories. Finally, Valentine offers a data fabrication toolbox for constructing testing datasets with ground truth. I will conclude my talk with insights from a very large set of experiments we have been performing at TU Delft, focusing on the strengths and weaknesses of existing techniques, which can serve as a guide for employing schema matching in future dataset discovery methods.
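
    As a point of reference, a naive instance-based matcher of the kind such benchmarks compare against (our toy example, not one of Valentine's implemented methods) can be written in a few lines: score column pairs by the Jaccard similarity of their value sets and keep pairs above a threshold.

```python
# Two hypothetical tables, represented as column -> set of values.
t1 = {"name": {"ann", "bob", "carol"}, "zip": {"02115", "10001"}}
t2 = {"full_name": {"ann", "bob", "dave"}, "postcode": {"02115", "10001"}}

def jaccard(a, b):
    """Overlap of two value sets relative to their union."""
    return len(a & b) / len(a | b)

matches = sorted(
    (c1, c2)
    for c1, v1 in t1.items()
    for c2, v2 in t2.items()
    if jaccard(v1, v2) >= 0.5
)

assert matches == [("name", "full_name"), ("zip", "postcode")]
```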

    LINKS:
    Original Paper: https://ieeexplore.ieee.org/abstract/document/9458921
    Demo: http://www.vldb.org/pvldb/vol14/p2871-koutras.pdf
    Project: https://delftdata.github.io/valentine/
    Code: https://github.com/delftdata/valentine

    SHORT BIO
    Asterios Katsifodimos is an assistant professor at the Delft University of Technology. Before TU Delft, Asterios worked at the SAP Innovation Center, and as a postdoc at TU Berlin. He holds a PhD from INRIA & University of Paris 11 in France, and currently works on scalable stream processing and data integration.

  • (12/3, 8:00am): Qichen Wang (Hong Kong UST): Maintaining Acyclic Foreign-Key Joins under Updates details)

    TITLE
    Maintaining Acyclic Foreign-Key Joins under Updates

    ABSTRACT
    In this paper, we study the problem of incrementally maintaining the query results of acyclic joins under updates, i.e., insertion and deletion of tuples to any of the relations. Prior work has shown that this problem is inherently hard, requiring at least $\Omega(|db|^{{1\over 2} -\epsilon})$ time per update, where $|db|$ is the size of the database, and $\epsilon > 0$ can be any small constant. However, this negative result holds only on adversarially constructed update sequences; on the other hand, most real-world update sequences are "nice", nowhere near these worst-case scenarios.
    We introduce a measure $\lambda$, which we call the enclosureness of the update sequence, to more precisely characterize its intrinsic difficulty. We present an algorithm to maintain the query results of any acyclic join in $O(\lambda)$ time amortized, on any update sequence whose enclosureness is $\lambda$. This is complemented with a lower bound of $\Omega(\lambda^{1-\epsilon})$, showing that our algorithm is essentially optimal with respect to $\lambda$. Moreover, the new measure also recovers prior lower bounds on static as well as dynamic query evaluation. Next, using this algorithm as the core component, we show how all 22 queries in the TPC-H benchmark can be supported in $\tilde{O}(\lambda)$ time. Finally, based on the algorithms developed, we built a continuous query processing system on top of Flink, and experimental results show that our system significantly outperforms previous ones.
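    The core idea of incremental view maintenance is to update only the affected part of the join result on each insertion or deletion, rather than recomputing from scratch. A toy illustration for a two-relation foreign-key join R(a, b) ⋈ S(b, c) (a simplified sketch; this is not the paper's enclosureness-based algorithm):

    ```python
    from collections import defaultdict

    class JoinView:
        """Materialized view of R(a, b) JOIN S(b, c), maintained under updates."""

        def __init__(self):
            self.r_by_b = defaultdict(set)   # index R tuples on join key b
            self.s_by_b = defaultdict(set)   # index S tuples on join key b
            self.result = set()              # materialized join tuples (a, b, c)

        def insert_r(self, a, b):
            self.r_by_b[b].add(a)
            for c in self.s_by_b[b]:         # join only with matching S tuples
                self.result.add((a, b, c))

        def insert_s(self, b, c):
            self.s_by_b[b].add(c)
            for a in self.r_by_b[b]:
                self.result.add((a, b, c))

        def delete_r(self, a, b):
            self.r_by_b[b].discard(a)
            for c in self.s_by_b[b]:         # remove only the affected results
                self.result.discard((a, b, c))

    view = JoinView()
    view.insert_s(1, "x")
    view.insert_r("t1", 1)        # joins immediately: ("t1", 1, "x")
    view.insert_r("t2", 2)        # no matching S tuple yet
    view.insert_s(2, "y")         # now ("t2", 2, "y") appears
    view.delete_r("t1", 1)        # its join results disappear
    print(view.result)            # {('t2', 2, 'y')}
    ```

    The per-update cost here depends on how many existing tuples each update "encloses", which is the intuition the enclosureness measure $\lambda$ formalizes for general acyclic joins.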

    RELATED WORKS
    * Wang, Qichen, and Ke Yi. "Maintaining Acyclic Foreign-Key Joins under Updates." In SIGMOD 2020.
    https://www.cse.ust.hk/~yike/sigmod20.pdf
    * Chirkova, Rada, and Jun Yang. "Materialized views." Foundations and Trends in Databases 4, no. 4, 2011.
    http://db.cs.duke.edu/papers/fntdb12-ChirkovaYang-mat_views.pdf
    * Idris, Muhammad, Martín Ugarte, and Stijn Vansummeren. "The dynamic Yannakakis algorithm: Compact and efficient query processing under updates." In PVLDB 2017.
    https://dl.acm.org/doi/pdf/10.1145/3035918.3064027
    * Ahmad, Yanif, Oliver Kennedy, Christoph Koch, and Milos Nikolic. "DBToaster: Higher-order Delta Processing for Dynamic, Frequently Fresh Views," PVLDB 2012.
    http://vldb.org/pvldb/vol5/p968_yanifahmad_vldb2012.pdf
    * Berkholz, Christoph, Jens Keppeler, and Nicole Schweikardt. "Answering conjunctive queries under updates." In PODS 2017.
    https://dl.acm.org/doi/pdf/10.1145/3034786.3034789

    SHORT BIO
    Qichen Wang is a PhD candidate in the Department of Computer Science and Engineering at Hong Kong University of Science and Technology, supervised by Prof. Ke Yi. Prior to that, he received a BE degree in Computer Science from Zhejiang University. His research has focused on studying optimization techniques for query evaluation, from theoretical aspects to practical implementations. He is also interested in parallel and distributed algorithms, and algorithms for data streams.

  • (12/16, 12:00pm): Bill Howe (University of Washington): Data-Centric AI: Reuse, Integration, and Synthesis of Weakly Structured Data details)

    TITLE
    Data-Centric AI: Reuse, Integration, and Synthesis of Weakly Structured Data

    ABSTRACT
    What good is a collection of 1000s of loosely related tables? Repositories of weakly structured datasets -- datasets with rows and columns but few other guarantees -- have been built by organizations, cities, and in scientific domains, but the anticipated value of these systems has been difficult to realize. Urban open data repositories, data marketplaces, enterprise intranets, and scientific repositories are motivated by the idea that the data can be reused and integrated, but technical friction, limited provenance, data quality issues, limited search and discovery, and governance restrictions make widespread reuse difficult.
    Across a number of projects in the last decade, we have considered ways to make these repositories more valuable by enabling global reasoning over collections of locally defined datasets. I'll discuss a few projects in this space, including SQLShare, where we aimed to share and reuse the verbs (SQL queries) as well as the nouns (tables); ClaimJumper, where we aimed to enable repository-wide claim verification and statistical inference; and EquiTensors, where we aim to learn integrated, reusable, and unbiased representations from data with a shared time and space domain. Ultimately, our goal is to understand the limits of using diverse, undercurated, weakly structured datasets to train better models. I'll finish with some ideas for synthesizing data to enable better training and evaluation of ML models in specialized application domains where data is scarce.

    RELATED WORK
    https://faculty.washington.edu/billhowe/publications/pdfs/jain2016sqlshare.pdf
    https://faculty.washington.edu/billhowe/publications/pdfs/grechkin17ezlearn.pdf
    https://dl.acm.org/doi/abs/10.1145/3448016.3452777 (sorry for the paywall)
    bonus:
    https://faculty.washington.edu/billhowe//publications/pdfs/jain_cidr_2019.pdf

    BIO
    Bill Howe is Associate Professor in the Information School and Adjunct Associate Professor in the Allen School of Computer Science & Engineering and the Department of Electrical Engineering. His research interests are in data management, machine learning, and visualization, particularly as applied in the physical and social sciences. As Founding Associate Director of the UW eScience Institute, Dr. Howe played a leadership role in the Moore-Sloan Data Science Environment program through a $32.8 million grant awarded jointly to UW, NYU, and UC Berkeley, and founded UW's Data Science for Social Good Program. With support from the MacArthur Foundation, NSF, and Microsoft, Howe directs UW's participation in the Cascadia Urban Analytics Cooperative. He founded the UW Data Science Masters Degree, serving as its inaugural Program Chair, and created an early MOOC on data science that attracted over 200,000 students. His research has been featured in the Economist and Nature News, and he has authored award-winning papers in conferences across data management, machine learning, and visualization. He has a Ph.D. in Computer Science from Portland State University and a Bachelor's degree in Industrial & Systems Engineering from Georgia Tech.

  • (12/17, 12:00pm): Luana Ruiz (UPenn): Graphon Signal Processing details)

    TITLE
    Graphon Signal Processing

    ABSTRACT
    Graphons are infinite-dimensional objects that represent the limit of convergent sequences of graphs as their number of nodes goes to infinity. This paper derives a theory of graphon signal processing centered on the notions of the graphon Fourier transform and linear shift invariant graphon filters, the graphon counterparts of the graph Fourier transform and graph filters. It is shown that for convergent sequences of graphs and associated graph signals: (i) the graph Fourier transform converges to the graphon Fourier transform when the graphon signal is bandlimited; (ii) the spectral and vertex responses of graph filters converge to the spectral and vertex responses of graphon filters with the same coefficients. These theorems imply that for graphs that belong to certain families, i.e., that are part of sequences converging to a certain graphon, graph Fourier analysis and graph filter design have well defined limits. In turn, these facts extend the applicability of graph signal processing to graphs with a large number of nodes — since signal processing pipelines designed for limit graphons can be applied to finite graphs — and to dynamic graphs — since we can relate the results of signal processing pipelines designed for different graphs from the same convergent graph sequence.
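    The graph-side objects in the abstract can be computed numerically: the graph Fourier transform projects a signal onto the eigenvectors of the graph shift operator, and a graph filter acts as pointwise multiplication in that spectral domain. A small sketch using standard GSP definitions on a 4-node path graph (illustration only, not code from the paper):

    ```python
    import numpy as np

    # Adjacency matrix of a 4-node path graph, used as the shift operator
    A = np.array([[0, 1, 0, 0],
                  [1, 0, 1, 0],
                  [0, 1, 0, 1],
                  [0, 0, 1, 0]], dtype=float)

    # Eigendecomposition: columns of V form the graph Fourier basis,
    # eigenvalues are the graph frequencies.
    eigvals, V = np.linalg.eigh(A)

    x = np.array([1.0, 2.0, 2.0, 1.0])   # a graph signal
    x_hat = V.T @ x                      # graph Fourier transform (GFT)
    x_back = V @ x_hat                   # inverse GFT recovers the signal

    # A graph filter h(A) = h0*I + h1*A acts as pointwise multiplication
    # in the spectral domain: h(A) x = V diag(h(eigvals)) V^T x.
    h = lambda lam: 1.0 + 0.5 * lam      # filter coefficients h0=1, h1=0.5
    y_spectral = V @ (h(eigvals) * x_hat)
    y_vertex = x + 0.5 * (A @ x)         # same filter in the vertex domain
    print(np.allclose(y_spectral, y_vertex))  # True
    ```

    The convergence results in the talk say that, along a sequence of graphs converging to a graphon, these spectral computations have well defined limits, so a filter designed once on the limit object transfers to any large enough graph in the family.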

    RELATED WORK
    * Graphon Signal Processing. IEEE Transactions on Signal Processing, vol. 69, 2021.
    https://arxiv.org/pdf/2003.05030
    * Graphon Neural Networks and the Transferability of Graph Neural Networks
    https://proceedings.neurips.cc/paper/2020/file/12bcd658ef0a540cabc36cdf2b1046fd-Paper.pdf
    * Transferability Properties of Graph Neural Networks
    https://arxiv.org/pdf/2112.04629.pdf

    SHORT BIO
    Luana Ruiz received the B.Sc. degree in electrical engineering from the University of São Paulo, Brazil, and the M.Sc. degree in electrical engineering from the École Supérieure d'Electricité (now CentraleSupélec), France, in 2017. She is currently a Ph.D. candidate with the Department of Electrical and Systems Engineering at the University of Pennsylvania. Her research interests are in the areas of large-scale graph machine learning and the mathematical foundations of deep learning. She was awarded an Eiffel Excellence scholarship from the French Ministry for Europe and Foreign Affairs between 2013 and 2015, nominated an iREDEFINE fellow in 2019 and a MIT EECS Rising Star in 2021, and received best student paper awards at the European Signal Processing Conference (EUSIPCO) in 2019 and 2021.