Activities

DATA Lab Seminar

For spring 2024 we continue our seminars in person in WVH 462 (map) with brown bag lunch. Our speakers may dial in remotely at times via Zoom. If you are interested to join the seminars, please subscribe to our DATA Lab talks email list or DATA Lab talks calendar. Zoom links are only announced via email. When you sign up to the DATA Lab talks email list, please use your name or write us a short note so we know who you are before we add you to the announcements.

Summer 2024 (Wolfgang)

  • (5/10, 12:00pm): Jun Yang (Duke): What Teaching Databases Taught me about Researching Databases details)

    TITLE
    What Teaching Databases Taught me about Researching Databases

    ABSTRACT
    Declarative querying and automatic optimization are the cornerstones of the success and longevity of database systems, yet these concepts often pose challenges for novice learners accustomed to different coding paradigms. The transition is further hampered by the lack of query debugging tools (be it for correctness or performance) compared to the plethora available for programming languages. The talk samples several systems that we build at Duke University to help students learn and debug database queries. These systems have not only helped scale up teaching and improve learning, but also inspired research on interesting and fundamental questions concerning databases. Furthermore, with the rise of generative AI, we argue that there is a heightened need for skills in scrutinizing and debugging AI-generated queries, and we outline several ongoing and future work directions aimed at addressing this emerging challenge.

    RELATED PAPERS
    * Hu, Gilad, Stephens-Martinez, Roy, Yang. Qr-Hint: Actionable Hints Towards Correcting Wrong SQL Queries. SIGMOD 2024 https://arxiv.org/abs/2404.04352
    * Miao, Roy, Yang. Explaining Wrong Queries Using Small Examples. SIGMOD 2019 https://dl.acm.org/doi/10.1145/3299869.3319866
    * Gilad, Miao, Roy, Yang. Understanding Queries by Conditional Instances. SIGMOD 2022 https://dl.acm.org/doi/10.1145/3514221.3517898

    BIO
    Jun Yang is currently the Bishop-MacDermott Family Professor of Computer Science at Duke University. He joined Duke after receiving his Ph.D. from Stanford in 2001 and chaired the Department of Computer Science at Duke during 2020-2023. He has broad research interests in databases and data-intensive systems. He is a Trustee of the VLDB Endowment and served as the general co-chair of SIGMOD 2017 and the co-Editor-in-Chief of PVLDB during 2022-2023. He is a recipient of the CAREER Award, IBM Faculty Award, HP Labs Innovation Research Award, and Google Faculty Research Award. He has striven to connect research to his other passions, such as journalism, where he has worked on computational fact-checking since its nascent days, and education, where he has built a number of software tools for learning databases. He received the David and Janet Vaughan Brooks Teaching Award at Duke.

  • (5/24, 12:00pm): Ding Ming (ETH Zürich): Two-Commodity Flow is Equivalent to Linear Programming details)

    TITLE
    Two-Commodity Flow is Equivalent to Linear Programming

    ABSTRACT
    Maximum flow is a very well-studied problem, which can be formulated as a linear program. While an almost-linear time algorithm for single-commodity flow is developed, progress for multi-commodity flow is only made by progress in general linear program solvers. It turns out it is not by accident since two-commodity flow problem (2CF) is equivalent to general linear programs (LP). More precisely, we show that any LP (under milder assumptions) can be encoded as a 2CF problem in nearly linear time, with only a polylogarithmic blow-up in problem size and polynomial blow-up in error. Our reduction applies to high-accuracy approximation algorithms and exact algorithms.
    Our proof builds upon Itai's polynomial-time reduction from LP to 2CF [Itai JACM'78], improving its reduction time, problem size, and establishing an error bound. In this talk, I will go through the key steps of the reduction algorithm from LP to 2CF. Additionally, I will provide an intuitive error analysis of the solution mapping from a 2CF to LP.
    Joint work with Rasmus Kyng and Peng Zhang.

    RELATED WORK
    * Ming, Kyng, Zhang. Two-commodity flow is equivalent to linear programming under nearly-linear time reductions, 49th EATCS International Conference on Automata, Languages, and Programming, https://arxiv.org/abs/2201.11587
    * Presentation at Hausdorff Center for Mathematics: Two-Commodity Flow is as Hard as Linear Programming https://www.youtube.com/watch?v=f6SWJMQIJRc
    * Alon Itai. Two-commodity flow. JACM 1978. https://dl.acm.org/doi/10.1145/322092.322100

    SHORT BIO
    I am a fourth-year PhD student at the Department of Computer Science, ETH Zürich. I am fortunate to be supervised by Rasmus Kyng. My research interests lie in algorithms and optimization problems, with a specific focus on fast algorithms and complexity analysis for structured Linear Equations and Linear Programming. Before coming to ETH Zürich, I received my Bachelor’s degree from Shanghai Jiao Tong University. Additionally, I participated in a dual-degree program, obtaining a Diplôme d’Ingénieur from CentraleSupélec in France and a Master’s degree from Shanghai Jiao Tong University in China.
    https://2015dingm.github.io/

Spring 2024 (Zixuan)

  • (2/7, 11:00am): Nikolaos Tziavelis (Northeastern University): A New Generation of Query Processing for Advanced Data Analytics details)

    TITLE
    A New Generation of Query Processing for Advanced Data Analytics

    ABSTRACT
    While both the amount of data and our need to analyze them are growing, today’s data systems are limited in the analytical tasks they can handle efficiently. When presented with a complex analytical task, they often become unresponsive because they resort to naive query-processing algorithms. A major culprit is a join operation in the middle of the query-processing pipeline: that join can often produce unnecessarily large intermediate results even though the final output is small. Overcoming this challenge by “pushing” different kinds of operators behind joins is not always simple and often requires advancements in algorithms.

    In this talk, I will present my work on improving the algorithmic capabilities of data systems, focused on the case where users have preferences over the query answers and are interested in retrieving only a select few of them. I show that for a large class of these queries, it is possible to avoid materializing intermediate joins, resulting in non-trivial complexity guarantees and efficient implementations that outperform existing systems by orders of magnitude. Besides addressing fundamental questions regarding the limits of query processing, this work opens up unexplored possibilities for system design. The long-term goal is a new generation of scalable systems for data analytics that can efficiently handle queries and data that are considered infeasible today.

    BIO
    Nikolaos (Nikos) Tziavelis is a PhD candidate at Northeastern University, advised by Mirek Riedewald and Wolfgang Gatterbauer, and before that, he received a Diploma in Electrical and Computer Engineering from the National Technical University of Athens. His research interests lie in the design and implementation of novel practical algorithms for query processing in data systems. His work has been awarded with a Google PhD fellowship, a PODS 2021 “Best of” recognition, and a 2023 PhD Workshop Best Paper Award.
    https://ntzia.github.io/

  • (2/14, 9:00am): Gao Cong (Nanyang Technological University): Machine Learning for Databases: Foundations, Paradigms, and Open problems details)

    Title: Machine Learning for Databases: Foundations, Paradigms, and Open problems

    Abstract: The database community has actively explored how machine learning (ML) can be leveraged to address challenges in database systems (ML4DB) for the past several years. The talk will briefly cover several selected ML4DB tasks, and then focus on our perceptions on foundations and paradigms. First, query plan is used as input in several ML4DB tasks, and most studies usually focus on one task and develop a new design to represent query plans along with a ML4DB framework. This talk will present a different approach to these tasks by abstracting the query plan representation as a common component or foundation for multiple ML4DB tasks. This talk will also present a summary of finding of experimental study on query plan representation. Second, learned indexes are proposed to learn a ML model to replace the classic indexes. This talk will present a different paradigm of using ML for indexes, namely enhancing the classic indexes with ML, rather than replacing them. The talk will end with a discussion on some open problems.

    Several related papers:

    • Yue Zhao, Gao Cong, Jiachen Shi, Chunyan Miao: QueryFormer: A Tree Transformer Model for Query Plan Representation. Proc. VLDB Endow. 15(8): 1658-1670 (2022)
    • Yue Zhao, Zhaodonghui Li, Gao Cong: A Comparative Study and Component Analysis of Query Plan Representation Techniques in ML4DB Studies, VLDB 24
    • Jingyi Yang, Gao Cong: PLATON: Top-down R-tree Packing with Learned Partition Policy. SIGMOD24
    • Tu Gu, Kaiyu Feng, Gao Cong, Cheng Long, Zheng Wang, Sheng Wang: The RLR-Tree: A Reinforcement Learning Based R-Tree for Spatial Data. SIGMOD 2023
    • Jiangneng Li, Zheng Wang, Gao Cong, Cheng Long, Han Mao Kiah, Bin Cui: Towards Designing and Learning Piecewise Space-Filling Curves. Proc. VLDB Endow. 16(9): 2158-2171 (2023)

    Bio: Gao Cong is currently a Professor in the School of Computer Science and Engineering at Nanyang Technological University (NTU), and a co-director for Singtel Cognitive and Artificial Intelligence Lab for Enterprises@NTU. He previously worked at Aalborg University, Denmark, Microsoft Research Asia, and the University of Edinburgh. His current research interests include ML4DB, spatial data management, spatial-temporal data mining, and recommendation systems. He received the best paper runner-up awards at the WSDM'20 and WSDM'22 conferences for two of his research papers. His citation in Google Scholar was over 18,000 with H-index 70. He served as a PC co-chair for ICDE’2022, the associate general chair of KDD’21, a PC co-chair for E&A track of VLDB 2014, and a PC vice-Chair for ICDE’18. He is an associate editor for ACM Transactions on Database Systems (TODS) and the vice Chair of ACM KDD Singapore chapter.

  • (2/23, 12:00pm): Manuel Bodirsky (TU Dresden): The Complexity of Resilience Problems via Valued Constraint Satisfaction Problems [hosted by Wolfgang] details)

    TITLE
    The Complexity of Resilience Problems via Valued Constraint Satisfaction Problems

    ABSTRACT
    Valued constraint satisfaction problems (VCSPs) constitute a large class of computational optimisation problems. It was shown recently that, over finite domains, every VCSP is in P or NP-complete, depending on the admitted cost functions. In this article, we study cost functions over countably infinite domains whose automorphisms form an oligomorphic permutation group. Our results include a hardness condition based on a generalisation of pp-constructability as known from classical CSPs and a polynomial-time tractability condition based on the concept of fractional polymorphisms. We then observe that the resilience problem for unions of conjunctive queries (UCQs) studied in database theory, under bag semantics, may be viewed as a special case of the VCSPs that we consider. We obtain a complexity dichotomy for the case of acyclic UCQs and exemplarily use our methods to determine the complexity of a query that had remained open in the literature. Further, we conjecture that our hardness and tractability conditions match for resilience problems for UCQs.

    RELATED WORK
    1) The Complexity of Resilience Problems via Valued Constraint Satisfaction Problems.
    https://arxiv.org/abs/2309.15654

    2) The complexity of temporal constraint satisfaction problems. JACM 2010.
    https://dl.acm.org/doi/10.1145/1667053.1667058

    3) Complexity of Infinite-Domain Constraint Satisfaction. 2021.
    https://www.cambridge.org/de/universitypress/subjects/mathematics/logic-categories-and-sets/complexity-infinite-domain-constraint-satisfaction?format=HB&isbn=9781107042841

    4) Datalog and constraint satisfaction with infinite templates. JCSS 2013.
    https://www.sciencedirect.com/science/article/pii/S0022000012001213

    SHORT BIO
    https://tu-dresden.de/mn/math/algebra/bodirsky
    Since 3.2023: PI of the ERC Synergy Grant "POCOCOP" (jointly with Libor Barto, Charles University Prague, and Michael Pinsker, TU Wien)
    Consolidator Grant of the ERC for the project "CSP-Infinity'' (2016-2021).
    Since August 2014: Professor for Algebra and Discrete Structures at TU Dresden.
    Starting Grant of the ERC for the project "Constraint Satisfaction Problems: Algorithms and Complexity'' (2011).
    2008-2014: Chargé de Recherche of the CNRS at Ecole Polytechnique.
    2005-2007: Postdoc in the DFG research group "Algorithms, Structure, Randomness'' and researcher at the Institute of Algorithms and Complexity at HU Berlin.
    2004: PhD at Humboldt-University Berlin (supervision: Prof. Dr. Hans Jürgen Prömel) with a scholarship of the DFG research training group "Combinatorics, Geometry, and Computation''.
    2001: Diploma in Computer Science (minor: mathematics) at Saarland University.

  • (2/28, 9:00am): Steven Euijong Whang (KAIST): Towards a Holistic Framework for Data-centric Responsible AI details)

    Title: 

    Towards a Holistic Framework for Data-centric Responsible AI

    Abstract: 

    Responsible AI is becoming critical as AI is widely used in our everyday lives. Many companies that deploy AI publicly state that when training a model, we not only need to improve its accuracy, but also need to guarantee that the model does not discriminate against users (fairness), is resilient to noisy or poisoned data (robustness), protects personal information (privacy), and is interpretable (explainability) among others. In addition, these objectives are not only relevant to model training, but to all steps of machine learning starting from the data. A holy grail challenge is to support all of these objectives in one holistic framework that is also easy to use. To this end, I will give an overview of the data-centric responsible AI research in the Data Intelligence Lab at KAIST. I will first introduce fairness techniques throughout the machine learning process: fair model training using sample selection (ICLR’21), fair data labeling (VLDB’24, ACM SIGMOD’23), and fair data acquisition (ACM SIGMOD’21). Next, I will explain our works on other responsible AI objectives: fair and robust training (NeurIPS’21, ICML’20), privacy using disinformation (AAAI’23), and explainability-first clustering (AAAI’23). We believe we are just scratching the surface of a truly holistic solution and that many exciting challenges lie ahead.

    Related papers:

    Bio:

    Steven Euijong Whang is an associate professor with tenure at KAIST EE and AI and leads the Data Intelligence Lab. His research interests include Responsible AI and Data-centric AI. He is an Associate Editor of IEEE TKDE (2023-2025), IEEE DE Bulletin (2023, 2024), and VLDB 2025. Previously he was a Research Scientist at Google Research and co-developed the data infrastructure of the TensorFlow Extended (TFX) machine learning platform. Steven received his Ph.D. in computer science in 2012 from Stanford University and his B.S. in computer science from KAIST in 2003. He was a Kwon Oh-Hyun Endowed Chair Professor (2020-2023) and received a Google AI Focused Research Award (2018, the first in Asia). Homepage: https://stevenwhang.com

  • (3/1, 1:30pm): Amélie Marian (Rutgers University): Explainable Disparity Compensation for Fair Ranking details)

    Title: Explainable Disparity Compensation for Fair Ranking

    Abstract: Ranking functions are commonly used to assist in decision-making in a wide variety of applications. Oftentimes, these ranking functions produce disparate results for different populations because of bias in the underlying data. In this talk, I will discuss transparent participation metrics to clarify the ranking process, by assessing the contribution of each parameter used in the ranking function in the creation of the final ranked outcome, using information about the ranking functions themselves, as well as observations of the underlying distributions of the parameter values involved in the ranking. In addition, I will propose easily explainable data-driven compensatory measures for ranking functions that rely on the generation of bonus points given to members of underrepresented groups to address disparity in the ranking function. The bonus points can be set in advance, and can be combined, allowing for considering the intersections of representations and giving better transparency to stakeholders. I will present efficient sampling-based algorithms to calculate the number of bonus points to minimize disparity, and present the results of an evaluation of the proposed metrics over a real-world school admissions dataset.

    Related papers: 

    • Abraham Gale, Amélie Marian: Explaining Ranking Functions. Proc. VLDB Endow. 14(4): 640-652 (2020)
    • Abraham Gale, Amélie Marian: Explainable Disparity Compensation for Efficient Fair Ranking. ICDE 2024

    Bio: Amélie Marian is an Associate Professor in the Computer Science Department at Rutgers University. Her research interests are in Explainable Rankings, Accountability of Decision-making Systems, Transparent Democracy, Personal Data, and Data Integration. Her recent public scholarship work on explaining the NYC School Admission lottery process to families, in collaboration with elected parent representatives, was instrumental in increasing transparency and accountability in the NYC high school application system. Amélie received her Ph.D. in Computer Science from Columbia University in 2005. She is the recipient of a Microsoft Live Labs Award, three Google Research Awards, and an NSF CAREER award.

  • (3/22, 12:00pm): Xiaoqian Wang (Purdue University): Fairness in Machine Learning without Demographics details)

    Abstract:
    Machine learning has been widely adopted in real-world applications that have great social impact, and fairness in automatic decision-making systems has become an arising concern. It has been shown that without proper intervention, machine learning models can exhibit biases against certain demographic groups. However, simply excluding sensitive attributes (such as race, gender, religion) during training does not improve fairness, given the distributional disparities across different demographic groups. Although various methods have been proposed to mitigate biases, many rely on access to predefined sensitive information during training, presenting challenges in practical settings where collecting such data may be infeasible due to legal or regulatory constraints. Additionally, for many automatic decision-making systems, it is important to ensure fairness across multiple sensitive attributes, which presents a challenge for conventional methods formulated with only one predefined sensitive attribute.

    In response to these challenges, I will discuss our recent work focusing on fair machine learning without demographics. We provide rigorous theoretical guarantees of fairness for our approach. We also propose strategies to enhance efficiency, particularly in addressing fairness concerns in large-scale transformer models. Notably, our approach demonstrates effectiveness in improving fairness in large-scale transformers with substantially lower energy consumption.

    Related papers:

    • Junyi Chai, Taeuk Jang, Xiaoqian Wang: Fairness without Demographics through Knowledge Distillation. NeurIPS 2022
    • Shenyu Lu, Yipei Wang, Xiaoqian Wang: Debiasing Attention Mechanism in Transformer without Demographics. ICLR 2024

    Bio: Xiaoqian Wang is an Assistant Professor of Electrical and Computer Engineering at Purdue University. She received her Ph.D. degree from the University of Pittsburgh in 2019, and the B.S. degree from Zhejiang University in 2013. She focuses on designing novel machine learning models to improve interpretability, fairness, and robustness. She also work on the intersection of machine learning and bioinformatics, healthcare.  She has published more than 40 papers in top conferences and journals including NeurIPS, ICML, ICLR, AAAI, IJCAI, CVPR, KDD, ICDM, RECOMB, ECCB, IPMI, MICCAI, etc. She received an NSF CAREER award in 2022, and an AAAI distinguished paper award in 2023.

  • (3/27, 9:00am): Raymond Chi-Wing Wong (HKUST): Finding Best Tuple via Error-prone User Interaction details)

    Title: Finding Best Tuple via Error-prone User Interaction

    Abstract:
    In the literature of the database community, there are a lot of studies about finding a utility function from a user (representing the user's preference), via interaction with the user by asking a number of questions each requiring him/her to compare 2 points for choosing a more preferred point, in order to find the best tuple in the database containing a lot of tuples. In the real world, the user may make mistakes (carelessly), which means that s/he may answer some of the questions wrongly. Unfortunately, existing interaction algorithms may find the undesirable point based on the wrongly learnt utility function because they assume that all answers from the user are 100% correct. In particular, even if the user answers only 1 wrong answer, the output of the existing algorithms may be far away from the users' real need. Motivated by this, in this paper, we propose a new problem of finding the most interesting point via interaction which is robust to possible mistakes made by a user. Besides, we propose (1) an algorithm that asks an asymptotically optimal number of questions when the dataset contains 2 dimensions and (2) two algorithms with provable performance guarantee when the dataset contains d dimensions where d = 2. Experiments on real and synthetic datasets show that our algorithms outperform the existing ones with a higher accuracy with only a small number of questions asked.

    Related Paper:
    Qixu Chen, Raymond Chi-Wing Wong: Finding Best Tuple via Error-prone User Interaction. ICDE 2023: 1803-1816. https://home.cse.ust.hk/~raywong/paper/icde23-interactionError.pdf

    Biography:
    Raymond Chi-Wing Wong is a Professor in Computer Science and Engineering (CSE) of The Hong Kong University of Science and Technology (HKUST). He is currently the associate head of Department of Computer Science and Engineering (CSE). He was the associate director of the Data Science & Technology (DSCT) program (from 2019 to 2021), the director of the Risk Management and Business Intelligence (RMBI) program (from 2017 to 2019), the director of the Computer Engineering (CPEG) program (from 2014 to 2016) and the associate director of the Computer Engineering (CPEG) program (from 2012 to 2014). He received the BSc, MPhil and PhD degrees in Computer Science and Engineering in the Chinese University of Hong Kong (CUHK) in 2002, 2004 and 2008, respectively. In 2004-2005, he worked as a research and development assistant under an R&D project funded by ITF and a local industrial company called Lifewood.
    He received 40 awards. He published 119 conference papers (e.g., SIGMOD, SIGKDD, VLDB, ICDE and ICDM), 46 journal/chapter papers (e.g., TODS, DAMI, TKDE, VLDB journal and TKDD) and 1 book. He reviewed papers from conferences and journals related to data mining and database, including VLDB conference, SIGMOD, TODS, VLDB Journal, TKDE, TKDD, ICDE, SIGKDD, ICDM, DAMI, DaWaK, PAKDD, EDBT and IJDWM. He is a program committee member of conferences, including SIGMOD, VLDB, ICDE, KDD, ICDM and SDM, and a referee of journals, including TODS, VLDBJ, TKDE, TKDD, DAMI and KAIS.
    His research interests include database, data mining and artificial intelligence.

  • (4/2, 4:00pm): Wolfgang Gatterbauer (Northeastern University): HITSNDIFFS: From Truth Discovery to Ability Discovery by Recovering Matrices with the Consecutive Ones Property details)

    ABSTRACT:
    We analyze a general problem in a crowd-sourced setting where one user asks a question (also called item) and other users return answers (also called labels) for this question. Different from existing crowd sourcing work which focuses on finding the most appropriate label for the question (the “truth”), our problem is to determine a ranking of the users based on their ability to answer questions. We call this problem “ability discovery” to emphasize the connection to and duality with the more well-studied problem of “truth discovery”.

    To model items and their labels in a principled way, we draw upon Item Response Theory (IRT) which is the widely accepted theory behind standardized tests such as SAT and GRE. We start from an idealized setting where the relative performance of users is consistent across items and better users choose better fitting labels for each item. We posit that a principled algorithmic solution to our more general problem should solve this ideal setting correctly and observe that the response matrices in this setting obey the Consecutive Ones Property (C1P). While C1P is well understood algorithmically with various discrete algorithms, we devise a novel variant of the HITS algorithm which we call “HITSnDIFFs” (or HnD), and prove that it can recover the ideal C1P-permutation in case it exists. Unlike fast combinatorial algorithms for finding the consecutive ones permutation (if it exists), HnD also returns an ordering when such a permutation does not exist. Thus it provides a principled heuristic for our problem that is guaranteed to return the correct answer in the ideal setting. Our experiments show that HnD produces user rankings with robustly high accuracy compared to state-of-the-art truth discovery methods. We also show that our novel variant of HITS scales better in the number of users than ABH, the only prior spectral C1P reconstruction algorithm.

    RELATED WORK
    Based on joint work with Zixuan Chen, Subhodeep Mitra and R Ravi
    ICDE'24 preprint:
    https://arxiv.org/pdf/2401.00013

  • (4/5, 12:00pm): Senjuti Basu Roy (New Jersey Institute of Technology): Optimization Opportunities in Human-in-the-loop Systems details)

    Title: 
    Optimization Opportunities in Human-in-the-loop Systems

    Abstract: 
    An emerging trend is to leverage an under-explored and richly heterogeneous pool of human knowledge inside machine algorithms, a practice popularly termed as human-in-the-loop (HIL) process. A wide variety of applications, starting from query processing to text translation, feature engineering, or even human decision making in complex uncertain environments stand to benefit from such synergistic man-machine collaboration. This talk will discuss ongoing federally funded research projects in my lab, some recent research results, and impacts that investigate a variety of optimization opportunities inside such HIL systems, considering the roles and responsibilities of three key stakeholders - humans (workers), machines (algorithms), and platforms (online infrastructure where the work takes place). I will highlight data management challenges and scalability opportunities present in these problems.

    Related papers:
    a. Md Mouinul Islam, Dong Wei, Baruch Schieber, Senjuti Basu Roy:
    Satisfying Complex Top-k Fairness Constraints by Preference Substitutions. Proc. VLDB Endow. 16(2): 317-329 (2022)
    b. Dong Wei, Md Mouinul Islam, Baruch Schieber, Senjuti Basu Roy:
    Rank Aggregation with Proportionate Fairness. SIGMOD Conference 2022: 262-275
    c. Sepideh Nikookar, Paras Sakharkar, Sathyanarayanan Somasunder, Senjuti Basu Roy, Adam Bienkowski, Matthew Macesker, Krishna R. Pattipati, David Sidoti:
    Cooperative Route Planning Framework for Multiple Distributed Assets in Maritime Applications. SIGMOD Conference 2022: 1518-1527

    Speaker Bio:
    Senjuti Basu Roy is the Panasonic Chair in Sustainability and an Associate Professor in the Department of Computer Science at the New Jersey Institute of Technology. Her research focus lies at the intersection of data management, data exploration, and AI, especially enabling human-machine analytics in scale. Senjuti has published more than 85 research papers in high impact data management and data mining conferences and journals. She is currently serving as the track chair of Economics, Online Markets, and Human Computation track of The ACM Web Conference 2024, an Associate Editor of ICDE 2024,  SIGMOD 2025, VLDB Journal, and Information Systems. She is a recipient of the NSF CAREER Award, a Presidential Early Career Award for Scientists and Engineers (PECASE) nominee, and one of the 100 invited early career engineers selected by the National Academy of Engineering in 2021.

  • (4/11, 10:00am): Ahmed Abdi (London School of Economics): A survey of ideal matrices [hosted by Wolfgang] details)

    TITLE
    A survey of ideal matrices

    ABSTRACT
    A 0,1 matrix M is ideal if the set covering polyhedron {x>=0: Mx>=1} is integral. This notion remains a cornerstone in the theory of integer and linear programming. In this survey talk, I will talk about the historical context of ideal matrices and related objects, present some old and recent findings, and end with some exciting conjectures in the area.

    RELATED WORK
    * Abdi and Lee: Deltas, extended odd holes and their blockers. Journal of Combinatorial Theory, Series B 136:193-203 (2019), https://doi.org/10.1016/j.jctb.2018.10.006
    * Abdi, Cornuejols, Lee: Intersecting restrictions in clutters. Combinatorica 40:605-623 (2020), https://doi.org/10.1007/s00493-020-4076-2
    * Abdi, Cornuejols, Guenin, Tuncel: Testing idealness in the filter oracle model. Operation Research Letters 50(6): 753-755 (2022), https://doi.org/10.1016/j.orl.2022.11.004

    SHORT BIO
    I obtained my PhD degree in Mathematics at the University of Waterloo in 2018; the title of my dissertation was "Ideal Clutters". I then spent a year and a bit as a postdoctoral fellow at Carnegie Mellon University, before joining the London School of Economics as a tenure-track assistant professor in 2019. My research expertise are broadly in Combinatorial Optimization, Integer and Linear Programming, Matroid Theory, and Graph Theory. More specifically, the focus of my research so far has been the study of Ideal Clutters, a polyhedrally defined class of objects with deep connections to the areas mentioned, and its applications.
    https://www.ahmadabdi.com/

  • (4/12, 1:30pm): Sainyam Galhotra (Cornell University): Beyond Correlation: Harnessing Causal Inference for Enhanced Data Insights details)

    Title: Beyond Correlation: Harnessing Causal Inference for Enhanced Data Insights

    Causal reasoning has emerged as a crucial element in artificial intelligence, healthcare, and economics, offering a powerful tool for understanding the complex relationships between variables. However, the practical application of causal reasoning methods has been hindered by the extensive and challenging data preparation required. This talk will explore the promise of leveraging causal reasoning methods for various data analytics tasks. We will introduce Hyper, a framework that answers what-if and how-to queries over relational data, which are commonly required for risk management, policy design, exploratory data analysis, and experiment design. By extending SQL syntax, users can express hypothetical queries. We have developed novel techniques that capture the probabilistic dependencies between attributes in the form of a causal graph and use it to efficiently answer user queries. Furthermore, we will discuss the application of causal reasoning techniques for other data preparation and analytics tasks such as fairness, explainability, and debugging. The talk will conclude with a glimpse of how data management techniques can be invaluable for causal discovery, highlighting the need for further exploration of this synergy.

    Related papers:
    https://dl.acm.org/doi/abs/10.1145/3514221.3526149
    https://dl.acm.org/doi/abs/10.1145/3514221.3517864

    Bio: Sainyam Galhotra is an Assistant Professor in the Computer Science department at Cornell University. Prior to this role, he held the position of Computing Innovation Postdoctoral Fellow at the University of Chicago. His research is driven by the ambition to establish the foundational principles of responsible data science, which is pivotal for the efficient development and deployment of reliable data analytics applications. Galhotra's work is distinguished by its interdisciplinary approach, integrating techniques from Data Management, Probabilistic Methods, Causal Inference, Machine Learning, and Software Engineering. He earned his Ph.D. from the University of Massachusetts Amherst. His contributions to the field have been recognized with several accolades, including the Best Paper Award at FSE 2017, and the Most Reproducible Paper Award at SIGMOD in both 2017 and 2018. Additionally, he is a DAAD AInet Fellow and the inaugural recipient of the Krithi Ramamritham Award at UMass, an honor bestowed for outstanding contributions to database research.

  • (4/19, 12:00pm): Ke Yang (University of Texas at San Antonio): Data-driven Solutions for AI Responsibility details)

    Title: Data-driven Solutions for AI Responsibility

    Abstract: Data-driven tasks are used in assisting decisions in many domains that greatly affect people’s lives. Though these tasks are efficient, they inevitably inherit bias in the input data and show problematic behaviors. Deploying unfair algorithms in critical domains will reinforce and legitimize the inequitable treatment of groups who are already disadvantaged. Moreover, the deployment of these tasks lacks transparency and accountability. In this talk, I will introduce my research on mitigating bias in two types of data-driven tasks: rankings and classifications, which are frequently set up to automate decision-making processes. I will also highlight some ongoing work of promoting responsible AI by data-driven solutions.

    Related papers:
    1. https://arxiv.org/abs/2303.17566 (ICDE 24 preprint)
    2. https://drops.dagstuhl.de/storage/00lipics/lipics-vol192-forc2021/LIPIcs.FORC.2021.7/LIPIcs.FORC.2021.7.pdf
    3. https://dl.acm.org/doi/pdf/10.1145/3555041.3589405

    Bio: Dr. Ke Yang is an Assistant Professor in Computer Science in the College of Sciences, at the University of Texas San Antonio. She found the Cohort for AI REsponsibility (CAREAI) at UTSA. Her research centers around data management and machine learning, emphasizing algorithmic fairness, model explanation, and the social impact of AI and machine learning. Prior to joining UTSA, she was a Postdoctoral Research Associate in the College of Information and Computer Sciences, at the University of Massachusetts Amherst.  Dr. Yang received her Ph.D. degree from New York University. She has received a Postdoctoral Fellowship from the Center for Data Science at UMass and a Pearl Brownstein Doctoral Research Award from the Tandon School of Engineering at NYU.

Fall 2023 (Renée)

  • (9/8, 12:00pm): Fatemeh Nargesian (University of Rochester). Algorithms for Data Distribution Tailoring details)

    Title: Algorithms for Data Distribution Tailoring
    Abstract: The rise of powerful and data-hungry models has shifted the focus from model-centric AI to data-centric AI, where the primary effort lies in collecting, cleaning, and improving the quality of data. Data integration is a core task in the data pre-processing step of ML pipelines. Despite decades of research in this field, the synergies of data bias and distribution with integration are still understudied. This lecture will be about the first step of integration: data acquisition.

    The lecture explores ways of acquiring a target dataset based on distribution requirements. I will start by defining and motivating the data distribution tailoring problem. Then, I will describe some building blocks including concepts from multi-armed bandit and coupon collectors' problems. Finally, we will see how these concepts can be applied in designing algorithms that enable tailoring a dataset with desired distribution requirements from multiple sources.

    Bio: Fatemeh Nargesian is an assistant professor of computer science at the University of Rochester. She obtained her PhD at the University of Toronto. Her research interests are in data management for AI-based data analytics and scientific time-series management.
    Paper: Tailoring data source distributions for fairness-aware data integration

  • (10/13, 12:00pm): Babak Salimi (UCSD) Certifying Fair Predictive Models in the Face of Selection Bias details)

    Title:  "Certifying Fair Predictive Models in the Face of Selection Bias"

    Abstract: The widespread use of data-driven algorithmic decision making in crucial areas such as hiring, loan assessments, medical diagnoses, and pretrial release has raised questions about the accuracy and fairness of these algorithms. Selection bias, a prevalent data quality issue in sensitive domains, is a major obstacle to creating fair predictive models. Most existing fair predictive modeling approaches are unable to address selection bias. To overcome this challenge, we introduce a new framework called CRAB that leverages principles of data management and query answering from inconsistent and incomplete databases to produce certifiably fair predictive models.

    In this talk, we will delve into the concept of consistent range approximation, which plays a critical role in approximating the fairness of predictive models on a target population using biased data. We will also discuss the difficulties in achieving consistent range approximation when limited or no external data is available. With the help of our framework, CRAB, we can train predictive models that are certifiably fair on the target population, even in the presence of selection bias. This talk will provide valuable insights for those working in data management, ML, and responsible data science and emphasize the importance of addressing selection bias in algorithmic decision making.

    Related paper:  https://arxiv.org/pdf/2212.10839.pdf

    Bio: Babak Salimi is an Assistant Professor in the HDSI department at UC San Diego. Prior to this, he was a postdoctoral researcher in the Computer Science and Engineering Department at the University of Washington, where he collaborated with Prof. Dan Suciu and the database group. Salimi received his Ph.D. from Carleton University's School of Computer Science, where he was advised by Prof. Leopoldo Bertossi. His research focuses on responsible data management and causal inference, including algorithmic fairness and transparency. He has made several significant contributions to the understanding of responsible data management and analysis, including explainability, fairness, reliability, and robustness. Salimi also has a strong interest in database theory and data management. His research achievements have been acknowledged with awards such as the Postdoc Research Award at the University of Washington, the Best Demonstration Paper Award at VLDB 2018, the Best Paper Award at SIGMOD 2019, and the Research Highlight Award at SIGMOD 2020.

  • (10/16, 2:00pm): C. Mohan (IBM Fellow Emeritus) A Survey of Cloud Database Systems details)

    A Survey of Cloud Database Systems

    C. Mohan

    Monday October 16th, 2pm ISEC 655 (805 Columbus Ave, Boston, MA 02120)

    Distinguished Professor of Science (Hong Kong Baptist University, China)
    Distinguished Visiting Professor (Tsinghua University, China)
    Retired IBM Fellow (IBM Research, USA)

    seemohan@gmail.com https://bit.ly/CMoDUK

     

    In this talk, I will first introduce traditional (non-cloud) parallel and distributed database systems. Concepts like SQL and NoSQL systems, data replication, distributed and parallel query processing, and data recovery after different types of failures will be covered. Then, I will discuss how the emergence of the (public) cloud has introduced new requirements on parallel and distributed database systems, and how such requirements have necessitated fundamental changes to the architectures of such systems. One of the most important cloud-related developments is the emergence of the concept of storage disaggregation which allows for dynamic allocation and deallocation of compute nodes for query processing. Exploitation of AI technologies to make the management of such systems more automated is another trend. I will also argue that ideas from shared disks on-premises DBMSs need to be adopted and adapted in the cloud environment to support increased capacity for read-write transactions, and to improve availability during system upgrades and compute node failures.

     

    I will illustrate the related developments by discussing some of the details of systems like Alibaba POLARDB, Microsoft Azure SQL DB, Microsoft Socrates, Azure Synapse POLARIS, Google Spanner, Google AlloyDB, CockroachDB, Amazon Aurora and Snowflake. 

    BIO:  Dr. C. Mohan is currently a Distinguished Professor of Science at Hong Kong Baptist University, a Distinguished Visiting Professor at Tsinghua University in China, and a member of the inaugural Board of Governors of Digital University Kerala. He retired in June 2020 from being an IBM Fellow at the IBM Almaden Research Center in Silicon Valley. He was an IBM researcher for 38.5 years in the database, blockchain, AI and related areas, impacting numerous IBM and non-IBM products, the research and academic communities, and standards, especially with his invention of the well-known ARIES family of database locking and recovery algorithms, and the Presumed Abort distributed commit protocol. This IBM (1997-2020), ACM (2002-) and IEEE (2002-) Fellow has also served as the IBM India Chief Scientist (2006-2009). In addition to receiving the ACM SIGMOD Edgar F. Codd Innovations Award (1996), the VLDB 10 Year Best Paper Award (1999) and numerous IBM awards, Mohan was elected to the United States and Indian National Academies of Engineering (2009), and named an IBM Master Inventor (1997). This Distinguished Alumnus of IIT Madras (1977) received his PhD at the University of Texas at Austin (1981). He is an inventor of 50 patents. During the last many years, he focused on Blockchain, AI, Big Data and Cloud technologies (https://bit.ly/sigBcPhttps://bit.ly/CMoTalks). Since 2017, he has been an evangelist of permissioned blockchains and the myth buster of permissionless blockchains. During 1H2021, Mohan was the Shaw Visiting Professor at the National University of Singapore. In 2019, he became an Honorary Advisor to the Tamil Nadu e-Governance Agency. In 2020, he joined the Advisory Board of the Kerala Blockchain Academy. Since 2016, Mohan has been a Distinguished Visiting Professor of China’s prestigious Tsinghua University. In 2023, he was named Distinguished Professor of Science of Hong Kong Baptist University. In 2021, he was inducted as a member of the inaugural Board of Governors of the new Indian university Digital University Kerala. Mohan has served on the advisory board of IEEE Spectrum, and on numerous conference and journal boards. During most of 2022, he was a consultant at Google with the title of Visiting Researcher. He has also been a Consultant to the Microsoft Data Team in 2020. Mohan is a frequent speaker in North America, Europe and Asia. He has given talks in 43 countries. He is highly active on social media and has a huge network of followers. More information can be found in the Wikipedia page at https://bit.ly/CMwIkP and his homepage at https://bit.ly/CMoDUK

  • (10/18, 12:00pm): Roee Shraga (WPI), One Algorithm to Rule Them All: On the Changing Roles of Humans in Data Integration details)

    One Algorithm to Rule Them All: On the Changing Roles of Humans in Data Integration


    Roee Shraga (WPI), 


    Data integration is an essential element of today’s ecosystem. In this article, we take the reader on a trip from the days when data integration was a human-intensive task, through the era of algorithmic-supported data integration and show a vision of what the future may bring.

    Paper:  https://ieeexplore.ieee.org/document/10098128

  • (10/25, 12:00pm): Alexandra Meliou (UMass), Diversity and Fairness in Data Selection details)

    Title: Diversity and Fairness in Data Selection
    Speaker:  Alexandra Melious

    Location:  WVH 446 
    Host:  Renée Miller (contact for zoom link)

    Abstract:
    Data is generated and collected from all aspects of human activity. However,
    while datasets grow large and are readily available, they are often
    down-sampled for various uses. This is often due to practical implications,
    such as requirements of efficiency or inherent limitations in human perception.
    While data subset selection is very common, deriving good subsets is a
    non-trivial task. In this talk, we will highlight the importance of diversity
    and fairness in data selection and will discuss algorithmic results on the
    optimization of diversity objectives subject to fairness constraints. We will
    further highlight some ongoing work on the impact of data selection in
    information visualization, and how the interplay of diversification and human
    perception can help derive data samples that better capture visual information.

    Related paper:

    Zafeiria Moumoulidou, Andrew McGregor, and Alexandra Meliou, Diverse Data Selection under Fairness Constraints, in International Conference on Database Theory, (ICDT), 2021, pp. 11:1–11:25.

  • (10/26, 11:00am): Gerardo Vitagliano (Hasso Plattner Institute), Modeling File Structure for Data Preparation details)

    Title: Modeling File Structure for Data Preparation
    Speaker:  Gerardo Vitagliano 
    Time:  11am ET
    Location:  WVH 446, 440 Huntington Ave., Boston
    Host:  Renée Miller 

    Tabular data is at the core of many data-driven processes. Tabular files are referred to as "structured" data, implying that they can be easily processed automatically. However, in real-world processes, extensive data preparation is often required to parse, repair, or transform these files before leveraging their contents.  We overview several models to represent the heterogeneous structure of real-world files: a formal model, an implicit model, and an embedding model.

    We present MaGRiTTE, a transformer-based architecture that models file structure as vectorial embeddings, and apply it to solve automated data preparation tasks.Bio:

    Bio:  Gerardo Vitagliano is a PhD student at the Hasso Plattner Institute, in the Information System group of Prof. Felix Naumann. His research interests are structural data preparation, automated data management, and the intersection between machine learning and data quality. 

    Papers:  http://www.vldb.org/pvldb/vol15/p646-vitagliano.pdf
    https://openproceedings.org/2021/conf/edbt/p32.pdf


  • (10/27, 12:00pm): Dakuo Wang (Northeastern), Designing Human-Centered AI Systems for Human-AI Collaboration details)

    Title:  Designing Human-Centered AI Systems for Human-AI Collaboration

    Speaker:  Dakuo Wang (Northeastern)

    WVH 446, 440 Huntington Ave, Boston

    Host:  Renée Miller (contact for zoom link)

    Abstract:
    Human-Centered AI (HCAI) refers to the research effort that aims to design and implement AI techniques to support various human tasks, while taking human needs into consideration and preserving human control. Prior work has focused on human-AI interaction design and explainable AI research (XAI). But, why do some human-centered AI systems work, and some not? In this talk, I show how we can learn from human-human collaboration to design and build AI systems for human-AI collaboration. This work serves as a cornerstone towards the ultimate goal of Human-AI Collaboration, where AI and humans can take complementary and indispensable roles to achieve a better outcome and experience.

    Bio:
    Dakuo Wang is an Associate Professor at Northeastern University, jointly appointed at Khoury College of Computer Sciences and the College of Arts, Media and Design. His research lies at the intersection of human-computer interaction (HCI) and artificial intelligence (AI), with a focus on the exploration, development, and evaluation of human-centered AI (HCAI) systems. The overarching goal is to democratize AI for every person and every organization, so that they can access their own AI and collaborate with these real-world AI systems (human-AI collaboration).

    Before joining Northeastern, Dakuo was a Senior Staff Member at IBM Research, Principal Investigator at MIT-IBM Watson AI Lab, and a Visiting Scholar at Stanford Institute for Human-Centered Artificial Intelligence. He got his Ph.D. from the University of California Irvine (advisor: Judith Olson and Gary Olson). He has worked as a designer, researcher, and engineer in the U.S., China, and France. He serves in organizing committees, program committees, and editorial boards for a variety of venues, and ACM has recognized him as an ACM Distinguished Speaker.

    Paper:   https://arxiv.org/abs/2305.16641

  • (11/3, 12:00pm): Michael Cafarella (MIT), Sawmill: Applying Causal Reasoning to System Logs details)

    Speaker:  Michael Cafarella, MIT

    Location:  WVH 446 
    Host:  Renée Miller (contact for zoom link)

    Abstract:
    Causal reasoning is a growing part of AI and has become a core part of economists’ toolkit. But applying causal reasoning methods in practice has required a large amount of careful, difficult database preparation. As a result, causal methods have not penetrated into more mundane domains that could benefit from causal discovery.
    This talk will describe early work on how to apply causal methods to the practical domain of distributed system debugging. It offers a framework for processing log files and automatically constructing a representation that is amenable to causal interpretation. We also present algorithms for how to efficiently use this representation for causal discovery of dynamics of a distributed system, and for marshaling scarce user curation attention for building domain knowledge needed to answer practical systems questions.
    Title: Sawmill: Applying Causal Reasoning to System Logs

    Bio:  Michael Cafarella is a Principal Research Scientist in the MIT Computer Science and Artificial Intelligence Lab. His research interests include databases, information extraction, data integration, and applying data-intensive methods to economics. He has published extensively in venues such as SIGMOD, VLDB, and elsewhere. Mike received his PhD from the University of Washington, Seattle, in 2009. His academic awards include the NSF CAREER award, the Sloan Research Fellowship, and the VLDB Ten-Year Best Paper award. In addition to his academic work, he co-founded the widely-used Hadoop open-source project and information extraction startup Lattice Data.√

  • (11/8, 12:00pm): Abolfazl Asudeh (UIC), Maximizing Neutrality in News Ordering details)

    Abolfazl Asudeh, UIC

    Maximizing Neutrality in News Ordering

    noon ET Nov 8,  WVH 446, 440 Huntington Ave, Boston

    Host:  Renee Miller (contact for zoom link)

    The detection of fake news has received increasing attention over the past few years, but there are more subtle ways of deceiving one's audience. In addition to the content of news stories, their presentation can also be made misleading or biased. In this work, we study the impact of the ordering of news stories on audience perception. We introduce the problems of detecting cherry-picked news orderings and maximizing neutrality in news orderings. We prove hardness results and present several algorithms for approximately solving these problems. Furthermore, we provide extensive experimental results and present evidence of potential cherry-picking in the real world.

    Paper: KDD 2023

    Bio:  Abolfazl Asudeh is an assistant professor of Computer Science at the University of Illinois Chicago and the director of Innovative Data Exploration Laboratory (InDeX Lab). He is a VLDB Ambassador, and a regular PC member of Database flagship venues.His research spans to different aspects of Big Data and Data Science, including Data Management, Information Retrieval, and Data Mining, for which he designs efficient, accurate, and scalable algorithmic solutions. Data Equity Systems, Algorithmic Fairness, and Data-centric AI are his major focus in research. His research interest also includes: Ranking, Social Networks Analysis, Large-Scale Computation on Limited Resources, Computational Fact Checking, Data management for Machine Learning, Web Databases, and applied Computational Geometry.

  • (11/10, 12:00pm): Boris Glavic (IIT-Chicago), UNCERTAIN DATA MANAGEMENT FOR RESPONSIBLE DATA MANAGEMENT Lightweight Uncertain Data Models to The Rescue details)

    Speaker: Boris Glavic, IIT-Chicago

    Title:  UNCERTAIN DATA MANAGEMENT FOR RESPONSIBLE DATA                            MANAGEMENT      
         Lightweight Uncertain Data Models to The Rescue

    Host:  Renee Miller (contact for zoom link)

    noon ET Nov 10, WVH 446, 440 Huntington Ave, Boston

    Uncertainty arises naturally in many application domains due to data
    entry errors, sensor errors and noise, uncertainty in information
    extraction and parsing, selection bias, ambiguity from data integration,
    and heuristic data preparation & cleaning techniques. Analyzing
    uncertain data without accounting for its uncertainty can create hard to
    trace errors with severe real world implications. Thus, tracking
    uncertainty in data is essential for responsible data management. In the
    first part of this talk, I will discuss how uncertainty in data
    collection and preparation leads to biased and untrustworthy analysis
    results. Afterwards, I will introduce the audience to methods for
    modeling uncertainty and for determining how uncertainty impacts
    analysis results. Specifically, I will focus on uncertainty-annotated
    databases (UA-DBs), a light-weight and practical uncertain data model.
    UA-DBs accommodate a wide range of uncertain data sources, can
    approximate many existing incomplete and probabilistic data models as
    well as can approximate the uncertainty arising naturally in many data
    cleaning techniques, and, most importantly, query answering over UA-DBs
    is efficient for a very large class of queries.

    Uncertainty Annotated Databases - a Lightweight Approach for Approximating Certain Answers. Su Feng, Aaron Huber, Boris Glavic, Oliver Kennedy.  SIGMOD19

    Efficient Uncertainty Tracking for Complex Queries with Attribute-Level Bounds. Su Feng, Aaron Huber, Boris Glavic, Oliver Kennedy.  SIGMOD21

  • (11/15, 12:00pm): Stratis Ioannidis (Northeastern), Explanations of Black-Box Models based on Directional Feature Interactions details)

    Stratis Ioannidis (Northeastern)
    Noon EST November 15 WVH 446, 440 Huntington Ave, Boston 
    Host:  Renée Miller (contact for zoom link)

    Title: Explanations of Black-Box Models based on Directional Feature
    Interactions

    Abstract: As machine learning algorithms are deployed ubiquitously to
    a variety of domains, it is imperative to make these often black-box
    models transparent. Several recent works explain black-box models by
    capturing the most influential features for prediction per instance;
    such explanation methods are univariate, as they characterize
    importance per feature. We extend univariate explanation to a
    higher-order; this enhances explainability, as bivariate methods can
    capture feature interactions in black-box models, represented as a
    directed graph. Analyzing this graph enables us to discover groups of
    features that are equally important (i.e., interchangeable), while the
    notion of directionality allows us to identify the most influential
    features. We apply our bivariate method on Shapley value explanations,
    and experimentally demonstrate the ability of directional explanations
    to discover feature interactions. We show the superiority of our
    method against state-of-the-art on CIFAR10, IMDB, Census, Divorce,
    Drug, and gene data.

    This is joint work with Aria Masoomi, Davin Hill, Zhonghui Xu, Craig P
    Hersh, Edwin K. Silverman, Peter J. Castaldi, and Jennifer Dy

    Paper:  https://ece.northeastern.edu/fac-ece/ioannidis/static/pdf/2022/explanations_of_black_box_mode.pdf

  • (11/17, 12:00pm): Vasilis Efthymiou (Univ. Athens), Fairness in Algorithmic Decision Making details)

    Title: Fairness in Algorithmic Decision Making
    Speaker: Vasilis Efthymiou
    Host:  Renee Miller (contact for zoom link)

    Abstract
    From providing suggestions of restaurants and movies, to deciding who to hire, who goes to jail and who gets a liver transplant, decision-making algorithms are becoming increasingly important for our lives. By automating such decisions, we aspire to overcome human flaws like fatigue, greed, corruption and prejudice. This talk will explain that some of those human traits are also (perhaps inevitably) embedded into our algorithms and the data that we train those algorithms on. We will discuss how algorithms learn to discriminate, as well as how we can identify and prevent such discrimination. We will conclude the talk with a use-case of studying algorithmic bias in the task of entity resolution, a task that is very commonly used for cleaning the data that are later used to train downstream machine learning applications.

    Bio
    Vasilis Efthymiou is an incoming Assistant Professor at the Department of Informatics and Telematics of Harokopio University of Athens (HUA), Greece. He received his BSc in computer science in 2010, his master's degree in Information Systems and in Bioinformatics in 2012, and his PhD on the topic of entity resolution in the Web of data in 2017, from the Computer Science Department of University of Crete (UOC). Before joining HUA, he was a postdoctoral researcher at the Information Systems Laboratory of FORTH-ICS, a visiting instructor at UOC, and a postdoctoral researcher at the database group of IBM Research in Almaden Research Center, CA, USA. After his PhD research internship at IBM T.J. Watson Research Center, NY, USA, on matching Web tables to Knowledge Graphs (KGs), he has been co-organizing the SemTab challenges at ISWC, an effort to benchmark systems dealing with the tabular data to KG matching problem, and the TaDA workshop at VLDB. He has co-authored two books, more than 60 papers, and co-invented four US patents.

    Indicative publications for the ER use case: 
    Fairness-aware ER (short paper): https://isl.ics.forth.gr/ResponsibleER/files/CIKM2021short_fairER_CR.pdf 
    KG sampling for ER: https://isl.ics.forth.gr/ResponsibleER/files/ESWC23_EA_Diversity_Fanourakis.pdf

  • (12/1, 12:00pm): Fei Chiang (McMaster University), Towards Diversified and Privatized Data Preparation, noon EST details)

    Towards Diversified and Privatized Data Preparation 

    Speaker:  Fei Chiang, McMaster University

    Host:  Renee Miller (contact for zoom link)

    Recent privacy legislation has aimed to restrict and control the amount of personal data published by companies and shared with third parties. Much of this real data is not only sensitive requiring anonymization but also contains characteristic details from a variety of individuals. This diversity is desirable in many applications ranging from Web search to drug and product development. Unfortunately, data anonymization techniques have largely ignored diversity in its published result. This inadvertently propagates underlying bias in subsequent data analysis. We study the problem of finding a diverse anonymized data instance where diversity is measured via a set of diversity constraints.   In this talk, I will present some of our work on finding a diverse anonymized instance, as well as extensions to data publishing techniques to consider data semantics while protecting sensitive values. 

    Bio: 

    Fei Chiang is an Associate Professor in the Department of Computing and Software (Faculty of Engineering), the Director of the Data Science Research Group, and a Faculty Fellow at the IBM Centre for Advanced Studies. She served as an inaugural Associate Director of the MacData Institute. Her research interests and industrial experience is in data management, spanning data quality, data profiling, temporal graphs, and database systems. She holds four patents for her work in self-managing database systems.  She received a 2018 Ontario Early Researcher Award, and a 2023 VLDB Distinguished Reviewer Award.  She received her M. Math from the University of Waterloo, and B.Sc and PhD degrees from the University of Toronto, all in Computer Science.

Summer 2023 (Wolfgang)

  • (5/3, 3:00pm): Kuldeep Meel (NUS): Distinct Elements in Streams: An Algorithm for the (Text) Book details)

    TITLE
    Distinct Elements in Streams: An Algorithm for the (Text) Book

    ABSTRACT
    Given a data stream of m elements, the Distinct Elements problem is to estimate the number of distinct elements in the stream. Distinct Elements has been a well studied problem over four decades but all the current state-of-the-art algorithms are often difficult to analyze or impractical. I will present a simple, intuitive, sampling-based space-efficient algorithm whose description and the proof are accessible to undergraduates with a knowledge of basic probability theory.

    Besides the simplicity, the approach has significant theoretical and practical implications: our approach allowed us to resolve the open problem of (Discrete) Klee's Measure Problem in the streaming setting and build a state-of-the-art DNF counter in practice.

    RELEVANT PUBLICATIONS
    * PODS-21: https://dl.acm.org/doi/10.1145/3452021.3458333
    * PODS-22: https://www.comp.nus.edu.sg/~meel/Papers/pods22.pdf
    * ESA-22: https://www.comp.nus.edu.sg/~meel/Papers/esa22.pdf
    (ESA-22 paper simplified of techniques presented first in PODS-21 and PODS-22, so should be treated as the main reference)

    BIO
    Kuldeep Meel holds the NUS Presidential Young Professorship in the School of Computing at the National University of Singapore (NUS). His research interests lie at the intersection of Formal Methods and Artificial Intelligence. He is a recipient of the 2022 ACP Early Career Researcher Award, the 2019 NRF Fellowship for AI and was named AI's 10 to Watch by IEEE Intelligent Systems in 2020. His research program's recent recognitions include the CACM Research Highlight Award, ACM SIGMOD Research Highlight, IJCAI-22 Early Career Spotlight, "Best Papers of CAV" (2022 and 2020) invite from FMSD journal, IJCAI-19 Sister conferences best paper award track invitation.

  • (5/5, 11:00am): Jian Zhang (Northeastern University): Viper -- A Fast Snapshot Isolation Checker details)

    TITLE
    Viper: A Fast Snapshot Isolation Checker

    ABSTRACT
    Snapshot isolation (SI) is supported by most commercial databases and is widely used by applications. However, checking SI today—given a set of transactions, checking if they obey SI—is either slow or gives up soundness.

    We present Viper, an SI checker that is sound, complete, and fast. Viper checks black-box databases and hence is transparent to both users and databases. To be fast, viper introduces BC-polygraphs, a new representation of transaction dependencies. A BC-polygraph is acyclic iff transactions are SI, a theorem that we prove. Viper also introduces heuristic pruning, an optimization to accelerate checking SI by leveraging common knowledge of real-world database implementations. Besides vanilla SI, viper supports major SI variants including Strong SI, Generalized SI, and Strong Session SI. Our experiments show that given the same time budget, viper improves over baselines by 15× in the workload sizes being checked.

    RELATED PAPERS
    * Zhang, Ji, Mu, Tan. Viper: A Fast Snapshot Isolation Checker. (https://www.khoury.northeastern.edu/home/zhangjian3/attaches/eurosys23-final117.pdf)
    * Adya. Weak Consistency: A Generalized Theory and Optimistic Implementations for Distributed Transactions (https://pmg.csail.mit.edu/papers/adya-phd.pdf
    * Crooks, Pu, Alvisi, Clement. Seeing is Believing: A Client-Centric Specification of Database Isolation (https://www.cs.cornell.edu/lorenzo/papers/Crooks17Seeing.pdf")

    BIO
    Jian Zhang is a second-year PhD student of Khoury College of Computer Sciences at Northeastern University, advised by Cheng Tan. His main research interests are verifying correctness of distributed systems and building efficient distributed systems. His recent projects mainly focus on verifying isolation levels of databases.

  • (5/15, 3:00pm): Mario Nascimento (Northeastern University): Optimal routings and crowdsourcing (and some "work in progress") details)

    TITLE
    Optimal routings and crowdsourcing (and some "work in progress")

    ABSTRACT
    In this seminar I will present three problems involving crowdsourcing and routing and whose solution involve the use of skylines. The main theme among those problems is to present so-called "gig-workers" with optimal alternative routes which optimize two conflicting criteria, e.g., minimizing the cost of a path while maximizing rewards from completing tasks in such a path. I will also introduce two problems that I am currently working on: leveraging large language models for spatially oriented tasks such as trajectory imputation, and how to detect and fix data that causes previously learned models to fail.

    READINGS:
    - C.F. Costa and M.A. Nascimento, Last Mile Delivery Considering Time-Dependent Locations, ACM SIGSPATIAL, 2021. https://dl.acm.org/doi/10.1145/3474717.3483919
    - C.F. Costa and M.A. Nascimento, In-Route Task Selection in Spatial Crowdsourcing, ACM Trans. Spatial Algorithms and Systems, 2020. https://dl.acm.org/doi/10.1145/3368268
    - S. Anwar, F. Lettich, M.A. Nascimento, Towards A Personal Shopper's Dilemma: Time vs Cost. ACM SIGSPATIAL/GIS 2020. https://dl.acm.org/doi/10.1145/3397536.3422276

    BIO
    Mario Nascimento is currently with the Northeastern University’s Khoury College, Vancouver Campus, serving as its first Director of Pacific Northwest Research. His main research interests are in the area of spatiotemporal data management and, according to Google Scholar, my publications have been cited 4,600+ times, yieding an H-index of 33. He is also serving as the PC co-Chair for ACM SIGSPATIAL 2023, and General co-Chair for SSTD 2023. Among many other service-oriented roles, Mario has also served as ACM SIGMOD‘s Information Director, ACM SIGMOD Record‘s Editor and VLDB Journal‘s Editorial Board member.

  • (5/26, 11:00am): Hayden Jananthan (MIT Lincoln Laboratory): Associative Array Algebra of Big Data, Graphs, & Machine Learning details)

    TITLE
    Associative Array Algebra of Big Data, Graphs, & Machine Learning

    ABSTRACT
    Rectangular arrays abound in mathematics and computer science, e.g., spreadsheets, databases, matrices, and graphs (the last by way of incidence or adjacency arrays). Despite none of these instantiations being new, the ever-increasing amount of data means that novel approaches are required for storing, operating on, and interpreting these huge sets of data. Some such approaches include leveraging semirings and sparse linear algebra, as in GraphBLAS and the Dynamic Distributed Dimensional Data Model (D4M). Associative arrays exemplify the marrying of semirings and sparse linear algebra, providing a common interface for the aforementioned varying types of rectangular arrays--generalizing them yet still enforcing familiar and important mathematical properties. The main features of associative arrays that differ from traditional treatments of, say, matrices are the fact that row and column indices may be taken from arbitrary sets (as opposed to finite initial segments of the positive or nonnegative integers) and entries are taken in semirings (as opposed to fields or rings).

    In this seminar I will explore the mathematical theory of semirings and associative array algebra (including both positive and negative results), their use and implementations in GraphBLAS and D4M, and current and recent applications.

    READINGS
    * Associative Arrays: Unified Mathematics for Spreadsheets, Databases, Matrices, and Graphs (https://arxiv.org/abs/1501.05709)
    * Python implementation of the Dynamic Distributed Dimensional Data Model (https://arxiv.org/abs/2209.00602)
    * GraphBLAS Mathematics (https://www.mit.edu/~kepner/GraphBLAS/GraphBLAS-Math-release.pdf)

    BIO
    Dr Hayden Jananthan is a mathematician whose focus is on applying traditionally "pure" mathematics tools to applied contexts, particularly on the theory of associative arrays as a tool to address big data and machine learning problems. More recently, Dr Jananthan has been applying the theory of associative arrays and the study of heavy-tailed statistics to better understand the foundational characteristics of malicious network traffic. In addition to his research, Dr Jananthan is passionate about making pure mathematics more accessible to STEM professionals in general. Dr Jananthan received his PhD from Vanderbilt University, where his doctoral research was focused on computability theory. As recognition for his research and teaching contributions, he received the Bjarni Jonsson Prize for Research and the B. F. Bryant Award for Excellence in Teaching.

  • (6/2, 11:00am): David Bau (Northeastern University): Direct model editing details)

    TITLE
    Direct Model Editing

    ABSTRACT
    Can we understand large deep networks well enough to reprogram them by changing their parameters directly? In this talk I will talk about Direct Model Editing, a framework for understanding and modifying the structure of their internal computations using causal interventions. We will consider examples in computer vision and NLP: I will talk about how to probe and rewrite computations within a StyleGAN to alter compositional rules that govern rendering of realistic images, and I will describe how to edit specific factual knowledge within a large language model, directly tracing and modifying parameters of GPT.

    RELATED WORK
    * Locating and Editing Factual Associations in GPT. https://rome.baulab.info/
    * Mass Editing Memory in a Transformer. Preprint. https://memit.baulab.info/
    * GAN Dissection. PNAS 2020. https://gandissect.csail.mit.edu/
    * StyleSpace Analysis: Disentangled Controls for StyleGAN Image Generation. CVPR 2021. https://github.com/betterze/StyleSpace

    BIO
    David Bau is Assistant Professor at the Northeastern University Khoury College of Computer Science. He received his PhD from MIT and AB from Harvard, and he has previously worked at Google and Microsoft. He is known for his network dissection studies of individual neurons in deep networks and has published research on the interpretable structure of learned computations in large models in PNAS, CVPR, NeurIPS, and SIGGRAPH. Prof. Bau is also coauthor of the textbook, Numerical Linear Algebra.

  • (6/9, 11:00am): Meghyn Bienvenu (CNRS & University of Bordeaux): Ontologies for Data Access and Data Quality details)

    TITLE
    Ontologies for Data Access and Data Quality

    ABSTRACT
    Ontology-mediated query answering (OMQA) is a promising approach to data access and integration that has been extensively studied by the AI and database communities. Ontologies, which are typically specified using logic-based formalisms, serve both to provide a user-friendly vocabulary for query formulation and to capture domain knowledge, which can be exploited by reasoning algorithms to infer implicit information and return more complete query results. While the addition of an ontology brings significant benefits, it also renders the query answering task more computationally involved, which has led both to theoretical studies aimed at understanding the complexity of OMQA for a range of ontology and query languages, as well as the development of new algorithmic techniques.

    I will start this talk by giving a brief introduction to OMQA, outlining the two main algorithmic approaches (query rewriting and materialization), and highlighting some notable research questions and results related to query rewriting. In the second part of the talk, I will discuss data quality in the context of OMQA, in particular, how to handle data inconsistencies.

    RELATED PAPERS
    1. Bienvenu : A Short Survey on Inconsistency Handling in Ontology-Mediated Query Answering. Special issue on "Ontologies and Data Management" of Künstliche Intelligenz 34(4): 443-451. 2020.
    https://www.labri.fr/perso/meghyn/papers/Bienvenu-KI20.pdf

    2. Bienvenu, Kikot, Kontchakov, Podolskiii, Zakharyaschev: Ontology-Mediated Queries: Combined Complexity and Succinctness of Rewritings via Circuit Complexity. Journal of the ACM (JACM) 65:5, 2018.
    https://www.labri.fr/perso/meghyn/papers/BieKikKonPodZak-JACM18.pdf

    3. Bienvenu, ten Cate, Lutz, Wolter: Ontology-based Data Access: A Study Through Disjunctive Datalog, CSP, and MMSNP. ACM Transactions on Database Systems (TODS) 39(4): 33:1 -33:44.
    https://www.labri.fr/perso/meghyn/papers/BieCatLutWol-TODS14.pdf

    BIO
    Meghyn Bienvenu is a CNRS research director at LaBRI (Bordeaux, France). Born in Canada, she obtained her undergraduate degree from the University of Toronto before moving to France to continue her studies at the University of Toulouse. Her PhD thesis, defended in 2009, was awarded the AFIA Prize for best French dissertation in artificial intelligence. Her research interests span a range of topics in knowledge representation and reasoning and database theory, with a main focus on description logic ontologies and their use in querying data. She currently leads an ANR AI Chair on the topic of intelligent handling of imperfect data. Since January 2023, she is co-director of the CNRS National Research Network RADIA (Reasoning, Learning, and Decision in AI). She is an associate editor of Artificial Intelligence Journal (AIJ) and ACM Transactions on Computational Logic (TOCL), and she served as PC co-chair for KR 2021, the leading conference on knowledge representation and reasoning. Her research has been recognized by an invited Early Career Spotlight talk at IJCAI’16 and the 2016 CNRS Bronze Medal in the area of computer science.

  • (6/16, 2:00pm): Kurt Stockinger (Zurich University of Applied Sciences/ZHAW): Building Natural Language Interfaces for Databases with and without AI details)

    TITLE
    Building Natural Language Interfaces for Database with and without AI

    ABSTRACT
    Information systems are the core of modern enterprises and scientific exploration. They are often based on fundamental research developed in the database community. While enterprise data is typically stored in relational databases, data-intensive scientific disciplines such as bioinformatics often store their data in graph databases. To query these databases, end-users need to know the formal query languages SQL or SPARQL as well as the logical structure of the databases. However, even for technology experts it is very challenging to write the right queries to retrieve the desired data. Hence, a large part of the end-users is basically not able to effectively query their databases.

    In this talk we discuss how to build intelligent information systems that enable end-users to talk to their data similar to humans. The major goal is to combine artificial intelligence with human intelligence for novel ways of data exploration. In particular, we will show how we have built various natural language interfaces for databases using pattern-based and machine learning-based approaches to significantly increase the productivity of scientists and knowledge workers when interacting with data. We demonstrate that our system INODE (Intelligent Open Data Exploration), which we have been building as part of a European Union project with 9 partners across Europe, is uniquely accessible to a wide range of users from large scientific communities to the public. Finally, we elaborate on the lessons learned when developing such a system and discuss how the technology can be enhanced by researchers or knowledge workers for exploring their own databases in natural language.

    MAJOR REFERENCES
    * Amer-Yahia, Koutrika, Braschler, Calvanese, Lanti, Lücke-Tieke, ... & Stockinger. INODE: Building an End-to-End Data Exploration System in Practice. ACM SIGMOD Record. 2022.
    https://sigmodrecord.org/?smd_process_download=1&download_id=%2012688
    * Sima, Mendes de Farias, Anisimova, Dessimoz, Robinson-Rechavi, Zbinden, Stockinger. Bio-SODA UX: enabling natural language question answering over knowledge graphs with user disambiguation. Distributed and Parallel Databases. 2022.
    https://link.springer.com/article/10.1007/s10619-022-07414-w
    * Brunner, Stockinger. ValueNet: A Natural Language-to-SQL System that Learns from Database Information. ICDE 2021.
    https://arxiv.org/pdf/2006.00888

    BIO
    Prof. Dr. Kurt Stockinger is Professor of Computer Science, Director of Studies in Data Science at Zurich University of Applied Sciences (ZHAW) and Co-Head of the ZHAW Datalab in Switzerland. He is also an external lecturer at University of Zurich. Kurt Stockingers’ research focuses on Data Science with emphasis on Big Data, Natural Language Query Processing, Query Optimization and Quantum Computing. Essentially, his research interests are at the intersection of databases, natural language processing and machine learning. He is also on the Advisory Board of Callista Group AG. Previously Kurt Stockinger worked at Credit Suisse in Zurich, Switzerland, at Lawrence Berkeley National Laboratory in Berkeley, California, at California Institute of Technology, California as well as at CERN in Geneva, Switzerland. He holds a Ph.D. in computer science from CERN / University of Vienna.

  • (6/30, 10:00am): Dan Suciu (University of Washington): Applications of Information Inequalities to Database Theory Problems details)

    TITLE
    Applications of Information Inequalities to Database Theory Problems

    ABSTRACT
    The talk describes applications of information inequalities to three database problems: upper bounds of a query's output (a.k.a. pessimistic cardinality estimation), the query domination problem (a.k.a. query containment under bag semantics), and the implication problem for approximate integrity constraints. No prior knowledge of query processing or information theory is needed: all required concepts and results from information inequalities are introduced here, gradually, and motivated by database problems.

    RELATED PAPERS
    * Applications of Information Inequalities to Database Theory Problems, 2023. (invited talk at LICS'2023)
    https://arxiv.org/abs/2304.11996
    * Kyle B. Deeds, Dan Suciu, Magdalena Balazinska: SafeBound: A Practical System for Generating Cardinality Bounds. Proc. ACM Manag. Data 1(1): 53:1-53:26 (2023)
    https://dl.acm.org/doi/pdf/10.1145/3588907
    * Mahmoud Abo Khamis, Phokion G. Kolaitis, Hung Q. Ngo, Dan Suciu: Bag Query Containment and Information Theory. ACM Trans. Database Syst. 46(3): 12:1-12:39 (2021)
    https://dl.acm.org/doi/abs/10.1145/3472391
    * Batya Kenig, Dan Suciu: Integrity Constraints Revisited: From Exact to Approximate Implication. Log. Methods Comput. Sci. 18(1) (2022)
    https://drops.dagstuhl.de/opus/volltexte/2020/11942/pdf/LIPIcs-ICDT-2020-18.pdf
    * Batya Kenig, Pranay Mundra, Guna Prasaad, Babak Salimi, Dan Suciu: Mining Approximate Acyclic Schemes from Relations. SIGMOD Conference 2020: 297-312
    https://dl.acm.org/doi/pdf/10.1145/3318464.3380573

    BIO
    Dan Suciu is a Microsoft Endowed Professor in the Paul G. Allen School of Computer Science and Engineering at the University of Washington. Suciu is conducting research in data management, on topics such as query optimization, probabilistic data, data pricing, parallel data processing, data security. He is a co-author of two books Data on the Web: from Relations to Semistructured Data and XML, 1999, and Probabilistic Databases, 2011. He received the ACM SIGMOD Codd Innovation Award, received several best paper awards and test of time awards, and is a Fellow of the ACM. Suciu is currently an associate editor for the Journal of the ACM. Suciu's PhD students Gerome Miklau, Christopher Re and Paris Koutris received the ACM SIGMOD Best Dissertation Award in 2006, 2010, and 2016 respectively, and Nilesh Dalvi was a runner up in 2008.

Spring 2023 (Neha)

  • (1/13, 11:00am): Remy Wang (University of Washington): Relational Programming details)

    TITLE
    Relational Programming

    ABSTRACT
    Modern data analytics demands capabilities beyond those offered by traditional data management systems. On one hand, today’s data come in many different forms, be it graphs, matrices, or tensors; on the other, analysis tasks are becoming more and more complex, often involving complex operations running iteratively. These new demands call for new ways of working with data and innovative data analytic systems. In this talk, I make the case for relational programming, an emerging programming paradigm for modern data analytics. I describe a new programming language, a new algorithm for the relational join, and optimization techniques for relational programs. I conclude with my vision for a relational future of programming.

    RELATED PAPERS
    1) Optimizing Recursive Queries with Program Synthesis (https://arxiv.org/abs/2202.10390)
    2) Free Join: Unifying Worst-Cast Optimal and Traditional Joins (in submission)
    3) SPORES: Sum-Product Optimization via Relational Equality Saturation for Large Scale Linear Algebra (http://www.vldb.org/pvldb/vol13/p1919-wang.pdf)

    BIO
    Yisu Remy Wang is a PhD student at University of Washington. He works at the intersection of Databases and Programming Languages. He develops languages, theory and tools to help data analysts conduct complex tasks efficiently. His research has been recognized with best paper awards from PODS, POPL and OOPSLA.

  • (1/20, 11:00am): Antonia Kormpa (University of Oxford): On the Complexity of Joins on Interval Data details)

    TITLE
    On the Complexity of Joins on Interval Data

    ABSTRACT 
    Interval data abound in spatio-temporal databases, as they represent spatial objects or time periods. This work established the complexity of Boolean Conjunctive Queries (IJs) with Intersection Joins, which are joins often applied to spatio-temporal databases. It was found that IJs can be solved by known algorithms for Boolean Conjunctive Queries. This can be accomplished by reducing the former problem to the latter. Additionally, this work showed that a reverse reduction could be used to derive lower bounds as well. Last but not least, a new acyclicity class emerged through this work to characterise syntactically the IJs that can be computed in linear time (up to logarithmic factors).
     
    RELATED PAPERS
    This talk is based on the article with title "The Complexity of Boolean Conjunctive Queries with Intersection Joins" which was published in PODS 2022.(https://dl.acm.org/doi/abs/10.1145/3517804.3524156)
     
    BIO 
    Antonia Kormpa has studied computer engineering and informatics at the University of Patras, Greece. She is currently pursuing a PhD in computer science at the University of Oxford, UK. Her research is about the complexity of join queries on interval data.

  • (1/27, 11:00am): Ibrahim Sabek (MIT): Machine Learning Enhanced Query Scheduling and Execution Operations in Database Systems details)

    TITLE
    Machine Learning Enhanced Query Scheduling and Execution Operations in Database Systems

    ABSTRACT
    Database systems have traditionally relied on empirical approaches and handcrafted rules that encode human intuitions or heuristics to store large-scale data and process user queries over them. These well-tuned empirical approaches and rules work well for the general-purpose case, but are seldom optimal for any actual application because they are not tailored for the specific application properties (e.g., user workload patterns). Further, they fail to consider the complex interaction with the environment running the database systems (e.g., hardware and operating system). In this talk, I will show how we leverage machine learning to instance-optimize the performance of the query execution engine, the heart of any database system, using efficient learned models, data structures, and algorithms for three essential operations: query scheduling, hashing, and join processing. 
     
    RELATED PAPERS
    1. Learned query scheduling: https://dl.acm.org/doi/abs/10.1145/3514221.3526158
    2. Learned hashing: https://dl.acm.org/doi/10.14778/3570690.3570702
    3. Learned in-memory joins: https://arxiv.org/abs/2111.08824  

    BIO
    Ibrahim Sabek is a postdoctoral associate at the MIT Data Systems Group and an NSF/CRA Computing Innovation Fellow, working with Tim Kraska and Michael Cafarella. Before that, he completed his Ph.D. at the University of Minnesota, Twin Cities, under the supervision of Mohamed Mokbel. Ibrahim is interested in building the next generation of machine learning-empowered data management, processing, and analysis systems. His research focuses on deeply understanding machine learning and systems techniques, resulting in entirely new designs, algorithms, and data structures for data-intensive systems and applications. His research areas broadly include instance-optimized database systems, scalable knowledge base construction, big spatial data management and analysis, and data processing in containerization environments. For more information about Ibrahim, please check his website: http://people.csail.mit.edu/ibrahimsabek/

  • (2/8, 11:00am): Batya Kenig (Technion): Acyclic Schemas and Information Theory details)

    TITLE
    Acyclic Schemas and Information Theory
     
    ABSTRACT
    Acyclic schemas possess known benefits for database design, speeding up queries, and reducing space requirements. An acyclic join dependency (AJD) is lossless with respect to a universal relation if joining the projections associated with the schema results in the original universal relation.
    An intuitive and standard measure of loss entailed by an AJD is the number of redundant tuples generated by the acyclic join. 
    In this talk, I will show how the loss associated with an AJD can be characterized 
    by an information-theoretic measure, which gives rise to a principled definition of approximation for acyclic schemas. I will then describe how this definition was applied in Maimon: a system for automatically discovering approximate acyclic schemas that hold in the data.  
     
    RELATED PAPERS:
    1. Mining Approximate Acyclic Schemas from Relations (SIGMOD 2020) (https://dl.acm.org/doi/pdf/10.1145/3318464.3380573)
    2. Quantifying the loss of Acyclic Join Dependencies (To Appear in PODS 2023) (https://arxiv.org/pdf/2210.14572.pdf) 
    Code for MAIMON can be found here:
    https://github.com/orglassman/MAIMON

    BIO
    Batya Kenig is an assistant professor in the faculty of data and decision sciences at the Technion. Previously, she was a postdoctoral research associate at the database group at the University of Washington. Her research interests include probabilistic reasoning, uncertain data, and enumeration algorithms.

  • (2/10, 11:00am): Neha Makhija (Northeastern): Toward a Unified Solution for Reverse Data Management details)

    TITLE
    Toward a Unified Solution for Reverse Data Management 
     
    ABSTRACT
    The notion of Reverse Data Management captures various problems in database theory such as view maintenance, deletion propagation, and various interventions for fairness. We study Resilience: a key algorithmic problem that underlies the various forms of reverse data management. Resilience asks the question, "What is the minimal number of tuples to delete from a database in order to remove all answers from a query?" A long-open question is determining those conjunctive queries (CQs) for which this problem can be solved in guaranteed PTIME. We shed new light on this and the related problem of Causal Responsibility by proposing and studying the properties of a unified Integer Linear Programming (ILP) formulation.

    Our algorithm is unified in that it can solve both prior studied restrictions (e.g., self-join-free CQs under set semantics that allow a PTIME solution) and new cases (e.g., all CQs under set or bag semantics). It is also unified in that all queries and all instances are treated with the same approach, and the algorithm is guaranteed to terminate in PTIME for the easy cases. Concretely, we prove that for all easy self-join-free CQs, the Linear Programming (LP) relaxation of our encoding is identical to the ILP solution and thus standard ILP solvers are guaranteed to return the solution in PTIME. Our approach opens the door to new variants and new fine-grained analysis: 1) It also works under bag semantics, and we give the first dichotomy result for bags semantics in the problem space. 2) We give a more fine-grained analysis of the complexity of causal responsibility revealing new PTIME cases. 3) We recover easy instances for generally hard queries, such as instances with read-once provenance and instances that become easy because of Functional Dependencies in the data. 4) We solve an open conjecture from PODS 2020 about a structural hardness criterion. 5) Experiments confirm that our results indeed predict the asymptotic running times, and that our universal ILP encoding is at times even faster to solve for the PTIME cases than a prior proposed dedicated flow algorithm. 6) Our technique shifts the perspective from finding PTIME algorithms for open problems to using a unified algorithm and resolving open complexity questions by “merely” proving that the algorithm terminates in PTIME.
     
    RELATED PAPERS:
    1. Neha Makhija, Wolfgang Gatterbauer. "A Unified Approach for Resilience and Causal Responsibility with Integer Linear Programming (ILP) and LP Relaxations". Arxiv Preprint (https://arxiv.org/abs/2212.08898)

    BIO
    Neha Makhija is a PhD student at the DataLab at Northeastern University, advised by Prof. Wolfgang Gatterbauer. Her research aims to find efficient algorithms and borders of tractability for problems in reverse data management and minimal representations of knowledge.

  • (2/15, 10:30am): Fatemeh Nargesian (University of Rochester): Data Acquisition for AI on Lakes of Data details)

    TITLE
    Data Acquisition for AI on Lakes of Data
     
    ABSTRACT
    Data for AI is increasingly reliant on the integration of multiple sources – sometimes obtained from public data or data marketplaces. Despite decades of research in data integration and cleaning, we are still not sure how to construct AI-ready datasets – data with relevant features and representative distribution. In this talk, first, I will describe how to discover relevant datasets from large-scale data repositories, by designing (semantic) similarity measures and efficient index structures. Next, I will show how to use ideas from randomized algorithms to tailor a dataset with desired distribution requirements from multiple sources, in order to construct fair datasets. We will also see how to obtain an IID sample over normalized data, to improve the efficiency of model training and approximate query answering. Finally, I will conclude by discussing distribution-aware and human-centric data acquisition.

    BIO 

    Fatemeh Nargesian is an assistant professor of computer science at the University of Rochester. She obtained her PhD at the University of Toronto. Her research interests are in data management for AI-based data analytics and scientific time-series management. Her work has appeared at top-tier venues including VLDB, SIGMOD, and ICDE, and has won the best demo award of VLDB 2017. 

  • (2/22, 4:00pm): Bernhard Scholz (University of Sydney): Commercial-Grade Static Analyzers in Datalog details)

    TITLE:
    Commercial-Grade Static Analyzers in Datalog 
     
    ABSTRACT:

    Software development and quality assurance teams use static analyzers to find bugs and adhere to industry standards. Static analyzers must rapidly adapt to various programming styles and project requirements. In addition, they must scale to process large code bases with adequate precision. Unfortunately, implementing commercial-grade static analyzers is tedious, error-prone, time-consuming, and costly.

    This talk will outline the rapid implementation of commercial-grade static analyzers in modern Datalog languages and present some industrial use cases. We will discuss the performance challenges of modern Datalog engines used for implementing static analyzers and techniques to make static analyzers scale and truly declarative.

    RELATED PAPERS:

    Souffle Papers can be found here but there are also many other Souffle papers out there: https://souffle-lang.github.io/publications

    BIO:
     
    Bernhard Scholz is a Full Professor in the School of Computer Science at The University of Sydney. His research interests are in Programming Language Research. Prof Scholz is the creator of Souffle, a Datalog language that is mainly used for static program analysis by companies such as Oracle to find security bugs in Java JDK and check the correctness of Amazon’s VPC networks.

    Prof Scholz co-founded the SUN Microsystems Labs in Brisbane. He co-initiated the bug-checking tool Parfait, a de facto tool used by thousands of Oracle developers for bug and vulnerability detection in real-world, commercially sized C/C++/Java applications. Prof Scholz introduced the Partitioned Boolean Quadratic Problem (PBQP) for compiler construction with his students. It is employed by compilers such as LLVM for register allocation.

    Before joining The University of Sydney, he worked for the Technical University of Vienna and the University of Vienna in academic/research roles. He has held various visiting professorships at the University of Victoria, BC, Canada, Yonsei University, South Korea, at the Sun Microsystems Laboratories, and Oracle Labs. He has taken a leave of absence to work full-time for the blockchain foundation Fantom as a chief research officer. Prof. Scholz’s blockchain research focuses on smart-contract execution, security, and performance.

  • (2/24, 11:00am): Amir Gilad (Duke University): Enhancing Database Applications With Causality details)

    TITLE
    Enhancing Database Applications With Causality
     
    ABSTRACT

    Causal inference is at the heart of multiple branches of science as means of drawing conclusions and assisting in decision-making. In database research, notions from causal inference have already proved useful in several applications, such as providing explanations for query results. In this talk, I will describe additional useful database applications that can also benefit from a causality-informed approach. In particular, I will present two recent works that incorporate the advantages of causal inference into two well-established database uses: hypothetical reasoning and differentially-private synthetic data generation. I will begin by discussing our work, called HypeR, a framework for hypothetical reasoning in databases that uses a causal inference approach to capture fine-grained hypothetical updates. I will proceed by presenting our recent work, PreFair, a framework for synthetic data generation that is both differentially-private and fair, where fairness is modeled by an adaptation of the causal definition for justifiable fairness.  
     
    RELATED PAPERS
     
    1. Sainyam Galhotra, Amir Gilad, Sudeepa Roy, Babak Salimi-  HypeR: Hypothetical Reasoning With What-If and How-To Queries Using a Probabilistic Causal Approach- SIGMOD '22(https://dl.acm.org/doi/10.1145/3514221.3526149)
    2. David Pujol, Amir Gilad, Ashwin Machanavajjhala - PreFair: Privately Generating Justifiably Fair Synthetic Data- To appear in VLDB '23(https://arxiv.org/abs/2212.10310)

    BIO

    Amir Gilad is currently a postdoctoral researcher in the Database Group at Duke University and will join The School of Computer Science and Engineering at the Hebrew University as an Assistant Professor at the end of this year. 
    Amir has received his Ph.D. in Computer Science from Tel Aviv University. His work focuses on developing tools and algorithms that assist users in understanding and gaining insights into data and the systems that manipulate it. His research relates to classic database tools such as data provenance, as well as natural language processing, causal inference, and privacy. Amir is the recipient of the VLDB best paper award, the SIGMOD research highlight award, and the Google Ph.D. fellowship for Structured Data and Database Management. 

  • (3/3, 12:00pm): Grace Fan, Aamod Khatiwada (Northeastern University): Table Discovery and Integration in Data Lakes: Challenges and Solutions details)

    TITLE:
    Table Discovery and Integration in Data Lakes: Challenges and Solutions
     
    ABSTRACT:
    Data scientists use tabular data to support decision-making processes, train machine learning models, perform statistical analysis, and more. The enormous size and heterogeneity of data lakes, together with the fact that metadata may be missing or unreliable, make it challenging for data scientists to find suitable tables for their analyses. In this talk, we will cover recent results that help data scientists efficiently and effectively find tables from the data lake that union with their query table. We will also discuss new results on how to best integrate the discovered tables particularly when the discovered tables may be incomplete and simple union is insufficient.
    First, we will present SANTOS, a recent technique that finds tables from data lakes that can union with a query table given by the user. SANTOS introduces a new notion of unionability that considers relationships between columns, together with the semantics of columns, in a principled way. To do so, we present two methods to discover semantic relationships between pairs of columns, using existing and data-driven synthesized knowledge bases. We next present our work on Starmie, which learns the semantics of columns in a natural language domain by using the context of the entire table. Moreover, Starmie is the first to employ the HNSW (Hierarchical Navigable Small World) index to accelerate query processing of table union search.
    Lastly, we will cover another recent work called ALITE (Align and Integrate) that integrates the tables that may have been discovered using join, union, or related table search. Specifically, ALITE aligns the columns of discovered tables using holistic schema matching and then applies the Full Disjunction over the matching columns to get an integrated table.
     
    RELATED WORK:
    1) Aamod Khatiwada, Grace Fan, Roee Shraga, Zixuan Chen, Wolfgang Gatterbauer, Renée J. Miller, Mirek Riedewald: SANTOS: Relationship-based Semantic Table Union Search. To appear in SIGMOD 2023 (arXiv)
    2) Grace Fan, Jin Wang, Yuliang Li, Dan Zhang, Renée J. Miller: Semantics-aware Dataset Discovery from Data Lakes with Contextualized Column-based Representation Learning. To appear in PVLDB 2023 (arXiv)
    3) Aamod Khatiwada, Roee Shraga, Wolfgang Gatterbauer, Renée J. Miller: Integrating Data Lake Tables. Proc. VLDB Endow. 16(4): 932-945 (2022) (https://www.vldb.org/pvldb/vol16/p932-khatiwada.pdf)
     
    BIO:
    Grace Fan is a third year PhD student at the Data Lab at Northeastern University, advised by Professor Renée J. Miller. Her research interests include data discovery and data integration in data lakes. 
    Aamod Khatiwada is a third year PhD Student at the Data Lab at Northeastern University, advised by Professor Renée J. Miller. His research aims to develop novel techniques and algorithms that can find and integrate open data tables in a principled way.

  • (3/24, 11:00am): Dimitris Koutsoukos (ETH Zurich): Tensors: An abstraction for data processing details)

    TITLE
    Tensors: An abstraction for data processing
     
    ABSTRACT
    Deep Learning (DL) has created a growing demand for simpler ways to develop complex models and efficient ways to execute them. Thus, a significant effort has gone into frameworks like PyTorch or TensorFlow to support a variety of DL models and run efficiently and seamlessly over heterogeneous and distributed hardware. Since these frameworks will continue improving given the predominance of DL workloads, it is natural to ask what else can be done with them. This is not a trivial question since these frameworks are based on the efficient implementation of tensors, which are well adapted to DL but, in principle, to nothing else. In this paper, we explore to what extent Tensor Computation Runtimes (TCRs) can support non-ML data processing applications, so that other use cases can take advantage of the investments made on TCRs. In particular, we are interested in graph processing and relational operators, two use cases very different from ML, in high demand, and complement quite well what TCRs can do today. Building on Hummingbird, a recent platform converting traditional machine learning algorithms to tensor computations, we explore how to map selected graph processing and relational operator algorithms into tensor computations. Our vision is supported by the results: our code often outperforms custom-built C++ and CUDA kernels, while massively reducing the development effort, taking advantage of the cross-platform compilation capabilities of TCRs.


     
    RELATED PAPERS:
    1) He, Dong, et al. "Query processing on tensor computation runtimes." arXiv preprint arXiv:2203.01877 (2022) (https://arxiv.org/abs/2203.01877).
    2) Müller, Ingo, et al. "The collection Virtual Machine: an abstraction for multi-frontend multi-backend data analysis." Proceedings of the 16th International Workshop on Data Management on New Hardware. 2020 (https://doi.org/10.1145/3399666.3399911).
    3) Koutsoukos, Dimitrios, et al. "Modularis: modular relational analytics over heterogeneous distributed platforms." arXiv preprint arXiv:2004.03488 (2020) (https://arxiv.org/abs/2004.03488).


    BIO 
     
    Dimitris Koutsoukos is a PhD student in the Systems Group at ETH Zurich, supervised by Gustavo Alonso and Ana Klimovic. He is broadly interested in database systems and more specifically on how we can execute database workloads on heterogeneous architectures (RDMA, serverless, NVM) and adapt the next generation of DBMSs in the cloud. More recently, he is looking into using compiler infrastructure like MLIR for databases and how we can optimize data layouts for serverless databases. He holds a MSc in Data Science from ETH Zurich. Before coming to ETH, he obtained my undergraduate degree in Electrical and Computer Engineering from the National Technical University of Athens. In the past, he has worked for Meta, Orfium, Maastricht University and he has collaborated with Microsoft Gray Systems Lab.

  • (4/7, 11:00am): Dakuo Wang (Northeastern University): Designing Human-Centered AI Systems for Human-AI Collaboration details)

    TITLE
    Designing Human-Centered AI Systems for Human-AI Collaboration

    ABSTRACT
    Human-Centered AI (HCAI) refers to the research effort that aims to design and implement AI techniques to support various human tasks, while taking human needs into consideration and preserving human control. Prior work has focused on human-AI interaction interface design and explainable AI research (XAI). However, despite these fruitful research results, why do many so-called “human-centered” AI systems still fail in the real world? In this talk, I will discuss the human-AI collaboration interaction paradigm, and show how we can learn from human-human collaboration to design and build AI systems that lead to this interaction paradigm. This work serves as a cornerstone towards the ultimate goal of Human-AI Collaboration, where AI and humans can take complementary and indispensable roles to achieve a better outcome and experience.

    RELATED PAPERS
    * Human-AI Collaboration in Data Science: Exploring Data Scientists’ Perceptions of Automated AI. CSCW 2019. https://arxiv.org/pdf/1909.02309.pdf
    * Telling Stories from Computational Notebooks: AI-Assisted Presentation Slides Creation for Presenting Data Science Work. CHI 2022. https://arxiv.org/pdf/2203.11085.pdf
    * Model Sketching: Centering Concepts in Early-Stage Machine Learning Model Design. CHI 2023. https://arxiv.org/pdf/2303.02884.pdf

    BIO
    Dr. Dakuo Wang is an Associate Professor at Northeastern University and Visiting Scholar at Stanford University. His research lies at the intersection of human-computer interaction (HCI), artificial intelligence (AI), and computer-supported team collaboration (CSCW), with a focus on the exploration, development, and evaluation of human-centered AI (HCAI) systems. The overarching research goal is to democratize AI for every person and every organization, so that they can easily access AI and collaborate with AI to accomplish real-world tasks better -- the “human-AI collaboration” paradigm. Before joining Northeastern, Dr.Wang worked as a research lead at IBM Research, and a principal investigator at MIT-IBM Watson AI Lab. He got his Ph.D. from the University of California Irvine (“how people write together now” co-advised by Judith Olson and Gary Olson). He has worked as a designer, researcher, and engineer in the U.S., China, and France. He has served in various organizing committees, program committees, and editorial boards for conferences and journals, and ACM has recognized him as an ACM Distinguished Speaker.

Fall 2022 (Nikos)

  • (9/29, 11:45am): Boris Glavic (Illinois Institute of Technology): Interpretable Data-Based Explanations for Fairness Debugging details)

    TITLE
    Interpretable Data-Based Explanations for Fairness Debugging

    ABSTRACT
    A wide variety of fairness metrics and eXplainable Artificial Intelligence (XAI) approaches have been proposed in the literature to identify bias in machine learning models that are used in critical real-life contexts. However, merely reporting on a model’s bias or generating explanations using existing XAI techniques is insufficient to locate and eventually mitigate sources of bias. In this presentation, I will introduce the audience to Gopher, a system that produces compact, interpretable, and causal explanations for bias or unexpected model behavior by identifying coherent subsets of the training data that are root-causes for this behavior. Gopher is based on the concept of causal responsibility that quantifies the extent to which intervening on training data by removing or updating subsets of it can resolve the bias. Intuitively, if removing a subset of the training data significantly reduces bias, then this subset of the training data bears strong responsibility for the observed bias. Building on this concept, I will introduce an efficient approach for generating the top-k patterns that explain model bias. This approach utilizes techniques from the machine learning (ML) community to approximate causal responsibility, and uses pruning rules to manage the large search space for patterns. The experimental evaluation of Gopher demonstrated the effectiveness of the system in generating interpretable explanations for identifying and debugging sources of bias.

    RELATED PAPERS
    https://dl.acm.org/doi/10.1145/3514221.3517886

    BIO
    Boris received a Diploma (Master) in Computer Science from the RWTH Aachen in Germany, and a PhD in Computer Science from the University of Zurich in Switzerland being advised by Michael Böhlenand Gustavo Alonso. Afterwards, he spent to two years as a PostDoc in the Department of Computer Science at the University of Toronto working at the Database Research Group under Renée J. Miller.

  • (10/3, 11:45am): Oktie Hassanzadeh (IBM): Knowledge Graphs for Data Exploration and Data Science details)

    TITLE 
    Knowledge Graphs for Data Exploration and Data Science

    ABSTRACT
    In this seminar, I will present a number of prototype data management solutions developed at IBM Research that leverage knowledge graphs for large-scale data exploration, data curation, and data science. In particular, I will describe an application of knowledge graphs for large-scale data exploration and curation in the context of the IBM Research Helix project, where large-scale similarity analysis and automated linkage discovery enable the discovery and construction of data sets from a large number of data sets (or a data lake) for a given analytics task. I will then present our work on building a toolkit that uses knowledge graphs for automating machine learning tasks. I will show how mapping tabular data to knowledge graphs can assist with feature engineering, and the challenges we faced in implementing our prototype solution. Finally, I will present present a number of techniques we experimented with for semantic annotation of tabular data, and our ongoing work on evaluation of table annotation systems as a part of the annual ISWC SemTab Challenge (https://www.cs.ox.ac.uk/isg/challenges/sem-tab/).

    RELATED PAPERS
    1. Exploring Big Data with Helix: Finding Needles in a Big Haystack, Jason B. Ellis, Achille Fokoue, Oktie Hassanzadeh, Anastasios Kementsietsidis, Kavitha Srinivas, Michael Jeffrey Ward. SIGMOD Rec. 43(4): 43-54 (2014)
    https://sigmodrecord.org/publications/sigmodRecord/1412/pdfs/09_industry_Ellis.pdf
    2. Automated Feature Enhancement for Predictive Modeling using External Knowledge. Sainyam Galhotra, Udayan Khurana, Oktie Hassanzadeh, Kavitha Srinivas, Horst Samulowitz, Miao Qi. ICDM Workshops 2019: 1094-1097
    https://udayankhurana.com/wp-content/uploads/2019/12/KBFeature_Demo-3.pdf
    3. Matching Web Tables with Knowledge Base Entities: From Entity Lookups to Entity Embeddings. Vasilis Efthymiou, Oktie Hassanzadeh, Mariano Rodriguez-Muro, Vassilis Christophides. ISWC (1) 2017: 260-277
    https://iswc2017.ai.wu.ac.at/wp-content/uploads/papers/MainProceedings/98.pdf

    BIO
    Dr. Oktie Hassanzadeh is a Senior Research Staff Member at IBM T.J. Watson Research Center. He is the recipient of several academic and corporate awards, including a top prize at the FinCausal-2022 Shared Task, a top prize at the Semantic Web Challenge at ISWC conference, and two best-paper awards at ESWC conferences. He has received his M.Sc. and Ph.D. degrees from the University of Toronto, where he received the IBM PhD fellowship and the Yahoo! Key Scientific Challenges awards. He is also a two-time recipient of the first prize at the Triplification Challenge at the SEMANTiCS Conference for his projects in the areas of Semantic Technologies and Linked Data. For more information, refer to his home page: http://researcher.watson.ibm.com/person/us-hassanzadeh

  • (10/7, 1:00pm): Philippe Cudre-Mauroux (University of Fribourg): Human-AI Techniques to Handle Open-Ended Answers details)

    TITLE
    Human-AI Techniques to Handle Open-Ended Answers

    ABSTRACT
    Collecting answers to open-ended questions is essential for many applications.For instance, collecting the names of social influencers is useful for brandmarketing and opinion mining. Another example of an open-ended task is theextraction of rationales from documents for topic classification, which ishelpful in literature review and contextual search. Many efforts have beendedicated to automatically curating such data through ML models. However, thesemethods have limited performance since responding to open-ended questionsrequires intuition, domain knowledge, and reasoning abilities, which are stilllargely missing in state-of-the-art ML methods. In this talk, I will presentseveral human-AI collaborative approaches for handling open-ended data. Overall,our frameworks consist of a human computation component and an AI model thatinteract with each other. The human computation component and the AI model areupdated iteratively, allowing their learning processes to benefit from eachother until an agreement on the quality of the answers is reached. Thus, theinteraction between the human computation component and the AI model isbidirectional, which is fundamental to ensuring the effectiveness of thehuman-AI team.

    RELATED PAPERS
    - Ines Arous, Jie Yang, Mourad Khayati, andPhilippe Cudre-Mauroux. ”Opencrowd: A human-AI collaborative approach forfinding social influencers via open-ended answers aggregation.” WWW 2020. https://exascale.info/assets/pdf/arous2020www.pdf

    - Ines, Arous, JieYang, Mourad Khayati, and Philippe Cudre-Mauroux. ”Peer grading the peerreviews: a dual-role approach for lightening the scholarly paper reviewprocess.” WWW 2021. https://exascale.info/assets/pdf/arous2021www.pdf

    - Ines Arous, LjiljanaDolamic, Jie Yang, Akansha Bhardwaj, Giuseppe Cuccu, and PhilippeCudre-Mauroux. ”Marta: Leveraging human rationales for explainable textclassification.” AAAI 2021. https://exascale.info/assets/pdf/arous2021aaai.pdf

    BIO
    Philippe Cudre-Mauroux is a Full Professor and the Director of the eXascaleInfolab at the University of Fribourg in Switzerland. He received his Ph.D.from the Swiss Federal Institute of Technology EPFL, where he won both theDoctorate Award and the EPFL Press Mention in 2007. Before joining theUniversity of Fribourg, he worked on information management infrastructures atIBM Watson (NY), Microsoft Research Asia and Silicon Valley, and MIT. Herecently won the Verisign Internet Infrastructures Award, a Swiss NationalCenter in Research award, a Google Faculty Research Award, as well as a 2million Euro grant from the European Research Council. His research interestsare in next-generation, Big Data management infrastructures for non-relationaldata and AI. Webpage: http://exascale.info/phil

  • (10/20, 11:45am): Fatemeh Nargesian (University of Rochester): Responsible Data Integration details)

    TITLE
    Responsible Data Integration

    ABSTRACT
    Data integration has been extensively studied by the data management community and is a core task in the data pre-processing step of ML pipelines. When the integrated data is used for analysis and model training, responsible data science requires addressing concerns about data quality and bias. This talk revolves around the synergies of data integration and data responsibility, highlighting the existing efforts in responsible data integration. In this talk, first, I will describe the requirements to be considered for evaluating and auditing data integration tasks for bias. Second, I present a summary of data integration tasks that elicit attention to data responsibility measures and the existing methods to satisfy these requirements. Finally, we dive deeper into distribution-aware data integration as a tool for satisfying these constraints. 

    RELATED PAPERS
    Fatemeh Nargesian, Abolfazl Asudeh, H. V. Jagadish:
    Responsible Data Integration: Next-generation Challenges. SIGMOD Conference 2022: 2458-2464
    https://doi.org/10.1145/3514221.3522567

    BIO
    Fatemeh Nargesian is an assistant professor in the Department of Computer Science, at the University of Rochester. She got her PhD at the University of Toronto and was a research intern at IBM Watson in 2014 and 2016. Before the University of Toronto, she worked at Clinical Health and Informatics Group at McGill University. Her primary research interests are in data intelligence focused on data for ML as well as time-series analysis. 

  • (10/21, 1:00pm): Sudeepa Roy (Duke): Interpretable and Actionable Data Analysis with Explanations and Causality details)

    TITLE
    Interpretable and Actionable Data Analysis with Explanations and Causality

     

    ABSTRACT
    In current times, data is considered synonymous with knowledge, profit, power, and entertainment, requiring development of new techniques to extract useful information and insights from data. In this talk, I will describe some concepts and techniques in interpretable data analysis from the viewpoint of a database researcher. First, I will talk about our work on explaining query answers, in terms of "intervention" or how changes in the input data changes the output of a query, and "context" or how input data not contributing to the answers of interest can help explain them. In particular, I will describe a recent work on an explanation framework called DPXPlain that preserves Differential Privacy for sensitive data. Then I will talk about true causal inference from observational data without randomized controlled experiments and how database techniques can help with causal inference for large complex data.

     

    RELATED PAPERS

    https://www.vldb.org/pvldb/vol16/p113-tao.pdf

    https://dl.acm.org/doi/abs/10.5555/3546258.3546289

     

    BIO
    Sudeepa Roy is an Associate Professor of Computer Science at Duke University. She works broadly in data management, with a focus on foundational aspects of big data analysis, including causality and explanations, data provenance, uncertain data, data repair, query optimization, and database theory. Prior to joining Duke in 2015, she did a postdoc at the University of Washington, and obtained her Ph.D. from the University of Pennsylvania. She is a recipient of an NSF CAREER Award, the Very Large Databases (VLDB) Early Career Research Contributions Award, and a Google Ph.D. fellowship in structured data. She co-directs the Almost Matching Exactly (AME) lab for interpretable causal inference at Duke (https://almost-matching-exactly.github.io/). 

  • (10/28, 1:00pm): Eriq Augustine (UC Santa Cruz): Collective Grounding: Applying Database Research to SRL Scalability details)

    TITLE

    Collective Grounding: Applying Database Research to SRLScalability

    ABSTRACT

    The process of instantiating, or "grounding", a first-order model is a fundamental component of reasoning in logic.
    It has been widely studied in the context of theorem proving, database theory, and artificial intelligence.
    Within the relational learning community, it is required for instantiating a distribution over possible worlds and is often the key computational bottleneck.
    In this work, we investigate how key concepts from relational database theory can be utilized to improve the computational efficiency of the grounding process.
    We introduce the notion of collective grounding, which treats logical programs not as a collection of independent rules, but instead as a joint set of interdependent workloads that can be shared.
    We introduce the theoretical concept of collective grounding, the components necessary in a collective grounding system, implementations of these components, and how to use database theory to speed up these components.
    We demonstrate collective groundings effectiveness on seven popular datasets, and show up to an 80\% reduction in runtime using collective grounding.

    RELATED PAPERS

    Hinge-Loss Markov Random Fields and Probabilistic Soft Logic
    https://linqs.org/publications/#id:bach-jmlr17

    A Comparison of Bottom-Up Approaches to Grounding for Templated Markov Random Fields
    https://linqs.org/publications/#id:augustine-mlsys18

    Tuffy: Scaling up Statistical Inference in Markov Logic Networks using an RDBMS
    http://i.stanford.edu/hazy/papers/tuffy-vldb11.pdf

    BIO

    Eriq Augustine is a doctoral candidate with the LINQS labheaded by Dr. Lise Getoor at the University of California at Santa Cruz.
    Eriq works at the intersection of relational information (e.g., logic) and machinelearning with a focus on scalability, data management, and systems engineering.
    Before starting at UCSC, Eriq was a lecturer at the California StatePolytechnic University at San Luis Obispo, a senior software developer at GaineSolutions (a master data management startup), a software developer at Google,and a software developer at Netflix.
    eriqaugustine.com

     

  • (11/3, 11:45am): Brit Youngmann (MIT): Data Tools for Accelerated Scientific Discoveries details)

    TITLE
    Data Tools for Accelerated Scientific Discoveries

    ABSTRACT
    Causal inference is fundamental to empirical research in natural and social sciences and is essential for scientific discoveries. Two key challenges for conducting causal inference are (i) acquiring all attributes required for the analysis, and (ii) identifying which attributes should be included in the analysis. Failing to include all necessary attributes may lead to false discoveries and erroneous conclusions. However, in real-world settings, analysts may only have access to partial data. Further, to identify which attributes should be included in the analysis, analysts critically rely on domain knowledge, often given in the form of a causal DAG. However, such domain knowledge is often unavailable and cannot be fully recovered from data. In this talk we will present two works that address these challenges by leveraging data management techniques and ideas.

    RELATED PAPERS
    On Explaining Confounding Bias

    BIO
    Brit is a postdoc researcher at CSAIL MIT, working with Prof. Michael Cafarella. She received her Ph.D. at Tel-Aviv University under the supervision of Prof. Tova Milo. Her research is centered around informative and responsible data science and causal analysis. Brit is the recipient of several awards, including the data science fellowship for outstanding Ph.D. students of the planning and budgeting committee of the Israeli council for higher education (VATAT), the Schmidt postdoctoral award for women in mathematical and computing sciences, and the planning and budgeting committee of the Israeli council for higher education (VATAT) postdoctoral scholarship in Data Science. She served on multiple program committees, including at the SIGMOD and ICDE conferences.

  • (11/7, 11:45am): Jin Wang & Zhengjie Miao (Megagon Labs) details)

    TALK 1:

    TITLE
    Contrastive Learning for Data IntegrationApplications

    ABSTRACT
    In this talk, I will introduce Sudowoodo, acontrastive representation learning based framework for Data Integration as themain use case. Contrastive learning enables Sudowoodo to learn similarity-awaredata representations from a large corpus of data items, e.g., entity entries,without using any labels. The learned representations can later be eitherdirectly used or facilitate fine-tuning with only a few labels to support bothblocking and matching stages of Entity Matching. Besides, it can also supportother applications such as data cleaning and semantic type detection. Ourexperiment results show that our proposed method achieves multiplestate-of-the-art results on different levels of supervision and outperformsprevious best specialized blocking or matching solutions for EM. Sudowoodo alsoachieves promising results in data cleaning and semantic type detection tasksshowing its versatility in multiple  applications. 

    BIO
    Jin Wang is aresearch scientist in Megagon Labs. Before that, he obtained thePh.D. degree from the Computer Science Department, University of California,Los Angeles in 2020 and the master degree from Tsinghua Univeristy in 2015. Hisresearch interests lies in the broad areas of database, machine learning andinformation retrieval. In particular, his research focuses on Database QueryLanguage, Database System and using database technologies to boost theapplications in data integration and natural language processing.

    TALK 2:

    TITLE
    Understanding Queries by Examples

    ABSTRACT
    Data analytics skills have become anindispensable part of education and modern workforce, relational queries, theessential tools of the trade for manipulating and analyzing structured data,are still notoriously challenging to learn and debug even for people withconsiderable experience in programming. In this talk, I will introduce works inthe HNRQ project (Helping Novices learn and debug Relational Queries), whichprovide users with small counterexample database instances pointing out errorsin the user query and allow users to trace how the query executes, based ontheoretical foundations for data provenance and incomplete data. The practicalsystem for debugging database queries was successfully deployed at DukeUniversity and used by more than 1000 students in introductory databasecourses.

    BIO
    Zhengjie Miao is a research scientist at Megagon Labs. He is broadly interestedin data management and analysis, with a focus on developing algorithms andtools to democratize data science. He obtained his Ph.D. in Computer Sciencefrom Duke University. He was a finalist for the Microsoft Research Ph.D.Fellowship.

  • (11/8, 2:00pm): Estevam Hruschka & Eser Kandogan : Megagon Labs Research Overview: Data- and Knowledge-Intensive AI/NLP in HR domain details)

    TITLE 
    Megagon Labs Research Overview: Data- andKnowledge-Intensive AI/NLP in HR domain

    ABSTRACT
    Real-world AI applications, particularly in theHuman Resources (HR) domain, demand fairness, factuality, controllability,consistency, interpretability, and reasoning. At Megagon Labs we investigate,innovate, propose and build symbiotic systems that mutually exploit languagemodels (LM), knowledge graphs (KG) and Human-In-The-Loop approaches in acontinuous and (semi-) automated learning paradigm to address these challenges.In this talk we will show some of the work we are doing on building nextgeneration knowledge graphs for the HR domain that count on Intelligent agentsand  humans in the loop to drive many AI applications, such as matchingcandidates to jobs, developing deep understanding of skills and occupations,identifying skills gaps and recommending career opportunities.

    BIO
    Estevam Hruschka is the (interim) Head ofResearch at Megagon Labs where he leads the efforts to advance state-of-the-artresearch in AI, ML, NLP, knowledge representation and data management and tobuild technologies that impact the world through online services. Prior toMegagon Labs, he was co-leader of NELL (the Never-Ending Language LearningSystem - http://rtw.ml.cmu.edu). Also, hewas an associate professor of computer science at the Federal University of SaoCarlos (Brazil), Fellow of the Brazilian National Research Agency, YoungResearch Fellow of the Sao Paulo State Research Agency (in Brazil), GoogleResearch Award recipient (for Latin America) and Adjunct Professor at CarnegieMellon University (Pittsburgh, PA). His work focuses on both theoretical andapplied problems and he is mainly interested in how to build intelligentcomputer systems capable of deeply understanding Natural Language in aNever-Ending Learning approach. Between 2017 and 2020, he was with Amazon inSeattle, WA, where he helped to build the Alexa Search Team from scratch.

    Eser Kandogan is Headof Engineering at Megagon Labs. Between 2000-2019, he worked as a researchstaff member at IBM Almaden Research Center, conducting research on visualanalytics, human-computer interaction, computer-supported cooperative work,semantic search, data science, search, and graphs. At IBM he contributed toseveral IBM products and patents in data and systems management areas. Prior toIBM, he worked at Silicon Graphics (SGI) in data mining and visualizationgroup. He holds a Ph.D. degree from the University of Maryland in ComputerScience. He is a co-author of Taming Information Technology by OxfordUniversity Press.

  • (11/11, 1:00pm): Fatma Özcan (Google): Semantic Advancement of Data for Data Analytics details)

    TITLE
    Semantic Advancement of Data for Data Analytics


    ABSTRACT
    In this work, we use semantic knowledge sources, such as cross-domain knowledge graphs (KGs) and domain-specific ontologies, to enrich structured data for various anaytics applications. By enriching our understanding of the underlying data with semantics brought in from external ontologies and KGs, we can better interpret the data as well as the queries to answer more questions, provide more complete answers, and deal with entity disambiguation. To semantically enrich the data with external knowledge sources, we need to find the correspondences between the structured data and the entities in the cross-domain KGs and/or the domain-specific ontologies. In this presentation, we break this problem into several steps, and provide solutions for each step. We showcase the practical value of semantic enrichment of data using our proposed techniques in entity disambiguation, natural language querying and conversational interfaces to data, query relaxation, with promising initial results.


    RELATED PAPERS

    1. J. Hao, C. Lei, V. Efthymiou, A. Quamar, F. Özcan, Y. Sun, and W. Wang, “MEDTO: Medical Data to Ontology Matching Using Hybrid Graph Neural Networks”, KDD 2021
    2. A. Vretinaris, C. Lei, V. Efthymiou, X. Qin, and F. Özcan,Medical Entity Disambiguation Using Graph Neural Networks”, SIGMOD 2021
    3. C. Lei, V. Efthymiou, R.Geis, and F. Özcan, “Expanding Query Answers on Medical Knowledge Bases”, EDBT 2020 (industrial)
    4. A. Quamar, F. Özcan, D. Miller, R. Moore, R. Niegus, and J. Kreulen, “Conversational BI: An Ontology-Driven Conversation System for Business Intelligence Applications”, PVLDB Volume 13, No 12, August 2020
    5. J. Sen,  C. Lei,  A. Quamar, F. Özcan, V. Efthymiou, A. Dalmia, G. Stager,  A. Mittal,  D. Saha, K. Sankaranarayanan, "ATHENA++: Natural Language Querying for Complex Nested SQL Queries", PVLDB Volume 13, No 11, August 2020

    BIO
    Fatma Özcan is a Principal Engineer at Systems Research@Google. Before that, she was a Distinguished Research Staff Member and a senior manager at IBM Almaden Research Center. Her current research focuses on platforms and infra-structure for large-scale data analysis, query processing and optimization of semi-structured data, and democratizing analytics via NLQ and conversational interfaces to data. Dr Özcan got her PhD degree in computer science from University of Maryland, College Park, and her BSc degree in computer engineering from METU, Ankara. She has over 21 years of experience in industrial research, and has delivered core technologies into IBM products. She has been a contributor to various SQL standards, including SQL/XML, SQL/JSON and SQL/PTF. She is the co-author of the book "Heterogeneous Agent Systems", and co-author of several conference papers and patents. She received the VLDB Women in Database Research Award in 2022. She is an ACM Distinguished Member, and the vice chair of ACM SIGMOD. She has been serving on the board of directors of CRA (ComputingResearch Association) since 2020, and is a steering committee member of the CRA-Industry.

  • (11/14, 11:45am): Mahdi Esmailoghli (Leibniz Universität Hannover) details)

    TITLE
    MATE:Multi-Attribute Table Extraction 

    ABSTRACT
    Acore operation in data discovery is to find joinable tables for a given table.Real-world tables include both unary and n-ary join keys. However, existingtable discovery systems are optimized for unary joins and are ineffective andslow in the existence of n-ary keys. In this paper, we introduce Mate, a tablediscovery system that leverages a novel hash-based index that enables n-aryjoin discovery through a space-efficient super key. We design a filtering layerthat uses a novel hash, Xash. This hash function encodes the syntactic featuresof all column values and aggregates them into a super key, which allows thesystem to efficiently prune tables with non-joinable rows. Our join discoverysystem is able to prune up to 1000𝑥 more false positives and leads to over 60𝑥 faster table discovery incomparison to state-of-the-art.

    RELATED PAPERS
    MATE: Multi-Attribute Table Extraction  VLDB 2022
    https://www.vldb.org/pvldb/vol15/p1684-esmailoghli.pdf

    BIO
    MahdiEsmailoghli is a Ph.D. student at Leibniz Universität Hannover in the DatabaseSystems Group (DBS) under the supervision of Prof. Ziawasch Abedjan. He startedhis Ph.D. at the Big Data Management group (BigDaMa) at TU Berlin with Prof.Abedjan. Before joining DBS and BigDaMa group, he finished his master's at theAmirkabir University of Technology in Tehran. His area of research is scalabledata discovery for machine learning.

  • (11/18, 1:00pm): Amine Mhedhbi (University of Waterloo): Taming Large Intermediate Results for Joins over Graph-Structured Relations: A System Perspective details)

    TITLE
    Taming Large Intermediate Results for Joins over Graph-Structured Relations: A System Perspective

    ABSTRACT
    Querying graph structured relations, i.e., those with many-to-many relationships between entities, is ubiquitous and integral to a wide range of analytical applications such as recommendations on social networks and fraud detection in financial transactional networks. These queries are heavy on joins and lead to an explosion in the size of intermediate results. In this talk, I cover the details of the integration of two important modern techniques into an analytical DBMS to optimize for many-to-many joins. These two techniques are: (i) novel worst-case-optimal join algorithms that adopt column-at-a-time over the traditional table-at-a-time join paradigm; and (ii) factorized query processing, which is a technique to compress the intermediate and final results of queries. I will present a comprehensive query processor design and implementation that goes over various research questions tackled to make the integration possible and these techniques practical. Our query processor is capable of having a robust query plan space with many efficient plans that lead to orders of magnitude speedups. 


    RELATED PAPERS
    1) Optimizing One-time and Continuous Subgraph Queries using Worst-Case Optimal Joins [PDF] [Code] [BibTeX]
    Amine Mhedhbi, Chathura Kankanamge, Semih Salihoglu
    In ACM Transactions on Database Systems (TODS) 2021.
     
    2) Columnar Storage and List-based Processing for Graph Database Management Systems [PDF] [Code] [BibTeX]
    Pranjal Gupta, Amine Mhedhbi, Semih Salihoglu
    In Proceedings of the VLDB Endowment (PVLDB) 2021.


    BIO
    Amine Mhedhbi is a Ph.D. student at the University of Waterloo. His research focuses on building data systems capable of efficient processing of graph data. To that end, he led the design and implementation of the GraphflowDB system to rethink core database components. He received the VLDB Best Paper Award in 2018 and his research is supported by a Microsoft Ph.D. Research Fellowship.

  • (12/2, 1:00pm): Julia Stoyanovich (NYU): Fairness in Ranking: from values to technical choices and back details)

    TITLE
    Fairness in Ranking: from values to technical choices and back

    ABSTRACT
    Algorithmic rankers take a collection of candidates as input and produce a ranking (permutation) of the candidates as output. In the past few years, there has been much work on incorporating fairness requirements into rankers, with contributions from the data management, algorithms, information retrieval, and recommender systems communities. Many fair ranking methods establish their own understanding of fairness, yet they stop short of categorizing themselves with respect to the normative frameworks they embed. This makes it difficult to understand what values these methods encode, in what specific scenarios they are applicable, and how they relate to one another. Further, methods differ on which technical choices they make, including, notably, their data representation choices. These technical differences are not value-neutral, and have profound implications on the applicability of the methods. In my talk, I will give an overview of the field of fairness in raking, offering a perspective that connects formalizations and algorithmic approaches across subfields. My perspective will be based on the interplay between the value frameworks that motivate specific fairness-enhancing interventions, and the technical choices that impact the properties of the methods and of their results. I will compare and contrast several representative fair ranking methods, and will provide concrete recommendations for those wishing to incorporate fairness objectives into algorithmic rankers. My talk will be based on a two-part survey that I co-authored with Meike Zehlike and Ke Yang.

    RELATED PAPERS
    The survey was recently published in ACM Computing Surveys, and is available at these links:
    1. Part I https://dl.acm.org/doi/abs/10.1145/3533379
    2. Part II https://dl.acm.org/doi/10.1145/3533380

    BIO
    Julia Stoyanovich is an Institute Associate Professor of Computer Science & Engineering at the Tandon School of Engineering, Associate Professor of Data Science at the Center for Data Science, and Director of the Center for Responsible AI at New York University (NYU). Her research focuses on responsible data management and analysis: on operationalizing fairness, diversity, transparency, and data protection in all stages of the data science lifecycle. She established the "Data, Responsibly" consortium and served on the New York City Automated Decision Systems Task Force, by mayoral appointment. Julia developed and has been teaching courses on Responsible Data Science at NYU, and is a co-creator of an award-winning comic book series on this topic. In addition to data ethics, Julia works on the management and analysis of preference and voting data, and on querying large evolving graphs. She holds M.S. and Ph.D. degrees in Computer Science from Columbia University, and a B.S. in Computer Science and in Mathematics & Statistics from the University of Massachusetts at Amherst. She is a recipient of an NSF CAREER award and a Senior Member of the ACM.

  • (12/9, 1:00pm): Yifan Wu (ex-UC Berkeley): Interaction History for Building Human Data Interfaces details)

    TITLE
    Interaction History for Building Human Data Interfaces

    ABSTRACT
    History provides context for the present. In the same way, past user interactions provide context for present explorations. This thesis investigates ways to reify user interaction history to address emerging challenges in the design and programming of human data interfaces.
    We leverage interaction history in three different but connected ways. First is to enhance the design of interactions with delays, such as when working with remote databases. We use interaction history as a visual anchor to facilitate concurrent interactions, which ameliorate the cognitive burdens caused by delays. Second is to facilitate the programming of interactive visualizations involving asynchronous communication with remote databases. We capture event histories as a first-class programming construct, allowing the developer to declaratively specify what data to compute and how to update the state of the user interface. This way, developers avoid the low-level details of accessing remote data and coordinating events. The third and last way is to use interaction history to create a bridge between interaction design and programming. We capture and reify interaction history in both computational notebooks and interactive visualizations. Affordances on these reified histories help data scientists move fluidly between the two mediums.

    RELATED PAPERS
    https://ieeexplore.ieee.org/abstract/document/9331306
    https://ieeexplore.ieee.org/abstract/document/9552893
    https://dl.acm.org/doi/pdf/10.1145/3379337.3415851

    BIO
    Yifan Wu graduated from UC Berkeley in 2020, advised by Joe Hellerstein, and in collaboration with Arvind Satyanarayan, Eugene Wu, and Remco Chang. After graduation, Yifan founded and left a MLOps startup.

  • (12/16, 1:00pm): Maria Luisa Sapino (University of Torino): Leveraging tensor train decomposition to deal with uncertainty in multidimensional data details)

    TITLE
    Leveraging tensor train decomposition to deal with uncertainty in  multidimensional data

    ABSTRACT
    In many domains, multidimensional data are encoded as multi-dimensional arrays, or Tensors, and traditional data mining techniques might be limited due to the curse of dimensionality – the memorization space is indeed exponential in the number of dimensions. Tensor decomposition techniques alleviate this issue, in that they allow to represent tensors in terms of a reduced number of parameters, thus reducing the space required to describe the relevant information. Commonly used tensor decomposition algorithms include CP-decomposition (which seeks a diagonal core) and Tucker-decomposition (which seeks a dense core). Naturally, Tucker maintains more information, but due to the denseness of the core, it also is subject to exponential memory growth with the number of tensor modes. Tensor train (TT) decomposition addresses this problem by seeking a sequence of three-mode cores.
    I will present the main ideas underlying tensor train decomposition and show how its properties can be leveraged to deal with uncertain information (both uncertainty modelled probabilistically, and uncertainty in terms of intervals, ranges, of possible values). Moreover, I will show how knowledge about data characteristics and relationships potentially defined on the tensor modes can be leveraged to guide the decomposition order, to improve the effectiveness of the decomposition.

    RELATED PAPERS
    Mao-Lin Li, K. Selçuk Candan, Maria Luisa Sapino
    GTT: Leveraging data characteristics for guiding the tensor train decomposition.
    Inf. Syst. 108: 102047 (2022)
    https://scholar.google.com/scholar?cluster=9452783075522390103
     
    Francesco Di Mauro; K. Selcuk Candan; Maria Luisa Sapino
    Tensor-Train Decomposition in Presence of Interval-Valued Data
    IEEE Transactions on Knowledge and Data Engineering, 2021
    https://scholar.google.com/scholar?cluster=9373470206136825
     
    Xinsheng Li, K. Selçuk Candan, Maria Luisa Sapino:
    Noise Adaptive Tensor Train Decomposition for Low-Rank Embedding of Noisy Data.
    SISAP 2020: 203-217
    https://scholar.google.com/scholar?cluster=8066453498002786335

    BIO

    Maria Luisa Sapino is a Professor at the University of Torino and, since 2006, Adjunct Professor at the School of Computing and Augmented Intelligence, SCAI, at the Arizona State University. At ASU she is also affiliated with the Center for Assured and Scalable Data Engineering (CASCADE). Her initial contributions to computer science were in the area of logic programming and artificial intelligence, specifically in the semantics of negation in logic programming, and in the abductive extensions of logic programs. Since mid-90s she has been applying these techniques to the challenges associated with data security and access control, and with heterogeneous data management. Her research contributions are in the area of  data management and analysis, with strong emphasis on tackling the so called "Big Data challenges", including aspects related to the development of efficient techniques for tensor based big data analysis,  pattern detection in voluminous and heterogeneous data collections,  with special focus on various aspects related to indexing, classification and querying of (possibly multivariate) time series. Maria Luisa Sapino co-authored more than 100 research papers, most of them published on top ranked journals and conference proceedings. Her recent research includes interdisciplinary projects focusing on securing and optimizing health and infrastructure (study of the infectious disease propagation, study of real-time evacuation solutions in case of disaster management) as well as cyber-physical systems (building energy systems) that can benefit from "smart data oriented" fundamental technological innovations.

Summer 2022 (Wolfgang)

  • (5/17, 1:00pm): Roee Shraga (Northeastern University): HumanAL: Calibrating Human Matching Beyond a Single Task details)

    TITLE:
    HumanAL: Calibrating Human Matching Beyond a Single Task

    ABSTRACT
    This work offers a novel view on the use of human input as labels, acknowledging that humans may err. We build a behavioral profile for human annotators which is used as a feature representation of the provided input. We show that by utilizing black-box machine learning, we can take into account human behavior and calibrate their input to improve the labeling quality. To support our claims and provide a proof-of-concept, we experiment with three different matching tasks, namely, schema matching, entity matching and text matching. Our empirical evaluation suggests that the method can improve the quality of gathered labels in multiple settings including cross-domain (across different matching tasks).

    RELEVANT PAPER:
    https://arxiv.org/abs/2205.03209

    SHORT BIO
    Roee Shraga is a Postdoctoral fellow at the Khoury College of Computer Science at Northeastern University in Boston. He received his PhD degree in 2020 in the area of Data Science. Roee has published more than a dozen papers in leading journals and conferences and is a recipient of several PhD fellowships including the Miriam and Aaron Gutwirth Memorial Fellowship (2020).
    https://sites.google.com/view/roee-shraga/

  • (6/3, 12:00pm): Mike Cafarella (MIT): Data Systems for Economic Discovery details)

    TITLE
    Data Systems for Economic Discovery

    ABSTRACT
    Macroeconomics resembles a data science task that predates modern data science, and arguably predates digital computing as well. It is possible to make economics easier and to make novel discovery through application of modern data systems and data science methods. This talk will describe several areas for data management to make contributions to economics, in particular recent work on the crucial task of creating high-quality price indices, which describe inflation and consumption.

    RELEVANT PAPERS
    https://www.nber.org/system/files/working_papers/w20010/w20010.pdf
    http://www-personal.umich.edu/~shapiro/papers/Price_Quantity_Scale.pdf
    https://web.eecs.umich.edu/~michjc/papers/p145-antenucci.pdf

    BIO
    Michael Cafarella is a Principal Research Scientist at MIT CSAIL. His research interests include databases, information extraction, data integration, and data mining. He has published extensively in venues such as SIGMOD, VLDB, and elsewhere. Mike received his PhD from the University of Washington, Seattle, in 2009 with advisors Oren Etzioni and Dan Suciu. From 2010 to 2020, he was on the faculty of Computer Science and Engineering at the University of Michigan. His academic awards include the NSF CAREER award, the Sloan Research Fellowship, and the 2018 VLDB Ten-Year Best Paper award. In addition to his academic work, he costarted (with Doug Cutting) the Hadoop open-source project, which is widely used at Facebook, Yahoo!, and elsewhere. In 2015 he cofounded (with Chris Re and Feng Niu) Lattice Data, Inc, which is now part of Apple.
    https://www.csail.mit.edu/person/michael-cafarella

  • (7/7, 12:00pm): Ioanna Tsalouchidou (Amazon): Scalable Dynamic Graph Summarization details)

    TITLE
    Scalable Dynamic Graph Summarization

    ABSTRACT
    Large-scale dynamic interaction graphs can be challenging to process and store, due to their size and the continuous change of communication patterns between nodes. In this work, we address the problem of summarizing large-scale dynamic graphs, while maintaining the evolution of their structure and interactions. Our approach is based on grouping the nodes of the graph in supernodes according to their connectivity and communication patterns. The resulting summary graph preserves the information about the evolution of the graph within a time window. We propose two online algorithms for summarizing this type of graphs. Our baseline algorithm kC based on clustering is fast but rather memory expensive. The second method we propose, named /LC, reduces the memory requirements by introducing an intermediate step that keeps statistics of the clustering of the previous rounds. Our algorithms are distributed by design, and we implement them over the Apache Spark framework, so as to address the problem of scalability for large-scale graphs and massive streams. We apply our methods to several dynamic graphs, and show that we can efficiently use the summary graphs to answer temporal and probabilistic graph queries.

    RELATED PAPERS
    1. https://link.springer.com/article/10.1007/s10618-016-0468-8
    2. https://www.sciencedirect.com/science/article/pii/B9780127224428500161
    3. https://cs-people.bu.edu/evimaria/papers/Social-net.pdf

    BIO
    Ioanna Tsalouchidou received the PhD degree from the Department of Information and Communication Technologies, University Pompeu Fabra, in 2018. She is currently working at Redshift AWS as a software engineer. Her main interests are data bases, geospatial systems, spatial algorithms and data mining, with an emphasis on large-scale and dynamic graphs.
    https://www.amazon.science/author/ioanna-tsalouchidou

  • (7/8, 12:00pm): Emanuel Sallinger (TU-Vienna): Fantastic Knowledge Graph Embeddings and How to Find the Right Space for Them details)

    TITLE
    Fantastic Knowledge Graph Embeddings and How to Find the Right Space for Them:
    The Quest for Hybrid Logic- and ML-based Reasoning on Knowledge Graphs

    ABSTRACT
    In this talk, we are going to start by exploring “fantastic knowledge graph embeddings and how to find the right space for them” [1]. The starting point of our quest, namely that of hybrid logic- and ML-based reasoning in Knowledge Graphs (KGs), is simple:
    During the last few years, a large number of KG embedding (KGE) models have been devised for handling machine learning problems for KGs. Interestingly, many models capable of inferring relational patterns, such as symmetry or transitivity, show lower performance in practice than those not able to, with the factors contributing to this largely unknown. In this part, we are going to show how to solve this puzzle, and practically improve models [1].
    This brings us to the second part of our quest: considering e.g. symmetry or transitivity mentioned above, we of course know reasoning methodologies that are very effective at handling them, namely logic-based KG reasoners (building upon the scalability of techniques from the data management community). Yet can we bring KGE-based and logic-based methods together in meaningful ways? This is a challenge that a large part of our community is currently looking at, and we present interesting parts of this quest including how to improve logic-based reasoning using embeddings in the Vadalog system [2].
    We round up the talk by looking at the general picture of Knowledge Graph Management Systems (KGMSs), with a focus on the Vadalog system, and its various applications including anti-money laundering, hostile takeover prevention during crises, etc., as time allows [3].

    RELATED PAPERS
    1. ”Fantastic Knowledge Graph Embeddings and How to Find the Right Space for Them”
    https://link.springer.com/chapter/10.1007/978-3-030-62419-4_25
    2. http://ceur-ws.org/Vol-3135/EcoFinKG_2022_paper8.pdf
    3. https://www.sciencedirect.com/science/article/abs/pii/S0306437920300351

    BIO
    Emanuel Sallinger is Assistant Professor at TU Wien and Departmental Lecturer at Oxford University. He leads the Knowledge Graph Lab at TU Wien, the SIG Knowledge Graphs of the Center for AI and ML, and is a member of the Databases and AI group (DBAI). At Oxford University, he is lecturer of the courses “Knowledge Graphs” and “Database Design”. Before joining TU Wien, he directed the VADA (Value-Added Data Systems and Architecture) Laboratory at Oxford University for over six years. The Knowledge Graph Lab (formally “Scalable Reasoning in Knowledge Graphs”) is a WWTF Vienna Research Group, an ERC-sized funding program for creating long-term research groups. The VADA project was an EPSRC programme grant under Georg Gottlob, bringing together the universities of Oxford (directed by Emanuel Sallinger), Manchester (headed by Norman Paton), and Edinburgh (headed by Leonid Libkin).
    https://kg.dbai.tuwien.ac.at/

  • (7/19, 12:00pm): Matthias Lanzinger (Oxford University): RAISON DATA: Some Perspectives on the Intersection of Data, AI, and Logic details)

    TITLE
    RAISON DATA: Some Perspectives on the Intersection of Data, AI, and Logic

    ABSTRACT
    In the RAISON DATA project, we investigate possible interactions between data, AI, and logic. After a brief overview of motivations and goals for the project, followed by the presentation of two recent publications and ongoing follow-up work.

    Part I: Deriving value and insights from data is a central task in industry and extensively studied in data management research. While most work in this area has focused on analysis using complex Machine Learning models and other forms of statistical analysis, there is a renewed interest in methods for deriving symbolic rather than statistical knowledge from data. We investigate the complexity of inductively (exactly) learning guarded rules from examples and discuss ongoing work on applying outcomes from this analysis to relational databases.

    Part II: Current trends in data management introduce various data from sources that label data points with a degree of certainty. Common examples include data obtained by crowd-sourced data, AI-generated observations, and KG-completion/link detection techniques. These developments raise new challenges for data management and the need for formalisms that can account for degrees of certainty in data. To this end, we investigate Datalog with fuzzy logic semantics. In the accompanying paper, we show the possibility of semantics based on linear programming for one variant of fuzzy logic Datalog and how this provides one way of introducing fuzzy Datalog+/-. Additionally, we discuss ongoing work on extending the framework from the paper and connections to other recent developments in rule-based reasoning.

    RELATED PAPERS
    * On the Complexity of Inductively Learning Guarded Clauses (AAAI'22)
    https://www.aaai.org/AAAI22Papers/AAAI-1260.DraghiciA.pdf
    * MV-Datalog+-: Effective Rule-based Reasoning with Uncertain Observations (ICLP'22)
    https://arxiv.org/abs/2202.01718

    BIO
    Matthias Lanzinger is a senior postdoctoral researcher at the University of Oxford. He obtained his Ph.D. from TU Wien under the supervision of Reinhard Pichler. His main research interests are the complexity of reasoning, with a focus on database query languages, as well as interactions between logic and machine learning systems.
    https://www.jesus.ox.ac.uk/about-jesus-college/our-community/people/dr-matthias-lanzinger/

  • (7/22, 10:00am): Paolo Papotti (EURECOM): Explainable Checking for Factual Claims details)

    TITLE
    Explainable Checking for Factual Claims

    ABSTRACT
    Misinformation is an important problem but fact checkers are overwhelmed by the amount of false content that is produced online every day. To support fact checkers in their efforts, we are creating data driven verification methods that assess textual claims and explain their decisions. However, using structured data for facts expressed in natural language is a complex task. In this talk, we first discuss how logical rules can be used to verify claims in a checking task modeled as a probabilistic inference problem. We show that this form of reasoning can also be “learned” by transformer-based language models. Experiments show that both methods enable the labeling of claims with interpretable explanations.

    RELATED WORK
    * Naser Ahmadi, Joohyung Lee, Paolo Papotti, Mohammed Saeed:
    Explainable Fact Checking with Probabilistic Answer Set Programming. TTO 2019
    https://arxiv.org/pdf/1906.09198
    * Mohammed Saeed, Naser Ahmadi, Preslav Nakov, Paolo Papotti:
    RuleBERT: Teaching Soft Rules to Pre-Trained Language Models. EMNLP (1) 2021: 1460-1476
    https://arxiv.org/pdf/2109.13006

    BIO
    Paolo Papotti is an Associate Professor at EURECOM, France since 2017. He got his PhD from Roma Tre University (Italy) in 2007, and had research positions at the Qatar Computing Research Institute (Qatar) and Arizona State University (USA). His research is focused on data integration and information quality. He has authored more than 100 publications and his work has been recognized with two "Best of the Conference" citations (SIGMOD 2009, VLDB 2016), two best demo paper awards at SIGMOD (2015, 2022), and two Google Faculty Research Awards (2016, 2020). He is associate editor for the ACM Journal of Data and Information Quality (JDIQ).
    https://www.eurecom.fr/~papotti/

  • (7/26, 10:30am): Anton Tsitsulin (Google): Benchmarking graph learning algorithms with synthetic data details)

    TITLE
    Benchmarking graph learning algorithms with synthetic data

    ABSTRACT
    Graph learning algorithms have attained state-of-the-art performance on many graph analysis tasks such as node classification, link prediction, and clustering. It has, however, become hard to track the field's burgeoning progress. One reason is due to the very small number of datasets used in practice to benchmark the performance of graph learning algorithms. This shockingly small sample size (~10) allows for only limited scientific insight into the problem. In this talk, we try to address this deficiency with synthetic data. We generate synthetic graphs and study the behaviour of graph learning algorithms in controlled scenarios. We develop a fully-featured synthetic graph generator that allows deep inspection of different models. We argue that synthetic graph generation allows for thorough investigation of algorithms and provides more insights than overfitting on three citation datasets. In the case study, we show how our framework provides insight into unsupervised and supervised graph neural network models. In this talk, we will explore the benefits, challenges, and insights from synthetic data benchmarking.

    RELATED PAPERS
    * Tsitsulin, Rozemberczki, Palowitch, Perozzi: Synthetic Graph Generation to Benchmark Graph Learning. 2022
    https://arxiv.org/abs/2204.01376
    * Palowitch, Tsitsulin, Mayer, Perozzi: GraphWorld -- fake graphs bring real insights for GNNs. 2022.
    https://arxiv.org/abs/2203.00112
    * Tsitsulin, Palowitch, Perozzi, Mueller: Graph Clustering with Graph Neural Networks. 2020
    https://arxiv.org/abs/2006.16904

    BIO
    Anton Tsitsulin has been a research scientist at Google's graph mining team since 2021. His research is in scalable graph learning algorithms, particularly in unsupervised learning and axiomatic representations. Before joining Google, he received his PhD degree from the university of Bonn advised by Prof. Emmanuel Müller.
    http://tsitsul.in/

  • (7/28, 12:00pm): Bas Ketsman (VUB: Vrije Universiteit Brussel): Datalog with Negation and Monotonicity details)

    TITLE
    Datalog with Negation and Monotonicity

    ABSTRACT
    Positive Datalog has several nice properties that are lost when the language is extended with negation. One example is that fixpoints of positive Datalog programs are robust w.r.t. the order in which facts are inserted, which facilitates efficient evaluation of such programs in distributed environments. A natural question to ask, given a (stratified) Datalog program with negation, is whether an equivalent positive Datalog program exists.

    In this context, it is known that positive Datalog can express only a strict subset of the monotone queries, yet the exact relationship between the positive and monotone fragments of semi-positive and stratified Datalog was previously left open. In this paper, we complete the picture by showing that monotone queries expressible in semi-positive Datalog exist which are not expressible in positive Datalog. To provide additional insight into this gap, we also characterize a large class of semi-positive Datalog programs for which the dichotomy ‘monotone if and only if rewritable to positive Datalog’ holds. Finally, we give best-effort techniques to reduce the amount of negation that is exhibited by a program, even if the program is not monotone.

    RELATED WORK
    * Ketsman, Koch. Datalog with Negation. ICDT 2020.
    https://drops.dagstuhl.de/opus/volltexte/2020/11943/pdf/LIPIcs-ICDT-2020-19.pdf
    * Ameloot, Ketsman, Neven, Zinn. Weaker Forms of Monotonicity for Declarative Networking: A More Fine-Grained Answer to the CALM-Conjecture, TODS 2016.
    https://doi.org/10.1145/2809784

    BIO
    Bas Ketsman is assistant professor at the Vrije Universiteit Brussel (VUB). His research focuses on formal aspects of large-scale data systems. He obtained his Ph.D. from Hasselt University in 2017 under the advice of Frank Neven and was a postdoctoral researcher at the Swiss Federal Institute of Technology in Lausanne (EPFL) where he worked with Christoph Koch. Bas' research has been awarded best paper awards at the ACM PODS Conference (the year 2014 and 2015), the 2018 European Association for Theoretical Computer Science distinguished dissertation award, and also the 2019 SIGMOD Jim Gray Doctoral Dissertation Award Honorable Mention.
    https://www.basketsman.com/

  • (7/29, 12:00pm): Atri Rudra (University of Buffalo): Worst-case Optimal Binary Join Algorithms under General Lp Constraints details)

    TITLE
    Worst-case Optimal Binary Join Algorithms under General L_p Constraints

    ABSTRACT
    Worst-case optimal join algorithms have so far been studied in two broad contexts – (1) when we are given input relation sizes [Atserias et al., FOCS 2008, Ngo et al., PODS 2012, Velduizhen et. al, ICDT 2014] (2) when in addition to size, we are given a degree bound on the relation [Abo Khamis et al., PODS 2017]. To the best of our knowledge, this problem has not been studied beyond these two statistics even for the case when input relations have arity (at most) two.

    In this paper, we present a worst-case optimal join algorithm when are given ` p -norm size bounds on input relations of arity at most two for p ∈ (1,2]. (p = 1 corresponds to relation size bounds and p = ∞ correspond to the degree bounds.) The worst-case optimality holds any fixed p ∈ (2,∞) as well (as long as the join query graph has large enough girth). Our algorithm is simple, does not depend on p (or) the ` p -norm bounds and avoids the (large) poly-log factor associated with the best known algorithm PANDA [Abo Khamis et al., PODS 2017] for the size and degree bounds setting of the problem. In this process, we (partially) resolve two open question from [Ngo, 2018 Gems of PODS]. We believe our algorithm has the potential to pave the way for practical worst-case optimal join algorithms beyond the case of size bounds.

    RELATED WORK:
    https://arxiv.org/pdf/2112.01003

    BIO
    Atri Rudra is a Professor of Computer Science and Engineering at University at Buffalo, SUNY. Atri received his bachelor's degree from the Indian Institute of Technology, Kharagpur, India in 2000, and his Ph.D. from the University of Washington in 2007. From 2000 to 2002, he was a Research Staff Member at IBM India Research Lab, New Delhi, India.
    His current research interests include structured linear algebra, issues at the intersection of society and computing and database algorithms. He is a recipient of an NSF CAREER Award (2009), an HP Labs Innovation Research Award (2010), an ESA Best Paper Award (2010), a UB Exceptional Scholars-Young Investigator Award (2011), PODS Best Paper Awards (2012 and 2016), an IBM Faculty Award (2013), SIGMOD research highlights (2016), ACM PODS Alberto O. Mendelzon Test-of-Time Award (2022) and an ICML Outstanding Paper Runner Up Award (2022). He is a co-editor of the Mozilla Teaching Responsible Computing playbook and has won a UB Teaching Innovation award (2021) and a SUNY Chancellor’s Award for Excellence in Teaching (2022).
    https://cse.buffalo.edu/faculty/atri/

  • (8/5, 12:00pm): Abolfazl Asudeh (University of Illinois Chicago): Scalable Signal Reconstruction for a Broad Range of Applications details)

    TITLE
    Scalable Signal Reconstruction for a Broad Range of Applications

    ABSTRACT
    Signal reconstruction problem (SRP) is an important opti- mization problem where the objective is to identify a solu- tion to an underdetermined system of linear equations that is closest to a given prior. It has a substantial number of applications in diverse areas, such as network traffic engineering, medical image reconstruction, acoustics, astronomy, and many more. Unfortunately, most of the common approaches for solving SRP do not scale to large problem sizes. We propose a novel and scalable algorithm for solving this critical problem. Specifically, we make four major contributions. First, we propose a dual formulation of the problem and develop the Direct algorithm that is significantly more efficient than the state of the art. Second, we show how adapting database techniques developed for scalable similarity joins provides a substantial speedup over Direct. Third, we describe several practical techniques that allow our algorithm to scale—on a single machine—to settings that are orders of magnitude larger than previously studied. Finally, we use the database techniques of materialization and reuse to extend our result to dynamic settings where the input to the SRP changes. Extensive experiments on real- world and synthetic data confirm the efficiency, effective- ness, and scalability of our proposal.

    RELATED PAPERS
    * Asudeh, Abolfazl, et al. Scalable algorithms for signal reconstruction by leveraging similarity joins. The VLDB Journal 29.2 (2020): 681-707.
    https://link.springer.com/article/10.1007/s00778-019-00562-z
    * Abolfazl Asudeh, et al. Scalable Signal Reconstruction for a Broad Range of Applications. Communications of the ACM (CACM), Vol. 64(2), pages 106–115, 2021, ACM.
    https://dl.acm.org/doi/pdf/10.1145/3441689
    * Ives, Zachary G. "Technical perspective: Solving the signal reconstruction problem at scale." Communications of the ACM 64.2 (2021): 105-105.
    https://dl.acm.org/doi/pdf/10.1145/3441688

    BIO
    Abolfazl Asudeh is an assistant professor of Computer Science at the University of Illinois Chicago and the director of Innovative Data Exploration Laboratory (InDeX Lab). He is a VLDB Ambassador, and a regular PC member of Database flagship venues.
    His research spans to different aspects of Big Data and Data Science, including Data Management, Information Retrieval, and Data Mining, for which he designs efficient, accurate, and scalable algorithmic solutions.
    Data Equity Systems, Algorithmic Fairness, and Data-centric AI are his major focus in research. His research interest also includes: Ranking, Social Networks Analysis, Large-Scale Computation on Limited Resources, Computational Fact Checking, Data management for Machine Learning, Web Databases, and applied Computational Geometry.
    https://asudeh.github.io/

  • (8/12, 12:00pm): Cheng Tan (Northeastern University): Constructing Certified Neural Networks for Computer Systems details)

    TITLE
    Constructing Certified Neural Networks for Computer Systems

    ABSTRACT
    Deep learning and neural networks are powerful tools. Applying them in computer systems---operating systems, databases, and networked systems---attracts much attention. However, neural networks are complicated black boxes that sometimes produce unexpected results. It is risky to use networks with uncertain behaviors in computer systems that require strict safety rules.

    To tame networks' uncertainty, we introduce Ouroboros, a system that constructs certified neural networks. Certified neural networks are neural networks that satisfy user-defined safety properties, named specifications. Ouroboros achieves this through a training-verification loop that combines deep learning training and neural network verification. In addition, Ouroboros supports new categories of specifications that enable developers to specify safety properties for systems applications. With Ouroboros, system developers can have faith in their neural networks.

    RELATED WORK
    * APSys'21: http://naizhengtan.github.io/doc/papers/building21tan.pdf

    SHORT BIO
    Cheng Tan is an assistant professor of Khoury College of Computer Sciences at Northeastern University. His research interests are in systems and security, focusing on building verifiable outsourced services and certified neural networks for systems. His work has won the SOSP’17 best paper award and Janet Fabri Prize for Outstanding Dissertation.
    https://naizhengtan.github.io/

  • (8/19, 12:00pm): Vibhav Gogate (UT Dallas): Algorithms for Solving The Constrained Most Probable Explanation Problem details)

    TITLE
    Algorithms for Solving The Constrained Most Probable Explanation Problem

    ABSTRACT
    Recently there has been growing interest in developing probabilistic models that are robust to minor perturbations in input and are explainable in that they are able to explain why they made a particular decision to a user. In this talk, I will describe a new unifying optimization task called constrained most probable explanation (CMPE), and show that the two aforementioned tasks, making models robust and explainable, can be reduced to CMPE. I will show that CMPE is strongly NP-hard in general on arbitrary probabilistic models, but only weakly NP-hard on probabilistic models having small k-separators (a sub-class of tractable models that admit poly-time marginal inference). The main virtue of this weakly NP-hard property is that we can leverage it to derive efficient cutset sampling and local search approximations for lower bounding the optimal value of CMPE. For upper bounding, I will present efficient approaches that combine graph-based partitioning techniques with approximations developed in the literature on knapsack problems. These upper bounding techniques are guaranteed to be better than linear programming based bounds on probabilistic models that admit tractable marginal inference and have much smaller computational complexity. I will end my talk by presenting experimental results and applications as well as avenues for future work.
    (Joint work with Sara Rouhani, Rohith Peddi and Tahrima Rahman)

    RELEATED PAPERS
    * Tahrima Rahman, Sara Rouhani, and Vibhav Gogate. Novel upper bounds for the constrained most probable explanation task. NeurIPS, 2021
    https://proceedings.neurips.cc/paper/2021/file/4fc7e9c4df30aafd8b7e1ab324f27712-Paper.pdf
    * Sara Rouhani, Tahrima Rahman, and Vibhav Gogate. A novel approach for constrained optimization in graphical models. NeurIPS, 2020
    https://proceedings.neurips.cc/paper/2020/file/8ab9bb97ce35080338be74dc6375e0ed-Paper.pdf
    * Sara Rouhani, Tahrima Rahman, and Vibhav Gogate. Algorithms for the nearest assignment problem. IJCAI, pages 5096-5102, 2018
    https://www.ijcai.org/proceedings/2018/0707.pdf

    BIO
    Vibhav Gogate is an Associate Professor in the Computer Science Department at the University of Texas at Dallas. He got his Ph.D. at University of California, Irvine in 2009 and then did a two-year post-doc at University of Washington. His research interests are in AI, machine learning and data mining. His ongoing focus is on probabilistic graphical models; their first-order logic based extensions such as Markov logic and probabilistic programming; tractable probabilistic models and explainable AI. He is a recipient of the national science foundation CAREER award and the co-winner of 2010 and 2012 UAI inference competitions.
    https://cs.utdallas.edu/people/faculty/gogate-vibhav/

Spring 2022 (Mirek / part of cs7280)

  • (1/24, 3:00pm): Phil Bernstein (Microsoft Research): Adding Data Management to Orleans – A Journey details)

    TITLE
    Adding Data Management to Orleans – A Journey

    ABSTRACT
    In this talk, I'll describe my work over eight years to add database features to the Orleans object-oriented programming framework: replication, geo-distribution, transactions, and indexing. The challenge is how to do it when storage is a plug-in service that you don’t control. In this talk, I’ll describe the journey, summarizing the main technical ideas and recounting the ups and downs of a friendly collaboration with a product group. Along the way, I’ll discuss where database system research problems come from and the special challenges of industrial research.

    BIO
    Philip A. Bernstein is a Distinguished Scientist at Microsoft Research. Over the past 40 years, he has been a product architect at Microsoft and Digital Equipment Corp., a professor at Harvard University, and a VP Software at Sequoia Systems. From 1994-1996 he was architect of Microsoft Repository, a metadata store for Visual Studio and SQL Server. He has published over 150 papers and two books on the theory and implementation of database systems, especially on transaction processing and data integration, as well as many other database topics. He is an ACM Fellow, a winner of the ACM SIGMOD Innovations Award, a member of the Washington State Academy of Sciences, and a member of the National Academy of Engineering. He received a B.S. degree from Cornell and M.Sc. and Ph.D. from University of Toronto.

  • (1/31, 3:00pm): Jonathan Goldstein (Microsoft Research): A.M.B.R.O.S.I.A. - Conferring Immortality on Distributed Applications details)

    TITLE
    A.M.B.R.O.S.I.A. - Conferring Immortality on Distributed Applications

    ABSTRACT
    When writing today's distributed programs, which frequently span both devices and cloud services, programmers are faced with complex decisions and coding tasks around coping with failure, especially when these distributed components are stateful. If their application can be cast as pure data processing, they benefit from the past 40-50 years of work from the database community, which has shown how declarative database systems can completely isolate the developer from the possibility of failure in a performant manner. Unfortunately, while there have been some attempts at bringing similar functionality into the more general distributed programming space, a compelling general-purpose system must handle non-determinism, be performant, support a variety of machine types with varying resiliency goals, and be language agnostic, allowing distributed components written in different languages to communicate. This talk describes the first system, publicly available on GitHub, called Ambrosia, to satisfy all these requirements. We coin the term "virtual resiliency", analogous to virtual memory, for the platform feature which allows failure oblivious code to run in a failure resilient manner. We also introduce a programming construct, the "impulse", which resiliently handles non-deterministic information originating from outside the resilient component. Of further interest to our community is the effective reapplication of much database performance optimization technology to make Ambrosia more performant than many of today's non-resilient cloud solutions.

    RELATED LINKS
    * https://github.com/Microsoft/AMBROSIA

    * http://www.vldb.org/pvldb/vol13/p588-goldstein.pdf/

    BIO
    Over the last 20 years, I have worked at Microsoft in a combination of research and product roles. In particular, I've spent about 20 years as a researcher at MSR, doing fundamental research in streaming, big data processing, databases, and distributed computing. My style of working is to attack difficult problems, and through fundamental understanding and insight, create new artifacts that enable important problems to be solved in better ways. For instance, my work on streaming data processing enabled people with real time data processing problems to specify their processing logic in new, powerful ways, and also resulted in an artifact called Trill, which was orders of magnitude more performant than anything which preceded it. Within the academic community, I have published many papers, some with best paper awards (e.g. Best Paper Award at ICDE 2012), and two with test of time awards (e.g. SIGMOD 2011 Test of Time award and ICDT 2018 Test of Time award), and have also taken many organizational roles in database conferences. My research has also had significant impact on many Microsoft products, including SQL Server, Office, Windows, Bing, and Halo, as well as leading to the creation of entirely new products like Microsoft StreamInsight, Azure Streaming Analytics, Trill, and most recently, Ambrosia. I spent 5 years building Microsoft StreamInsight, serving as a founder and architect for the product. Trill has become the de-facto standard for temporal and stream data processing within Microsoft, and years after creation, is still the most expressive and performant general purpose stream data processor in the world. I am also an inventor of 30+ patents.

  • (2/16, 3:00pm): Soheil Behnezhad (Stanford): Large-Scale Graph Algorithms: Massively Parallel Computation details)

    TITLE
    Large-Scale Graph Algorithms: Massively Parallel Computation

    ABSTRACT
    Graphs today are massive. Social network graphs, the web graph, and models of the brain are just a few examples of graphs with billions of vertices and trillions of edges. Graphs of this sheer size do not fit a single machine's memory, invalidating many assumptions of traditional algorithms. Massively Parallel Computation (à la MapReduce) is a popular method of processing such large-scale problems which has been extremely successful in practice. The idea is to distribute the workload into several machines running in parallel that communicate over rounds. In this talk, I will give an introduction to two algorithms for maximal matching and graph connectivity in the MPC model. The maximal matching algorithm takes O(log log n) rounds, which exponentially improves upon previous logarithmic-round algorithms known from the 1980s. I will also present a conditionally near-optimal algorithm for graph connectivity in this model.

    BIO
    Soheil Behnezhad is currently a Motwani Postdoctoral Fellow at Stanford and will join the Khoury College of Computer Sciences at Northeastern as an Assistant Professor in Fall 2022. He obtained his PhD from the University of Maryland. Soheil's research focus is on the theoretical foundations of big data algorithms. He studies algorithms in a variety of large-scale computation settings, including massively parallel computation (MapReduce), graph sparsification, streaming algorithms, and dynamic algorithms.

  • (2/23, 3:00pm): Divy Agrawal (UC Santa Barbara): Big Data, Small Summaries details)

    ABSTRACT
    During the past two decades we have seen an unprecedented increase in the amount of data that is being generated from numerous internet-scale applications. As hundreds of millions to billions of users interact with these applications, there is a continuous flow of interaction or log data that is collected by internet companies hosting these applications. Before this data can be subject to modeling and analysis, it becomes necessary to obtain summary statistics such as the cardinality of unique visitors, frequency counts of users from different states or countries, and in general finding the quantile and median information from the dataset. Efficient algorithms exist for computing the exact information over the data. Unfortunately, these algorithms require a considerable amount of time, scanning the data multiple times, or require additional storage that is linear to the size of the dataset itself. Approximation methods, with guaranteed error bounds, developed in the context of streaming data are extremely effective to extract useful and relatively accurate knowledge from big data. In this talk, we will start with a review of one such approximation technique for estimating frequency counts over streaming data where only insert operations are permitted. We will then present refinement of this technique in the context of large datasets that need to handle both insert and delete operations. We will conclude the presentation with an application of the frequency counting technique for an innovative internet caching architecture in which a small front-end caching at the client side significantly mitigates the load imbalance problem at the back-end cache servers. The meta-objective of this talk is to demonstrate the usefulness of approximation techniques in the context of real applications.

    SPEAKER BIOGRAPHY
    Divyakant Agrawal is a Professor of Computer Science at the University of California at Santa Barbara. His research expertise is in the areas of distributed systems and databases. He is a fellow of the AAAS, ACM, and IEEE. He has also had visiting appointments at IBM Almaden Research Labs, NEC Research, ASK.com, Google, Qatar Computing Research Institute, and National University of Singapore over the course of his professional career.

  • (2/28, 3:00pm): Floris Geerts (U of Antwerp): Matrix query languages details)

    TITLE
    Matrix query language

    ABSTRACT
    Imagine that instead of the relational data model and relational algebra,
    data would be stored as matrices and queries would consist of linear algebra
    operators. How would such a matrix query language look like? What can we
    do and cannot do in such a language? And how does it relate to query languages
    based on first-order logic or the relational algebra? How do these languages
    connect to arithmetic circuits, which are often said to "capture" linear algebra?
    In this talk we answer some of these questions.

    BIO
    Floris Geerts holds a Professor position at the University of Antwerp, Belgium. Before that, he was a senior research fellow in the database group at the University of Edinburgh, and held a postdoc position in the data mining group at the University of Helsinki. He received his PhD in 2001 at the University of Hasselt, Belgium. His research interests include the theory and practice of databases, the study of data quality, and more recently, the interaction between linear algebra, relational databases and graph neural networks.

    PAPERS

    [1] Floris Geerts, Thomas Muñoz, Cristian Riveros, Jan Van den Bussche, Domagoj Vrgoc: Matrix Query Languages. SIGMOD Rec. 50(3): 6-19 (2021). https://sigmodrecord.org/?smd_process_download=1&download_id=%2011461.

    [2] Floris Geerts, Thomas Muñoz, Cristian Riveros, Domagoj Vrgoc: Expressive Power of Linear Algebra Query Languages. PODS 2021: 342-354. https://dl.acm.org/doi/10.1145/3452021.3458314.

    [3] Robert Brijder, Floris Geerts, Jan Van den Bussche, Timmy Weerwag: On the Expressive Power of Query Languages for Matrices. ACM Trans. Database Syst. 44(4): 15:1-15:31 (2019). https://dl.acm.org/doi/10.1145/3331445.

    [4] Floris Geerts: On the Expressive Power of Linear Algebra on Graphs. Theory Comput. Syst. 65(1): 179-239 (2021). https://drops.dagstuhl.de/opus/volltexte/2019/10309/pdf/LIPIcs-ICDT-2019-7.pdf.

  • (3/2, 3:00pm): Floris Geerts (U of Antwerp): Graph Neural Networks through the eyes of matrix query languages details)

    ABSTRACT
    Graph neural networks (GNNs) are commonly used to compute graph/vertex
    embeddings. The capabilities of GNNs are often measured in terms of their distinguishing
    power, that is, their ability to distinguish graphs/vertices based on the computed embeddings.
    In this talk we view GNNs as queries in a matrix query language, dubbed "Tensor Language".
    We show that by using this language, we can obtain insights on the distinguishing power
    of GNNs and that it provides an easy toolbox for GNN designers to analyze their own GNNs.

    BIO
    Floris Geerts holds a Professor position at the University of Antwerp, Belgium. Before that, he was a senior research fellow in the database group at the University of Edinburgh, and held a postdoc position in the data mining group at the University of Helsinki. He received his PhD in 2001 at the University of Hasselt, Belgium. His research interests include the theory and practice of databases, the study of data quality, and more recently, the interaction between linear algebra, relational databases and graph neural networks. 

    PAPERS
    [1] Floris Geerts, Juan L Reutter: Expressiveness and Approximation Properties of Graph Neural Networks, ICLR 2022.
    [2] Martin Grohe: The Logic of Graph Neural Networks. LICS 2021: 1-17.
    [3] Martin Grohe: Word2vec, node2vec, graph2vec, X2vec: Towards a Theory of Vector Embeddings of Structured Data. PODS 2020: 1-16.

  • (3/7, 3:00pm): Manos Athanassoulis (Boston U): Building Robust LSM-based Key-Value Stores details)

    ABSTRACT
    Log-structured merge (LSM) trees are becoming the de facto standard for write-intensive storage layers for both production NoSQL data stores and relational systems. As LSM-based systems are used by various applications and deployed in shared infrastructure (e.g., public or private cloud), they are tasked to support a number of requirements varying from performance, to cost, and privacy, being robust to external requirements and the inherent workload unpredictability. At the heart of any LSM-based engine, we have the background re-organization mechanism (or compaction), the behavior of which affects essentially every aspect of the LSM-tree including write amplification, write throughput, point and range lookup performance, space amplification, and delete performance. In this presentation, we will first introduce in detail the design space of LSM compactions and discuss their tradeoffs, we will then take a deep dive on a new family of delete-aware compactions. We will define as a design goal the delete persistence latency, and discuss how to bound it. Finally, we will discuss the importance of tuning in the presence of uncertainty and present a sneak-peek of a new methodology for near-optimal LSM tuning in the presence of uncertainty of the expected workload vs. the observed one.

    BIO
    Manos Athanassoulis is an Assistant Professor of Computer Science at Boston University, Director and Founder of the BU Data-intensive Systems and Computing Laboratory and co-director of the BU Massive Data Algorithms and Systems Group. His research is in the area of data management focusing on building data systems that efficiently exploit modern hardware (computing units, storage, and memories), are deployed in the cloud, and can adapt to the workload both at setup time and, dynamically, at runtime. Before joining Boston University, Manos was a postdoc at Harvard University, earlier he obtained his PhD from EPFL, Switzerland, and spent one summer at IBM Research, Watson. Manos’ work has been recognized by awards like “Best of SIGMOD” in 2016, “Best of VLDB” in 2010 and 2017, and “Most Reproducible Paper” at SIGMOD in 2017, and has been supported by NSF and industry funds including two Red Hat Incubation Awards, a Facebook Faculty Research Award and gifts from Meta, Cisco, and Red Hat.

    RELEVANT PAPERS (especially 1 & 3)
    1. Constructing and Analyzing the LSM Compaction Design Space
    Proceedings of the VLDB Endowment, Vol. 14(11), 2021
    Subhadeep Sarkar, Dimitris Staratzis, Zichen Zhu, Manos Athanassoulis
    2. Reducing Bloom Filter CPU Overhead in LSM-Trees on Modern Storage Devices
    Proceedings of the International Workshop on Data Management on New Hardware (DaMoN), 2021
    Zichen Zhu, Ju Hyoung Mun, Aneesh Raman, Manos Athanassoulis
    3. Lethe: A Tunable Delete-Aware LSM Engine
    Proceedings of the ACM SIGMOD International Conference on Management of Data, 2020
    Subhadeep Sarkar, Tarikul Islam Papon, Dimitris Staratzis, Manos Athanassoulis
    4. And a paper under review: https://arxiv.org/abs/2110.13801

  • (3/21, 3:00pm): Yael Amsterdamer (Bar-Ilan University): Automated Selection of Multiple Datasets for Extension by Integration details)

    ABSTRACT
    Organizations often seek to extend their data by integration with available datasets originating from external sources. While there are many tools that recommend how to perform the integration for given datasets, the selection of what datasets to integrate is often challenging in itself. First, the relevant candidates must be efficiently identified among irrelevant ones. Next, relevant datasets need to be evaluated according to issues such as low quality or poor matching to the target data and schema. Last, jointly integrating multiple datasets may have significant benefits such as increasing completeness and information gain, but may also greatly complicate the task due to dependencies in the integration process.
    To assist administrators in this task, we quantify to what extent an integration of multiple datasets is valuable as an extension of an initial dataset and formalize the computational problem of finding the most valuable subset to integrate by this measure. We formally analyze the problem, showing that it is NP-hard; we nevertheless introduce heuristic efficient algorithms, which our experiments show to be near-optimal in practice and highly effective in finding the most valuable integration.

    SHORT BIO
    Yael Amsterdamer is a Senior Lecturer at the Department of Computer Science, Bar-Ilan University (Ramat-Gan, Israel), and the head of the Data Management Lab. She received her Ph.D. in Computer Science from Tel-Aviv University, and has been a visiting Scholar at the University of Pennsylvania, Philadelphia, PA and jointly at Télécom Paris and INRIA institute (Paris, France). Her research is in the field of data management spanning topics such as crowd-powered data management, provenance and interactive summarization. She has won competitive grants such as the Israeli Science Foundation and Ministry of Science and regularly serves in program committees of top conferences.

    RELEVANT PAPER
    https://dl.acm.org/doi/10.1145/3459637.3482322

  • (3/23, 3:00pm): Avigdor Gal (Technion): Loch Data and Other Monsters: on Creating Data Ecosystems, the Intelligent Way details)

    TITLE
    Loch Data and Other Monsters: on Creating Data Ecosystems, the Intelligent Way

    ABSTRACT
    A data ecosystem offers an alliance-driven infrastructure that enables the interaction of different stakeholders and the resolution of interoperability issues among shared data. Despite years of research in data governance and management, trustability is still affected by the absence of transparent and traceable data-driven pipelines. Data integration is the main facilitator of such data-driven pipelines and matching is a task at the heart of any data integration process, aimed at identifying correspondences among data elements. Matching problems were traditionally performed in a semi-automatic manner, with correspondences being generated by matching algorithms and outcomes subsequently validated by human experts. Human-in-the-loop data integration has been recently challenged by the introduction of big data and recent studies have analyzed obstacles to effective human matching and validation. In this talk, we focus on the tension between human and machine matching. We propose a novel data ecosystem architecture that relies on both human knowledge and machine learning and offer a concrete algorithmic solution for effective data integration within this architecture. In particular, we shall present the limitations of human matching and offer a method for learning to characterize reliable and valuable matching experts.

    SHORT BIO
    Avigdor Gal is The Benjamin and Florence Free Chaired Professor of data science at the Technion - Israel Institute of Technology, where he heads the Big Data Integration laboratory. He specializes in various aspects of data management and mining with about 150 publications in leading journals, books, and conference proceedings. In the current age of big data, his research is focused on developing novel models and algorithms for data integration. In the past he gave keynotes and tutorials in leading conferences in the areas of data and process management. Avigdor Gal is a recipient of the prestigious Yannai award for excellence in academic education, and multiple best paper and test-of-time awards.

  • (3/28, 3:00pm): Magda Balazinska (U Washington): Video Data Management details)

    Abstract:

    The proliferation of inexpensive high-quality cameras coupled
    with recent advances in machine learning and computer vision have enabled new applications on video data.
    This in turn has renewed interest in video data management systems (VDMSs).
    In this talk, we explore how to build a modern data management system
    for video data. We focus, in particular, on the storage manager
    and present several techniques to store video data in a way that accelerates
    queries over that data. We then move up the stack and discuss different
    types of data models that can be exposed to applications. Finally, we
    discuss the additional challenges of the end-to-end video
    analytics pipeline and how a VDMS can support applications throughout
    that pipeline.

    Bio:

    Magdalena Balazinska is Professor and Director of the Paul G. Allen School of Computer Science & Engineering at the University of Washington. Magdalena's research interests are in the field of database management systems. Her current research focuses on data management for data science, big data systems, cloud computing, and image and video analytics. Prior to her leadership of the Allen School, Magdalena was the Director of the eScience Institute, the Associate Vice Provost for Data Science, and the Director of the Advanced Data Science PhD Option. She also served as Co-Editor-in-Chief for Volume 13 of the Proceedings of the Very Large Data Bases Endowment (PVLDB) journal and as PC co-chair for the corresponding VLDB'20 conference. Magdalena is an ACM Fellow. She holds a Ph.D. from the Massachusetts Institute of Technology (2006). Shortly after her arrival at the University of Washington, she was named a Microsoft Research New Faculty Fellow (2007). Magdalena received the inaugural VLDB Women in Database Research Award (2016) for her work on scalable distributed data systems. She also received an ACM SIGMOD Test-of-Time Award (2017) for her work on fault-tolerant distributed stream processing and a 10-year most influential paper award (2010) from her earlier work on reengineering software clones.

  • (3/30, 3:00pm): Michael Benedikt (Oxford University): New ideas and old tasks for reasoning and data management details)

    TITLE
    New ideas and old tasks for reasoning and data management

    ABSTRACT
    Reasoning is relevant to many aspects of data management, including schema design, data integration, and query optimization. In this talk I give a quick tour of a few major innovative techniques developed in the last decades for applying reasoning to data management. I'll then discuss some of the roadblocks that have been encountered in applying these tools, and mention -- briefly -- some of our current projects to overcome these issues. I'll touch upon work on extending reasoning-based query optimization to complex data types, applying data management-for-reasoning to reasoning-for-data management, mapping rule learning, and applying RL-guided theorem-proving to query optimization.

    RELATED PAPERS
    1. M. Benedikt. How Can Reasoners Simplify Database Querying (And Why Haven't They Done It Yet)? PODS 2018.
    https://doi.org/10.1145/3196959.3196989
    2. M. Benedikt, P. Pradic. Generating collection transformations from proofs. POPL 2021.
    https://dl.acm.org/doi/10.1145/3434295

    BIO
    Michael Benedikt is Professor of Computer Science at Oxford University. Prior to that he was distinguished member of technical staff at Bell Laboratories.

  • (4/11, 3:00pm): Joe Hellerstein (UC Berkeley): A Programmable Cloud: CALM Foundations and Open Challenges details)

    ABSTRACT:
    The public cloud emerged a decade ago, yet distributed systems are still programmed using models from sequential computing. All the traditional challenges of distributed programming and data are still present in the cloud, only they are now faced by the general population of software developers. Added to these challenges are new desires for "serverless" computing, including consumption-based pricing and autoscaling.
     
    This talk will highlight principles for cloud programming that I have explored with colleagues over the past decade, including the CALM Theorem and languages like Dedalus and Bloom that encourage monotonic coordination-free consistency via logic and lattices. The Anna "any-scale" KVS will be presented as a petri dish for the potential of these ideas and many remaining challenges.

    Time permitting, I will conclude by overviewing new work in the Hydro project, which is aimed at bringing research ideas to programmers in an practical, evolutionary fashion. Key to our approach is a separation of distributed programs into a PACT of four facets: Program semantics, Availablity, Consistency and Trust. We propose to migrate developers gradually to PACT programming by lifting familiar code into our more declarative level of abstraction. This agenda raises challenges across multiple areas including language design, query optimization, transactions, distributed consistency, compilers and program synthesis.

    BIO:
    Joe Hellerstein is the Jim Gray Professor of Computer Science at the University of California, Berkeley, and currently on leave as a Faculty Fellow at Sutter Hill Ventures. His research focuses on data-centric systems and the way they drive computing. In addition to his academic work, Hellerstein has been involved in a number of startup companies including Trifacta, which brought academic research on data wrangling to market.

    Relevant papers:

  • (4/13, 3:00pm): Reinhard Pichler (TU Vienna): Hypertree Decompositions and all that... details)

    Abstract:
    In the first part of the talk, I will revisit various notions of decompositions and widths and recapitulate some of their crucial properties. In particular, we thus mention:
    *) hypertree-width (hw)
    *) generalized hypertree-width (ghw)
    *) fractional hypertree-width (fhw)

    Time permitting, I will then present some recent results on the task of actually computing such decompositions:
    *) NP-hardness of checking if fhw or ghw is at most 2
    *) Tractable cases of ghw and fhw computation
    *) Parallel approach to computing ghw and hw

    Curriculum Vitae:
    I got a master's degree from the University of Innsbruck (Mag.rer.nat. in Mathematics, 1991), another master's degree from the University of London (MSc in Mathematical Computation, 1992, at the QMW College), and a PhD from the TU Wien (Dr.techn. in Computer Science, 2000). In May 2001, I received my "habilitation" in Theoretical Computer Science.

    From 1992 to 2005, I worked in the Program and Systems Engineering Department (PSE) of the Siemens AG Österreich. Since July 2005, I have been a professor in the Databases and Artificial Intelligence Group (DBAI) of the TU Wien.

    Relevant publications:

    Wolfgang Fischl, Georg Gottlob, Reinhard Pichler:
    General and Fractional Hypertree Decompositions: Hard and Easy Cases. PODS 2018: 17-32.
    (full version in J. ACM, 2021)

    Wolfgang Fischl, Georg Gottlob, Davide Mario Longo, Reinhard Pichler: HyperBench: A Benchmark and Tool for Hypergraphs and Empirical Findings. PODS 2019: 464-480
    (full version in ACM J. Exp. Alg., 2021)

    Georg Gottlob, Matthias Lanzinger, Reinhard Pichler, Igor Razgon: Fractional Covers of Hypergraphs with Bounded Multi-Intersection. MFCS 2020: 41:1-41:14

    Hubie Chen, Georg Gottlob, Matthias Lanzinger, Reinhard Pichler: Semantic Width and the Fixed-Parameter Tractability of Constraint Satisfaction Problems. IJCAI 2020: 1726-1733

    Georg Gottlob, Cem Okulmus, Reinhard Pichler:
    Fast and Parallel Decomposition of Constraint Satisfaction Problems. IJCAI 2020: 1155-1162.

    Georg Gottlob, Matthias Lanzinger, Cem Okulmus, Reinhard Pichler: Fast Parallel Hypertree Decompositions in Logarithmic Recursion Depth. CoRR abs/2104.13793 (2021)
    To appear at PODS 2022

  • (4/25, 3:00pm): Huan Sun (Ohio State): Self-supervised Pre-training on Tabular Data details)

    Abstract:
    Pre-training/fine-tuning paradigms have transformed the natural language processing field. For tabular data-based tasks, however, their potential has been far less explored. In this talk, we will discuss our recent efforts: (1) TURL, a pre-training/fine-tuning paradigm on relational Web tables, which benefits a wide range of tasks for table understanding (e.g., relation extraction, entity linking). (2) StruG, a weakly supervised Structure-Grounded pretraining framework for text-to-SQL, which effectively learns to capture the text-table alignment essential for the task. (3) ReasonBERT, a pre-training method that augments language models for multi-step reasoning over hybrid contexts (textual and tabular). Among them, we will cover TURL in greater detail. Finally, we will conclude the talk with some of the open questions in the space.

    Bio:
    Huan Sun is an assistant professor in the Department of Computer Science and Engineering at the Ohio State University. Before joining OSU, she was a visiting scientist at the University of Washington (01-06/2016), received a Ph.D. in Computer Science from University of California, Santa Barbara (2015) and a B.S. in EEIS from the University of Science and Technology of China (2010). Her research interests lie in natural language processing, data mining and management, and artificial intelligence, with emphasis on building various kinds of natural language interfaces, task-oriented dialogue and conversational AI systems. Huan received the 2022 SIGMOD Research Highlight Award, 2021 BIBM Best Paper Award, Google Research Scholar Award (2022), NSF CAREER Award (2020), OSU Lumley Research Award (2020), SIGKDD Ph.D. Dissertation Runner-Up Award (2016), among others.

    Relevant papers:
    Turl: https://arxiv.org/abs/2006.14806 Please feel free to check out more papers on learning neural representations for tables mentioned in this tutorial. Unfortunately, we won’t be able to cover them in this talk.
    StruG: https://arxiv.org/abs/2010.12773
    ReasonBert: https://arxiv.org/abs/2109.04912

  • (4/27, 3:00pm): Yanlei Diao (UMass): UDAO: A Next-Generation Cloud Data Analytics Optimizer via Large-Scale Machine Learning details)

    ABSTRACT
    Data analytics in the cloud has become an integral part of enterprise businesses. Big data analytics systems, however, still lack the ability to take task objectives such as user performance goals and budgetary constraints and automatically configure an analytical job to achieve these objectives. This talk presents UDAO, a Unified Data Analytics Optimizer, that can automatically determine a cluster configuration with a suitable number of cores as well as other system parameters that best meet the task objectives. At a core of our work is a principled multi-objective optimization (MOO) approach that computes a Pareto optimal set of configurations to reveal tradeoffs between different objectives, recommends a new cluster configuration that best explores such tradeoffs, and employs novel optimizations to enable such recommendations within a second. Such optimization is further enabled by a Deep Learning-based modeling approach that can learn a model for each user objective as complex as necessary for the underlying computing environment. Evaluation using production workloads shows that our optimizer could reduce 36-67% latency and, at the same time, 37-75% cost in an existing cloud analytics system, while running under 200 msec.

    BIO
    Yanlei Diao is Professor of Computer Science at Ecole Polytechnique, France and the University of Massachusetts Amherst, USA. She recently joined Amazon as an Amazon Scholar. Her research interests lie in big data analytics and scalable intelligent information systems, with a focus on optimization in cloud analytics, data stream analytics, explanation discovery, interactive data exploration, and uncertain data management. She received her PhD in Computer Science from the University of California, Berkeley in 2005.

    Prof. Diao was a recipient of the 2016 ERC Consolidator Award, 2013 CRA-W Borg Early Career Award (one female computer scientist selected each year for outstanding contributions), IBM Scalable Innovation Faculty Award, and NSF Career Award. She has given keynote speeches at the ACM DEBS Conference, the ExploreDB workshop, and the Distinguished Lecture Series at the IBM Almaden Research Center, the University of Texas at Austin and Technische Universitaet Darmstadt. She has served as Editor-in-Chief of the ACM SIGMOD Record, Associate Editor of ACM TODS, Chair of the ACM SIGMOD Research Highlight Award Committee, and member of the SIGMOD and PVLDB Executive Committees. She was PC Co-Chair of IEEE ICDE 2017 and ACM SoCC 2016, and served on the organizing committees of SIGMOD, PVLDB, and CIDR, as well as on the program committees of many international conferences and workshops.

Fall 2021 (Wolfgang)

  • (9/2, 12:00pm): Robin Walters (Northeastern University): Equivariant Neural Networks for Learning Spatiotemporal Dynamics details)

    TITLE
    Equivariant Neural Networks for Learning Spatiotemporal Dynamics

    ABSTRACT
    Applications such as climate science and transportation require learning complex dynamics from large-scale spatiotemporal data. Existing machine learning frameworks are still insufficient to learn spatiotemporal dynamics as they often fail to exploit the underlying physics principles. Representation theory can be used to describe and exploit the symmetry of the dynamical system. We will show how to design neural networks that are equivariant to various symmetries for learning spatiotemporal dynamics. Our methods demonstrate significant improvement in prediction accuracy, generalization, and sample efficiency in forecasting turbulent flows and predicting real-world trajectories. This is joint work with Rose Yu, Rui Wang, and Jinxi Li.

    RELATED PAPERS
    * Walters, R.*, Wang, R.*, Yu, R. (2021). Incorporating Symmetry into Deep Dynamics Models for Improved Generalization. International Conference on Learning Representations (ICLR). https://arxiv.org/abs/2002.03061
    * Walters, R.*, Li, J.*, Yu R. (2021). Trajectory Prediction using Equivariant Continuous Convolution. International Conference on Learning Representations (ICLR). https://arxiv.org/abs/2010.11344

    BIO
    Robin Walters is a postdoctoral research fellow in the Khoury College of Computer Sciences. He joined Khoury in July 2020 through the Experiential AI program. Formerly, Robin was a Zelevinsky Research Instructor in the Mathematics Department at Northeastern. His research studies the connections between representation theory and differential equations both theoretically and practically using equivariant neural networks.

  • (9/9, 12:00pm): Wolfgang Gatterbauer (Northeastern University): Structure-preserving diagrams for relational queries and first-order logic details)

    TITLE
    Structure-preserving diagrams for relational queries and first-order logic

    ABSTRACT
    The relative expressiveness of relational query languages has been studied extensively: the logical expressiveness of particular fragments of relational algebra, relational calculus, Datalog with negation and restricted SQL under set semantics is well known to be equivalent. Yet what does it take to represent ``logical patterns'' and compare languages by their abilities to represent particular structures of reasoning that are common across various syntactic conventions?
    We describe a complete diagrammatic representation system that preserves the logical structure for non-disjunctive safe relational calculus. It also solves a problem that has vexed the logical community for over 100 years: finding an unambiguous and complete representation system for first-order logic sentences.
    Topics to discuss: 1. Why are disjunctions inherently more difficult to visualize than conjunctions? 2. Why do visual query languages based on relational algebra lack in their abilities to represent certain structural patterns? 3. A discussion of 3 common ``abuses of the line" that have created conceptual difficulties in past related work (with a particular focus on the influential and widely studied Peirce's existential beta graphs). 4. Ideas on how to solve the representation problem for disjunctions.

    RELATED WORK
    https://queryvis.com/

    SHORT BIO
    Wolfgang Gatterbauer is an Associate Professor in the Khoury College of Computer Sciences at Northeastern University. A major focus of his research is to extend the capabilities of modern data management systems in generic ways and to allow them to support novel functionalities that seem hard at first.
    https://gatterbauer.name

  • (9/16, 12:00pm): Amir Ilkhechi (Brown University): DeepSqueeze: Deep Semantic Compression for Tabular Data details)

    TITLE
    DeepSqueeze: Deep Semantic Compression for Tabular Data

    ABSTRACT
    With the rapid proliferation of large datasets, efficient data compression has become more important than ever. Columnar compression techniques (e.g., dictionary encoding, run-length encoding, delta encoding) have proved highly effective for tabular data, but they typically compress individual columns without considering potential relationships among columns, such as functional dependencies and correlations. Semantic compression techniques, on the other hand, are designed to leverage such relationships to store only a subset of the columns necessary to infer the others, but existing approaches cannot effectively identify complex relationships across more than a few columns at a time.
    In this talk, I will describe DeepSqueeze, a novel semantic compression framework that can efficiently capture these complex relationships within tabular data by using autoencoders to map tuples to a lower-dimensional representation. DeepSqueeze also supports guaranteed error bounds for lossy compression of numerical data and works in conjunction with common columnar compression formats. Our experimental evaluation uses real-world datasets to demonstrate that DeepSqueeze can achieve over a 4x size reduction compared to state-of-the-art alternatives.

    PAPER and recorded video from SIGMOD 2010:
    http://cs.brown.edu/people/acrotty/pubs/3318464.3389734.pdf
    https://dl.acm.org/doi/abs/10.1145/3318464.3389734

    BIO
    Amir Ilkhechi is a Ph.D. student in the Computer Science Department at Brown University, where he is a member of the Database Group advised by Professor Ugur Cetintemel. His research explores applications of deep learning to fundamental data management problems, most recently focusing on novel approaches to compression.

  • (9/23, 12:00pm): Roee Shraga (Northeastern University): PoWareMatch: a Quality-aware Deep Learning Approach to Improve Human Schema Matching details)

    TITLE
    PoWareMatch: a Quality-aware Deep Learning Approach to Improve Human Schema Matching

    ABSTRACT
    Schema matching is a core task of any data integration process. Being investigated in the fields of databases, AI, Semantic Web and data mining for many years, the main challenge remains the ability to generate quality matches among data concepts (e.g., database attributes). In this work, we examine a novel angle on the behavior of humans as matchers, studying match creation as a process. We analyze the dynamics of common evaluation measures (precision, recall, and f-measure), with respect to this angle and highlight the need for unbiased matching to support this analysis. Unbiased matching, a newly defined concept that describes the common assumption that human decisions represent reliable assessments of schemata correspondences, is, however, not an inherent property of human matchers. In what follows, we design PoWareMatch that makes use of a deep learning mechanism to calibrate and filter human matching decisions adhering the quality of a match, which are then combined with algorithmic matching to generate better match results. We provide an empirical evidence, established based on an experiment with more than 200 human matchers over common benchmarks, that PoWareMatch predicts well the benefit of extending the match with an additional correspondence and generates high quality matches. In addition, PoWareMatch outperforms state-of-the-art matching algorithms.

    RELATED WORK
    https://arxiv.org/abs/2109.07321

    BIO
    Roee Shraga is a Postdoctoral fellow at the Khoury College of Computer Sciences at Northeastern University. He received his PhD degree in 2020 from the Technion – Israel Institute of Technology in the area of Data Science. Roee has published more than a dozen papers in leading journals and conferences on the topics of data integration, human-in-the-loop, machine learning, process mining, and information retrieval. He is also a recipient of several PhD fellowships including the Leonard and Diane Sherman Interdisciplinary Fellowship (2017), the Daniel Excellence Scholarship (2019), and the Miriam and Aaron Gutwirth Memorial Fellowship (2020).

  • (9/30, 12:00pm): Can Qin (Northeastern University): Neural Pruning via Growing Regularization details)

    TITLE
    Neural Pruning via Growing Regularization

    ABSTRACT
    Regularization has long been utilized to learn sparsity in deep neural network pruning. However, its role is mainly explored in the small penalty strength regime. In this work, we extend its application to a new scenario where the regularization grows large gradually to tackle two central problems of pruning: pruning schedule and weight importance scoring. (1) The former topic is newly brought up in this work, which we find critical to the pruning performance while receives little research attention. Specifically, we propose an L_2 regularization variant with rising penalty factors and show it can bring significant accuracy gains compared with its one-shot counterpart, even when the same weights are removed. (2) The growing penalty scheme also brings us an approach to exploit the Hessian information for more accurate pruning without knowing their specific values, thus not bothered by the common Hessian approximation problems. Empirically, the proposed algorithms are easy to implement and scalable to large datasets and networks in both structured and unstructured pruning. Their effectiveness is demonstrated with modern deep neural networks on the CIFAR and ImageNet datasets, achieving competitive results compared to many state-of-the-art algorithms.

    RELATED WORK
    ICLR 2021: https://openreview.net/pdf?id=o966_Is_nPA

    BIO
    Can Qin has received the B.E. degree from the Xidian University (XDU), China, in 2018. Currently, he is a Ph.D. candidate in the Department of Electrical and Computer Engineering, Northeastern University, under the supervision of Prof. Yun Raymond Fu. He has been awarded the Best Paper Award in ICCV Workshop on Real-World Recognition from Low-Quality Images and Videos 2019. He also has some top-tier conference papers accepted at NeurIPS, AAAI, ECCV, ICLR, et al. His research interests broadly include the transfer learning and deep learning.

  • (10/7, 12:00pm): Senjuti Basu Roy (New Jersey Institute of Technology): Optimization Opportunities in Human-in-the-loop Systems details)

    TITLE
    Optimization Opportunities in Human-in-the-loop Systems

    ABSTRACT
    An emerging trend is to leverage an under-explored and richly heterogeneous pool of human knowledge inside machine algorithms, a practice popularly termed as human-in-the-loop (HIL) processes. A wide variety of applications, starting from sentiment analysis to image recognition, query processing to text translation, or even feature engineering stand to benefit from such synergistic man-machine collaboration. This talk will explore optimization opportunities inside such HIL systems, considering the roles and responsibilities of three key stakeholders - humans (workers), machines (algorithms), and platforms (online infrastructure where the work takes place). Optimization inside such HIL systems investigates judicious involvement of workers inside machine algorithms, as well as the desired functionality of the platforms to satisfy a variety of goals pertinent to the aforementioned stakeholders. Following that, the talk will specifically discuss both modeling as well as algorithmic challenges in task design and deployment inside large-scale HIL systems.

    RELATED WORK
    Multi-Session Diversity to Improve User Satisfaction in Web Applications. WWW 2021.
    https://dl.acm.org/doi/10.1145/3442381.3450046
    Recommending Deployment Strategies for Collaborative Tasks. SIGMOD 2020.
    https://dl.acm.org/doi/10.1145/3318464.3389719
    Making AI Machines Work for Humans in FoW. SIGMOD Record 2020.
    https://dl.acm.org/doi/10.1145/3442322.3442327

    BIO
    Senjuti Basu Roy is the Panasonic Chair in Sustainability and an Associate Professor at the Department of Computer Science at the New Jersey Institute of Technology. Her broader research interests lie in the area of large scale data management with the focus on designing principled algorithms for "human-in-the-loop" systems. She is the tutorial co-chair of The Web Conference 2022, has served as the Mentorship co-chair of SIGMOD 2018, PhD workshop co-chair of VLDB 2018, co-chair of SEADATA Workshop 2021 (colocated with VLDB 2021), and HMData Workshops 2017-2021 (colocated with IEEE BigData conference). She is a recipient of the NSF CAREER Award, and one of the 100 invited early career engineers to attend the National Academy of Engineering’s 2021 US Frontiers of Engineering Symposium.

  • (10/14, 12:00pm): Steven Holtzen (Northeastern University): Exploiting Symmetry for Scaling Discrete Factor Graph Inference details)

    TITLE
    Exploiting Symmetry for Scaling Discrete Factor Graph Inference

    ABSTRACT
    A key goal in the design of probabilistic inference algorithms is identifying and exploiting properties of the distribution that make inference tractable. One such property is symmetry, which is characterized by points in the distribution that are guaranteed to have the same probability. In this talk I will describe two inference algorithms for discrete factor graphs that scale in the degree of symmetry of the distribution. The first inference algorithm, called orbit generation, is the first exact inference algorithm for factor graphs that scales in the degree of symmetry of the distribution. The second inference algorithm is a Markov-Chain Monte-Carlo algorithm that mixes rapidly in the degree of symmetry.

    RELATED WORK
    Generating and Sampling Orbits for Lifted Probabilistic Inference. Steven Holtzen, Todd Millstein, and Guy Van den Broeck. In Uncertainty in Artificial Intelligence (UAI), 2019.
    https://arxiv.org/abs/1903.04672

    BIO
    Steven Holtzen is an assistant professor at Northeastern University. His research focuses on programming languages, artificial intelligence, and machine learning. In particular he is interested in probabilistic programming languages, foundations of probabilistic inference, tractable probabilistic modeling, automated reasoning, and probabilistic verification. His work has been recognized by an ACM SIGPLAN distinguished paper award.

  • (10/21, 12:00pm): Nesime Tatbul (MIT): Towards Explainable Time Series Anomaly Detection details)

    TITLE
    Towards Explainable Time Series Anomaly Detection

    ABSTRACT
    Time series is a ubiquitous data type with a growing range of applications from telemetry to finance. A key capability that lies at the core of managing time series data is the identification of unusual patterns of interest called anomalies. Detecting and explaining anomalies not only finds use in many mission-critical domains, but also empowers systems and users with the ability to handle large data volumes by guiding attention and resources to information that matters the most. Despite years of effort, diversity of time series applications, often noisy nature of datasets, and contextual variations in anomaly types and instances challenge the creation of robust and generalizable solutions. In this talk, I will present a collection of novel data science tools and techniques to help overcome such challenges in practice, including Exathlon -- the first public benchmark for explainable anomaly detection over high-dimensional time series.

    RELATED LINKS
    Exathlon: http://vldb.org/pvldb/vol14/p2613-tatbul.pdf, https://github.com/exathlonbenchmark/exathlon
    TSAD-Evaluator: https://arxiv.org/pdf/1803.03639.pdf, https://github.com/IntelLabs/TSAD-Evaluator
    Metro-Viz: https://dl.acm.org/doi/pdf/10.1145/3299869.3320247

    SHORT BIO
    Nesime Tatbul is a senior research scientist at Intel Labs and MIT, currently serving as an industry co-PI for MIT's Data Systems and AI Lab jointly funded by Intel, Google, and Microsoft. Previously, she served on the computer science faculty of ETH Zurich after receiving a Ph.D. degree from Brown University. Her research interests are broadly in large-scale data management systems and modern data-intensive applications, with a current focus on time series analytics and learned data systems. She has been an active member of the database research community for 20+ years, serving in various roles for the VLDB Endowment, ACM SIGMOD, and other organizations.

  • (10/28, 12:00pm): Azza Abouzied (NYU): Scalable Prescriptive Analytics in Database Systems details)

    TITLE
    Scalable Prescriptive Analytics in Database Systems

    ABSTRACT
    Currently, database systems do not natively support the many data processing needs of data-driven decision making, leaving experts to develop their own custom, ad hoc application-level solutions that are difficult to scale and may produce sub-optimal results. While many systems provide support for scalable descriptive analytics (like statistics and summaries of the raw data) and even some predictive analytics (such as forecasts), there is little support for prescriptive analytics, which searches for the best course of action given the available data. As we move from "what is the data?" to "what to do with it?,” we need to augment database systems with efficient computational problem-solving capabilities that take into consideration the inherent uncertainty of data and models. In this talk, I will explore some of the systems we built to integrate state-of-the-art solvers within the DBMS to scalably solve stochastic constrained optimization problems. I will also describe our work in building a scalable system to support sequential decision-making, and how this system is being used to help public health policy makers construct cost-effective policies that curb epidemics like COVID-19.

    PAPER LINKS
    https://dl.acm.org/doi/10.1145/3472749.3474794
    https://dl.acm.org/doi/10.1145/3318464.3389765
    http://packagebuilder.cs.umass.edu/papers/p576-brucato.pdf

    BIO
    Azza Abouzied is an associate professor of computer science at New York University Abu Dhabi. Her research focuses on designing intuitive data querying tools and combines techniques from various research fields such as HCI, machine learning, and database systems. In 2019, she won a VLDB test of time award for her work on HadoopDB. Her work on integrating decision-making support in database systems received a Best of VLDB recognition, a SIGMOD Research Highlight, a CACM research highlight and a Best VLDB demo award. She earned her doctoral degree from Yale in 2013. She spent a year as a visiting scholar at UC Berkeley.

  • (11/4, 12:00pm): Brian Hentschel (Harvard): Cerebral Data Structures details)

    TITLE
    Cerebral Data Structures

    ABSTRACT
    Data structures form the basis of all data-driven software applications and operations on data structures form a critical part of the overall cost of systems. For instance, just a singular data structure, hash tables, and in just a single language, C++, accounts for 2% of total CPU usage and 5% of total RAM usage across all of Google.
    We make the case for cerebral data structures, which use machine learning and statistical modeling at a high level to redesign base data structures in computer science. This redesign keeps the core structure of classical data structures; by keeping this core, the operations remain simple and therefore efficient as well as robust to shifting workloads. Additionally, keeping this core allows for theoretical arguments about data structure performance. At the same time, machine learning and statistics are used to transfer properties of the workload and data into the data structure through a redesign, achieving better expected performance than classic designs.
    As well as discussing the general approach, we present in detail two applications of this approach. First, we present Stacked Filters, which uses workload skew to produce 100X lower false positive rates for filter data structures. Second, we present Entropy-Learned Hashing, which separates hashing speed from input size, producing 10X faster hash function evaluation and 4X faster hash tables than state-of-the-art approaches from Facebook and Google.

    RELATED WORK
    https://stratos.seas.harvard.edu/files/stratos/files/stackedfilters_vldb2021_extended_version.pdf

    SHORT BIO
    Brian Hentschel is a Ph.D. candidate at Harvard University advised by Stratos Idreos. He is interested in blending statistics and machine learning with classical techniques to improve computer systems. His research has been awarded the SIGMOD research highlight award as well a best paper from EDBT. He earned his BA in mathematics and computer science at Pomona College and has previously spent time at Microsoft, Amazon, LinkedIn, and IBM.

  • (11/5, 12:00pm): Kuldeep Meel (National University of Singapore): Statistical Learning and Symbolic Reasoning: Better Together for Software 2.0 details)

    TITLE
    Statistical Learning and Symbolic Reasoning: Better Together for Software 2.0

    ABSTRACT
    The advent of personal computing has made manual tabling of data obsolete in today's context. What about the future where software engineers would balk at the prospect of the data-driven manual design of heuristics?. Well, we think such a future is on the horizon: in this talk, I will discuss our ambitious project, CrystalBall, that seeks to automate the design of heuristics in modern SAT solvers. The past two decades have witnessed how SAT went from the dreaded Non-deterministic Polytime (NP)-complete problem to Not a Problem (NP) for formal methods and AI community. Such progress was fueled by experts' careful design of heuristics, and every year's SAT competition sees experts continue to spend hundreds of hours tuning these heuristics. I will discuss how such heuristics can be learned automatically by statistical techniques and glimpses of the future where experts only focus on high-level ideas.

    RELEVANT PAPER
    https://www.comp.nus.edu.sg/~meel/Papers/sat19skm.pdf

    BLOG
    https://www.msoos.org/2019/06/crystalball-sat-solving-data-gathering-and-machine-learning/

    BIO
    Kuldeep Meel holds the NUS Presidential Young Professorship in the School of Computing at the National University of Singapore. His research interests lie at the intersection of Formal Methods and Artificial Intelligence. He is a recipient of the 2019 NRF Fellowship for AI (accompanied with S$2.5 million funding) and was named AI’s 10 to Watch by IEEE Intelligent Systems in 2020. His work received the 2020 Amazon Research Award, the 2018 Ralph Budd Award for Best PhD Thesis in Engineering, 2014 Outstanding Masters Thesis Award from Vienna Center of Logic and Algorithms and Best Student Paper Award at CP 2015. His CP-18 paper, CAV-20, and PODS-21 papers received IJCAI-19 Sister conferences best paper award track invitation, "Best of PODS-21" invite from from ACM TODS, and "Best Papers of CAV-20" invite from FMSD journal respectively.

  • (11/11, 12:00pm): Xiao Hu (Duke): Enumeration Algorithms for Conjunctive Queries with Projection details)

    TITLE
    Enumeration Algorithms for Conjunctive Queries with Projection

    ABSTRACT
    We investigate the enumeration of query results for an important subset of CQs with projections, namely star and path queries. The task is to design data structures and algorithms that allow for efficient enumeration with delay guarantees after a preprocessing phase. Our main contribution is a series of results based on the idea of interleaving precomputed output with further join processing to maintain delay guarantees, which may be of independent interest. In particular, we design combinatorial algorithms that provide instance-specific delay guarantees in nearly linear preprocessing time. These algorithms improve upon the currently best known results. Further, we show how existing results can be improved upon by using fast matrix multiplication. We also present new results involving tradeoff between preprocessing time and delay guarantees for enumeration of path queries that contain projections. CQs with projection where the join attribute is projected away is equivalent to boolean matrix multiplication. Our results can therefore be also interpreted as sparse, output-sensitive matrix multiplication with delay guarantees.

    RELATED WORK
    [1] Enumeration Algorithms for Conjunctive Queries with Projection, Shaleen Deep, Xiao Hu, Paraschos Koutris, ICDT 2021. https://arxiv.org/abs/2101.03712
    [2] Trade-offs in Static and Dynamic Evaluation of Hierarchical Queries, Ahmet Kara, Milos Nikolic, Dan Olteanu, Haozhe Zhang, PODS 2020. https://arxiv.org/abs/1907.01988
    [3] On acyclic conjunctive queries and constant delay enumeration, Guillaume Bagan, Arnaud Durand, and Etienne Grandjean, CSL 2007. https://grandjean.users.greyc.fr/Recherche/PublisGrandjean/EnumAcyclicCSL07.pdf
    [4] Structural tractability of enumerating csp solutions, G. Greco and F. Scarcello, Constraints 2013. https://arxiv.org/abs/1005.1567

    SHORT BIO
    Xiao Hu is a postdoctoral associate in the Department of Computer Science at Duke University, co-supervised by Prof. Pankaj Agarwal and Prof. Jun Yang. Prior to that, she received her Ph.D. in Computer Science and Engineering from HKUST, and a BE degree in Computer Software from Tsinghua University. Her research has focused on studying fundamental problems in database theory and their implications to practical systems. Her work on massively parallel join algorithms has been invited to ACM Transactions on Database Systems as a research paper, as well as a feature article in the Database Principles Column in SIGMOD Record.

  • (11/18, 12:00pm): Stijn Vansummeren (Hasselt University): General Dynamic Yannakakis: Conjunctive Queries with Theta Joins Under Updates details)

    TITLE
    General Dynamic Yannakakis: Conjunctive Queries with Theta Joins Under Updates

    ABSTRACT
    The ability to efficiently analyze changing data is a key requirement of many real-time analytics applications. In database terms, such analysis is closely related to the problem of Incremental View Maintenance (IVM) where we are asked to maintain the result Q(db) of a query Q on a database db under updates.
    In this talk I will summarize selected algorithmic ideas of the IVM literature and illustrate that they inherently rely on either (1) recomputation of query subresults or (2) materialization of subresults. Both have their drawbacks: recomputation is detrimental to update latency while materialization may waste both memory and be detrimental to latency.
    By moving to the framework of query evaluation with small delay, introduced by the seminal paper of Baghan, Durand, and Grandjean in 2007, we are able to circumvent both drawbacks. In particular, for the so called q-hierarchical queries, it is possible to (1) represent query results and subresults succinctly, in space at most linear in the database; (2) enumerate the results from this representation efficiently: with constant delay; and (3) maintain the representation efficiently: in constant time for updates of constant size.
    I will illustrate how we can obtain these features by modifying Yannakakis' seminal algorithm for evaluating Acyclic Conjunctive Queries. The resulting algorithm, which is called Dynamic Yannakakis, can be generalized to not only apply to Acyclic Conjunctive Queries (which only contain equality joins by definition), but also to such queries endowed with theta-joins (in particular: inequality predicates like A < B).
    This talk summarizes our results published in SIGMOD 2017, VLDB 2018, and VLDB Journal 2020, which received the SIGMOD Research Highlights Award 2018.

    RELATED WORK
    * On acyclic conjunctive queries and constant delay enumeration, Guillaume Bagan, Arnaud Durand, and Etienne Grandjean, CSL 2007.
    https://grandjean.users.greyc.fr/Recherche/PublisGrandjean/EnumAcyclicCSL07.pdf
    * General dynamic Yannakakis: conjunctive queries with theta joins under updates. Muhammad Idris, Martín Ugarte, Stijn Vansummeren, Hannes Voigt, Wolfgang Lehner. VLDB J. 29(2-3): 619-653 (2020).
    https://link.springer.com/article/10.1007/s00778-019-00590-9
    * Efficient Query Processing for Dynamically Changing Datasets. Muhammad Idris, Martín Ugarte, Stijn Vansummeren, Hannes Voigt, Wolfgang Lehner. SIGMOD Rec. 48(1): 33-40 (2019).
    https://sigmodrecord.org/?smd_process_download=1&download_id=3073
    * Conjunctive Queries with Inequalities Under Updates. Muhammad Idris, Martín Ugarte, Stijn Vansummeren, Hannes Voigt, Wolfgang Lehner. Proc. VLDB Endow. 11(7): 733-745 (2018).
    http://www.vldb.org/pvldb/vol11/p733-idris.pdf
    * The Dynamic Yannakakis Algorithm: Compact and Efficient Query Processing Under Updates. Muhammad Idris, Martín Ugarte, Stijn Vansummeren. SIGMOD Conference 2017: 1259-1274. https://martinugarte.com/media/pdfs/main_pDxeVno.pdf

    SHORT BIO
    Stijn Vansummeren is research professor of Data Management and Data Wrangling at the Data Science Institute of Hasselt University, Belgium. His research focuses on large scale data management, with as overarching theme the study, development, and application of formal approaches to wrangling, querying, and analyzing data at scale. Stijn Vansummeren obtained his PhD in 2005 from Hasselt University, where he was a PhD fellow of the Research Foundation Flanders (FWO) in the Databases and Theoretical Computer Science group, advised by Jan Van den Bussche. From 2005-2009, he was a postdoctoral fellow of the FWO in the same group. In 2009, he joined the Université Libre de Bruxelles (ULB), Belgium where he was associate professor of Computer Science at the Engineering faculty, until September 2020.
    His research has been awarded with the ACM SIGMOD Research Highlights Award (2018), a best paper award at WebDB (2016), and a best paper nomination award at the World Wide Web (WWW) conference (2008).

  • (12/2, 12:00pm): Asterios Katsifodimos (TU Delft): Valentine: Evaluating Matching Techniques for Dataset Discovery details)

    TITLE
    Valentine: Evaluating Matching Techniques for Dataset Discovery

    ABSTRACT
    Data scientists today search large data lakes to discover and integrate datasets. In order to bring together disparate data sources, dataset discovery methods rely on some form of schema matching: the process of establishing correspondences between datasets. This process has been traditionally taken care with schema matching techniques. After 20 years of research in schema matching, we are still missing a benchmark for schema matching, as well as proper datasets, and proper evaluation metrics! In this talk I will present Valentine, is an extensible open-source experiment suite to execute and organize large-scale automated matching experiments on tabular data. Valentine now includes implementations of 7 seminal schema matching methods that we either implemented from scratch (due to absence of open source code) or imported from open repositories. Finally, Valentine offers a data fabrication toolbox for constructing testing datasets with ground truth. I will conclude my talk with insights from a very large set of experiments we have been performing at TU Delft, focusing on the strengths and weaknesses of existing techniques, that can serve as a guide for employing schema matching in future dataset discovery methods.

    LINKS:
    Original Paper: https://ieeexplore.ieee.org/abstract/document/9458921
    Demo: http://www.vldb.org/pvldb/vol14/p2871-koutras.pdf
    Project: https://delftdata.github.io/valentine/
    Code: https://github.com/delftdata/valentine

    SHORT BIO
    Asterios Katsifodimos is an assistant professor the Delft University of Technology. Before TU Delft, Asterios worked at the SAP Innovation Center, and as a postdoc at TU Berlin. Asterios holds a PhD from INRIA & University of Paris 11 in France. Asterios currently works on scalable stream processing and data integration.

  • (12/3, 8:00am): Qichen Wang (Hong Kong UST): Maintaining Acyclic Foreign-Key Joins under Updates details)

    TITLE
    Maintaining Acyclic Foreign-Key Joins under Updates

    ABSTRACT
    In this paper, we study the problem of incrementally maintaining the query results of acyclic joins under updates, i.e., insertion and deletion of tuples to any of the relations. Prior work has shown that this problem is inherently hard, requiring at least $\Omega(|db|^{{1\over 2} -\epsilon})$ time per update, where $|db|$ is the size of the database, and $\epsilon > 0$ can be any small constant. However, this negative result holds only on adversarially constructed update sequences; on the other hand, most real-world update sequences are "nice", nowhere near these worst-case scenarios.
    We introduce a measure $\lambda$, which we call the enclosureness of the update sequence, to more precisely characterize its intrinsic difficulty. We present an algorithm to maintain the query results of any acyclic join in $O(\lambda)$ time amortized, on any update sequence whose enclosureness is $\lambda$. This is complemented with a lower bound of $\Omega(\lambda^{1-\epsilon})$, showing that our algorithm is essentially optimal with respect to $\lambda$. Moreover, the new measurement also recovers prior lower bounds on static as well as dynamic query evaluation. Next, using this algorithm as the core component, we show how all the 22 queries in the TPC-H benchmark can be supported in $\tilde{O}(\lambda)$ time. Finally, based on the algorithms developed, we built a continuous query processing system on top of Flink, and experimental results show that our system outperforms previous ones significantly.

    RELATED WORKS
    * Wang, Qichen, and Ke Yi. "Maintaining Acyclic Foreign-Key Joins under Updates." In SIGMOD 2020.
    https://www.cse.ust.hk/~yike/sigmod20.pdf
    * Chirkova, Rada, and Jun Yang. "Materialized views." Foundations and Trends in Databases 4, no. 4, 2011.
    http://db.cs.duke.edu/papers/fntdb12-ChirkovaYang-mat_views.pdf
    * Idris, Muhammad, Martín Ugarte, and Stijn Vansummeren. "The dynamic Yannakakis algorithm: Compact and efficient query processing under updates." In PVLDB 2017.
    https://dl.acm.org/doi/pdf/10.1145/3035918.3064027
    * Ahmad, Yanif, Oliver Kennedy, Christoph Koch, and Milos Nikolic. "DBToaster: Higher-order Delta Processing for Dynamic, Frequently Fresh Views," PVLDB 2012.
    http://vldb.org/pvldb/vol5/p968_yanifahmad_vldb2012.pdf
    * Berkholz, Christoph, Jens Keppeler, and Nicole Schweikardt. "Answering conjunctive queries under updates." In PODS 2017.
    https://dl.acm.org/doi/pdf/10.1145/3034786.3034789

    SHORT BIO
    Qichen Wang is a PhD candidate in the Department of Computer Science and Engineering at Hong Kong University of Science and Technology, supervised by Prof. Ke Yi. Prior to that, he received a BE degree in Computer Science from Zhejiang University. His research has focused on studying optimization techniques for query evaluation, from theoretical aspects to practical implementations. He is also interested in parallel and distributed algorithms, and algorithms for data streams.

  • (12/16, 12:00pm): Bill Howe (University of Washington): Data-Centric AI: Reuse, Integration, and Synthesis of Weakly Structured Data details)

    TITLE
    Data-Centric AI: Reuse, Integration, and Synthesis of Weakly Structured Data

    ABSTRACT
    What good is a collection of 1000s of loosely related tables? Repositories of weakly structured datasets -- datasets with rows and columns but few other guarantees -- have been built by organizations, cities, and in scientific domains, but the anticipated value of these systems has been difficult to realize. Urban open data repositories, data marketplaces, enterprise intranets, and scientific repositories are motivated by the idea that the data can be reused and integrated, but technical friction, limited provenance, data quality issues, limited search and discovery, and governance restrictions make widespread reuse difficult.
    Across a number of projects in the last decade, we have considered ways to make these repositories more valuable by enabling global reasoning over collections of locally defined datasets. I'll discuss a few projects in this space, including SQLShare, where we aimed to share and reuse the verbs (SQL queries) as well as the nouns (tables), ClaimJumper, where we aimed to enable repository-wide claim verification and statistical inference, and equitensors, where we aim to learn integrated, reusable, and unbiased representations from data with a shared time and space domain. Ultimately, our goal is to understand the limits of using diverse, undercurated, weakly structured datasets to train better models. I'll finish with some ideas for synthesizing data to enable better training and evaluation of ML models in specialized application domains where data is scarce.

    RELATED WORK
    https://faculty.washington.edu/billhowe/publications/pdfs/jain2016sqlshare.pdf
    https://faculty.washington.edu/billhowe/publications/pdfs/grechkin17ezlearn.pdf
    https://dl.acm.org/doi/abs/10.1145/3448016.3452777 (sorry for the paywall)
    bonus:
    https://faculty.washington.edu/billhowe//publications/pdfs/jain_cidr_2019.pdf

    BIO
    Bill Howe is Associate Professor in the Information School and Adjunct Associate Professor in the Allen School of Computer Science & Engineering and the Department of Electrical Engineering. His research interests are in data management, machine learning, and visualization, particularly as applied in the physical and social sciences. As Founding Associate Director of the UW eScience Institute, Dr. Howe played a leadership role in the Moore-Sloan Data Science Environment program through a $32.8 million grant awarded jointly to UW, NYU, and UC Berkeley, and founded UW’s Data Science for Social Good Program. With support from the MacArthur Foundation, NSF, and Microsoft, Howe directs UW’s participation in the Cascadia Urban Analytics Cooperative. He founded the UW Data Science Masters Degree, serving as its inaugural Program Chair, and created a first MOOC on data science that attracted over 200,000 students. His research has been featured in the Economist and Nature News, and he has authored award-winning papers in conferences across data management, machine learning, and visualization. He has a Ph.D. in Computer Science from Portland State University and a Bachelor’s degree in Industrial & Systems Engineering from Georgia Tech.

  • (12/17, 12:00pm): Luana Ruiz (UPenn): Graphon Signal Processing details)

    TITLE
    Graphon Signal Processing

    ABSTRACT
    Graphons are infinite-dimensional objects that represent the limit of convergent sequences of graphs as their number of nodes goes to infinity. This paper derives a theory of graphon signal processing centered on the notions of graphon Fourier transform and linear shift invariant graphon filters, the graphon counterparts of the graph Fourier transform and graph filters. It is shown that for convergent sequences of graphs and associated graph signals: (i) the graph Fourier transform converges to the graphon Fourier transform when the graphon signal is bandlimited; (ii) the spectral and vertex responses of graph filters converge to the spectral and vertex responses of graphon filters with the same coefficients. These theorems imply that for graphs that belong to certain families, i.e., that are part of sequences that converge to a certain graphon, graph Fourier analysis and graph filter design have well defined limits. In turn, these facts extend applicability of graph signal processing to graphs with large number of nodes — since signal processing pipelines designed for limit graphons can be applied to finite graphs — and to dynamic graphs — since we can relate the result of SP pipelines designed for different graphs from the same convergent graph sequence.

    RELATED WORK
    * Graphon Signal Processing, IEEE Transactions on Signal Processing ( Volume: 69), 2021.
    https://arxiv.org/pdf/2003.05030
    * Graphon Neural Networks and the Transferability of Graph Neural Networks
    https://proceedings.neurips.cc/paper/2020/file/12bcd658ef0a540cabc36cdf2b1046fd-Paper.pdf
    * Transferability Properties of Graph Neural Networks
    https://arxiv.org/pdf/2112.04629.pdf

    SHORT BIO
    Luana Ruiz received the B.Sc. degree in electrical engineering from the University of São Paulo, Brazil, and the M.Sc. degree in electrical engineering from the École Supérieure d'Electricité (now CentraleSupélec), France, in 2017. She is currently a Ph.D. candidate with the Department of Electrical and Systems Engineering at the University of Pennsylvania. Her research interests are in the areas of large-scale graph machine learning and the mathematical foundations of deep learning. She was awarded an Eiffel Excellence scholarship from the French Ministry for Europe and Foreign Affairs between 2013 and 2015, nominated an iREDEFINE fellow in 2019 and a MIT EECS Rising Star in 2021, and received best student paper awards at the European Signal Processing Conference (EUSIPCO) in 2019 and 2021.

Summer 2021 (Wolfgang)

  • (5/6, 12:00pm): Ioannis Liagouris (Boston University): Secrecy: Secure collaborative analytics on secret-shared data details)

    TITE
    Secrecy: Secure collaborative analytics on secret-shared data

    ABSTRACT
    In this talk I will present Secrecy, a new relational framework for secure collaborative analytics. Secrecy is based on replicated secret sharing and allows composing and executing end-to-end oblivious queries under secure multi-party computation (MPC). Secrecy’s core novelty is a set of logical and physical optimizations that effectively reduce query execution costs while retaining the full security guarantees of MPC. We evaluate Secrecy using real and synthetic queries from several application areas. Our experiments demonstrate that the optimizations we propose can improve query performance by orders of magnitude, enabling Secrecy to outperform state-of-the-art frameworks and scale to much larger datasets than those reported in prior works.

    RELATED WORK
    Secrecy: Secure collaborative analytics on secret-shared data John Liagouris, Vasiliki Kalavri, Muhammad Faisal, Mayank Varia
    https://arxiv.org/abs/2102.01048

    BIO
    John Liagouris is a research scientist at the Hariri Institute for Computing and an adjunct assistant professor at Boston University. His research interests lie in distributed systems and databases. Before joining BU, he was a visiting scholar at the RISELab, UC Berkeley, a senior researcher at the Systems Group, ETH Zurich, a visiting research fellow at the University of Hong Kong (HKU), and a research assistant at the “Athena” Research Center, Greece. John obtained his PhD from NTU Athens, Greece.

  • (6/2, 1:00pm): Fabrizio Silvestri (Sapienza University of Rome): Neural databases: Natural language generation meets databases details)

    TITE
    Neural databases: Natural language generation meets databases

    ABSTRACT
    We asked ourselves a question: can we build a database management system that doesn’t rely on the fundamental concept of a schema. In recent years, neural networks have shown impressive performance gains on long-standing AI problems, and in particular, answering queries from natural language text. These advances raise the question of whether they can be extended to a point where we can relax the fundamental assumption of database management, namely, that our data is represented as fields of a predefined schema. We present a first step in answering that question and we describe NeuralDB: a database system with no predefined schema. In NeuralDB updates and (select) queries are given in natural language. We develop query processing techniques that build on the primitives offered by the state-of-the-art Natural Language Processing methods. We begin by demonstrating that at the core, recent NLP transformers, powered by pre-trained language models, can answer select-project-join queries if they are given the exact set of relevant facts. However, they cannot scale to non-trivial databases and cannot perform aggregation queries. Based on these findings, we describe a NeuralDB architecture that runs multiple Neural Select Project Join (SPJ) operators in parallel, each with a set of database sentences that can produce one of the answers to the query. The result of these operators is fed to an aggregation operator if needed. We describe an algorithm that learns how to create the appropriate sets of facts to be fed into each of the Neural SPJ operators. Importantly, this algorithm can be trained by the Neural SPJ operator itself. We experimentally validate the accuracy of NeuralDB and its components, showing that we can answer queries over thousands of sentences with very high accuracy.

    RELATED WORK
    VLDB 2021: From natural language processing to neural databases
    James Thorne, Majid Yazdani, Marzieh Saeidi, Fabrizio Silvestri, Sebastian Riedel, Alon Halevy
    http://www.vldb.org/pvldb/vol14/p1033-thorne.pdf
    https://arxiv.org/pdf/2010.06973

    BIO
    Fabrizio Silvestri is a Full Professor at the Computer Engineering Dept. of the Sapienza University of Rome. Formerly a Research Scientist at Facebook AI in London, his interests are in AI applied to integrity-related problems and the application of Natural Language Processing. In the past, he has worked on web search research, and in particular, his specialization is building systems to better interpret queries s from search users. Prior to Facebook, Fabrizio was a principal scientist at Yahoo where he has worked on sponsored search and native ads within the Gemini project. Fabrizio holds a Ph.D. in Computer Science from the University of Pisa, Italy where he studied problems related to Web Information Retrieval with a particular focus on Efficiency-related problems like Caching, Collection Partitioning, and Distributed IR in general.

    This is part of the the Experiential AI Distinguished Lectures Series
    Register here: https://eaidistinguished.splashthat.com/

  • (6/16, 1:00pm): Ricardo Baeza-Yates (Northeastern University): Ethics in AI: A Challenging Task details)

    TITE
    Ethics in AI: A Challenging Task

    ABSTRACT
    In the first part of this seminar, we will cover five current specific challenges through examples: (1) discrimination (e.g., facial recognition, justice, sharing economy, language models); (2) phrenology (e.g., biometric based predictions); (3) unfair digital commerce (e.g., exposure and popularity bias); (4) stupid models (e.g., Signal, minimal adversarial AI) and (5) indiscriminate use of computing resources (e.g., large language models). These examples do have a personal bias but set the context for the second part where we will address four generic challenges: (1) too many principles (e.g., principles vs. techniques), (2) cultural differences (e.g., Christian vs. Muslim); (3) regulation (e.g., privacy, antitrust) and (4) our cognitive biases. We will wrap up the seminar by discussing what we can do to address these challenges in the near future.

    RELATED WORK
    Towards intellectual freedom in an AI Ethics Global Community (Opinion paper). AI and Ethics. 2021.
    https://link.springer.com/article/10.1007/s43681-021-00052-5

    BIO
    Ricardo Baeza-Yates is a Research Professor at the Institute for Experiential AI of Northeastern University. Before he was the CTO of NTENT, a semantic search technology company based in California and prior to these roles, he was VP of Research at Yahoo Labs, based in Barcelona, Spain, and later in Sunnyvale, California, from 2006 to 2016. He is co-author of the best-seller Modern Information Retrieval textbook published by Addison-Wesley in 1999 and 2011 (2nd ed), that won the ASIST 2012 Book of the Year award. From 2002 to 2004 he was elected to the Board of Governors of the IEEE Computer Society and between 2012 and 2016 was elected to the ACM Council. In 2009 he was named ACM Fellow and in 2011 IEEE Fellow, among other awards and distinctions. He obtained a Ph.D. in CS from the University of Waterloo, Canada, in 1989, and his areas of expertise are web search and data mining, information retrieval, bias and ethics on AI, data science and algorithms in general.

    Regarding the topic of the seminar, he is actively involved as an expert in many initiatives, committees, and advisory boards related to Responsible AI around the world including: Global AI Ethics Consortium, Global Partnership on AI, IADB's fAIr LAC Initiative (Latin America and the Caribbean), Council of AI (Spain) and ACM's Technology Policy Subcommittee on AI and Algorithms (USA). He is also a co-founder of OptIA in Chile - a NGO devoted to algorithmic transparency and inclusion - and a member of the editorial committee of the new AI and Ethics journal where he co-authored an article highlighting the importance of research freedom on ethical AI.

    This is part of the the Experiential AI Distinguished Lectures Series
    Register here: https://experientialaids2.splashthat.com/

  • (7/22, 12:00pm): Saket Gurukar (Ohio State University): MILE: A multi-level framework for scalable graph embedding details)

    TITLE
    MILE: A Multi-Level Framework for Scalable Graph Embedding

    ABSTRACT
    Recently there has been a surge of interest in designing graph embedding methods. Few, if any, can scale to a large-sized graph with millions of nodes due to both computational complexity and memory requirements. In this paper, we relax this limitation by introducing the MultI-Level Embedding (MILE) framework -- a generic methodology allowing contemporary graph embedding methods to scale to large graphs. The proposed MILE framework is agnostic to the underlying graph embedding techniques and can be applied to many existing graph embedding methods without modifying them. We employ our framework on several popular graph embedding techniques and experimental results on five large-scale datasets demonstrate that MILE significantly boosts the speed (order of magnitude) of graph embedding while generating embeddings of better quality. MILE can comfortably scale to a graph with 9 million nodes and 40 million edges, on which existing methods run out of memory or take too long to compute on a modern workstation.

    RELATED PAPERS
    * Liang, Jiongqian, Saket Gurukar, and Srinivasan Parthasarathy. "Mile: A multi-level framework for scalable graph embedding." ICWSM '21.
    https://arxiv.org/abs/1802.09612
    * Gurukar, Saket, et al. "Network representation learning: Consolidation and renewed bearing." arXiv preprint arXiv:1905.00987 (2019).
    https://arxiv.org/abs/1905.00987

    SHORT BIO
    Saket Gurukar is a Ph.D. candidate in CSE Department at The Ohio State University. His research interests include Network Representation Learning and Recommendation Systems. He has published papers in KDD, SIGMOD, ICWSM conferences. Prior to joining the Ph.D. program, he worked at IBM Research Labs, India as a Research Software Engineer and has completed his masters from IIT Madras.

  • (7/23, 12:00pm): Nikos Tziavelis (Northeastern University): Beyond Equi-joins: Ranking, Enumeration and Factorization details)

    TITLE
    Beyond Equi-joins: Ranking, Enumeration and Factorization

    ABSTRACT
    We study theta-joins in general and join predicates with conjunctions and disjunctions of inequalities in particular, focusing on ranked enumeration where the answers are returned incrementally in an order dictated by a given ranking function. Our approach achieves strong time and space complexity properties: with 𝑛 denoting the number of tuples in the database, we guarantee for acyclic full join queries with inequality conditions that for every value of 𝑘, the 𝑘 top-ranked answers are returned in O(𝑛 polylog𝑛 + 𝑘 log𝑘) time. This is within a polylogarithmic factor of the best known complexity for equi-joins and even of O(𝑛 + 𝑘), the time it takes to look at the input and return 𝑘 answers in any order. Our guarantees extend to join queries with selections and many types of projections, such as the so-called free-connex queries. Remarkably, they hold even when the entire output is of size 𝑛^ℓ for a join of ℓ relations. The key ingredient is a novel O(𝑛 polylog𝑛)-size factorized representation of the query output, which is constructed on-the-fly for a given query and database. In addition to providing the first non-trivial theoretical guarantees beyond equi-joins, we show in an experimental study that our ranked-enumeration approach is also memory-efficient and fast in practice, beating the running time of state-of-the-art database systems by orders of magnitude.

    RELATED PAPER AND LINKS
    "Beyond Equi-joins: Ranking, Enumeration and Factorization", Tziavelis, Gatterbauer, Riedewald, PVLDB 2021
    https://arxiv.org/pdf/2101.12158
    https://northeastern-datalab.github.io/anyk/
    https://db.khoury.northeastern.edu/activities/

    BIO
    Nikolaos Tziavelis is a 3rd year PhD student at the Khoury College of Computer Sciences of Northeastern University. His research interests lie in algorithms for database query processing and distributed computing. He holds a MEng in Electrical and Computer Engineering from the National Technical University of Athens.

Spring 2021 (Wolfgang)

  • (1/20, 4:00pm): Anna Fariha (UMass Amherst): Enhancing Usability and Explainability of Data Systems details)

    TITLE
    Enhancing Usability and Explainability of Data Systems

    ABSTRACT
    The recent growth of data science expanded its reach to an ever-growing user base of non-experts, increasing the need for democratization and transparency in these systems. Democratization demands that a system can be used by people with different skills and backgrounds alike. Transparency requires explainability that helps the users understand and trust the system function, especially when unexpected behavior occurs. Unfortunately, most existing data systems offer limited usability and support for explanations: these systems are usable only by experts with sound technical skills, and even experts are hindered by the lack of transparency into the inner workings of the systems. In this talk, I will talk about usability-enhancing systems, built in the programming-by-example paradigm, in two different settings: querying relational databases and personalized document summarization. I will then talk about a new data-profiling primitive that can characterize data points for which an ML model is likely to produce untrustworthy predictions. I will then briefly touch upon an explanation framework that can explain causes of non-deterministic software failures. Finally, I will talk about my future research agenda on how I plan to expand my current research to support a diverse group of users—ranging from end users to data scientists.

    PAPERS
    1. Anna Fariha, Alexandra Meliou: Example-Driven Query Intent Discovery: Abductive Reasoning using Semantic Similarity. PVLDB 2019. https://afariha.github.io/papers/SQuID_VLDB_2019.pdf
    2. Anna Fariha, Suman Nath, Alexandra Meliou: Causality-Guided Adaptive Interventional Debugging. SIGMOD 2020. https://afariha.github.io/papers/AID_SIGMOD_2020.pdf
    3. Anna Fariha, Ashish Tiwari, Arjun Radhakrishna, Sumit Gulwani, Alexandra Meliou: Conformance Constraint Discovery: Measuring Trust in Data-Driven Systems. SIGMOD 2021 (to appear). https://afariha.github.io/papers/Conformance_Constraints_SIGMOD_2021.pdf

    BIO
    Anna Fariha is a Ph.D. candidate in the College of Information and Computer Sciences at the University of Massachusetts, Amherst, advised by Professor Alexandra Meliou. While her primary area of research revolves around data management, the application areas of her research have been interdisciplinary—spanning from program synthesis and software engineering to machine learning, natural language processing, and human-computer interaction. She is interested in designing mechanisms for enhancing system usability and developing intelligent tools towards boosting productivity and agility for a diverse group of users—ranging from end users to data scientists. The outcome of her research has been published in SIGMOD and VLDB, and her work has been recognized by several awards, including Microsoft research dissertation grant 2020 and VLDB 2020 best demonstration runner-up award.

  • (1/21, 12:00pm): Prashant Pandey (Berkeley): Methods for Indexing and Searching Large-Scale Genomic Data details)

    TITLE
    Methods for Indexing and Searching Large-Scale Genomic Data

    ABSTRACT
    The ability to efficiently index and query large-scale genomic data is critical to achieving the full potential of many medical and scientific applications such as personalized medicine and population-level disease analysis. In this talk, I will describe two of my recent works on methods for large-scale indexing and searching on genomic data.
    First, Order Min Hash (OMH), a locality-sensitive hash for edit distance. This method is a refinement of the minHash LSH used to approximate the Jaccard similarity, in that OMH, is sensitive not only to the k-mer (length-k subsequence) contents of the sequences but also to the relative order of the k-mers in the sequences. We show that OMH is a better proxy for edit distance compared to the minHash LSH.
    Second, VariantStore, an index for large-scale genomic variant search. The genomic variant search involves querying variants across thousands of coordinate systems which makes this problem computationally intensive and hard to scale to a large number of samples. VariantStore is a system for efficiently indexing and querying genomic variants and their sequences in either the reference or sample-specific coordinate systems. It represents genomic variants as a variation graph and builds a light-weight position index on the graph to perform variant queries. We show the scalability of VariantStore by indexing variants from the TCGA project containing more than 8K samples while other systems could not scale beyond 2500 samples.

    RELATED PAPERS
    https://www.biorxiv.org/content/10.1101/2019.12.24.888297v2
    https://academic.oup.com/bioinformatics/article/35/14/i127/5529166

    BIO
    Prashant Pandey is a Postdoctoral Research Fellow at Lawrence Berkeley Lab and University of California Berkeley working with Prof. Kathy Yelick and Prof. Aydin Buluc. Prior to that, he spent one year as a postdoc at Carnegie Mellon University (CMU) working with Prof. Carl Kingsford. He obtained his Ph.D. in 2018 in Computer Science at Stony Brook University and was co-advised by Prof. Michael Bender and Prof. Rob Johnson.
    His research interests lie at the intersection of systems and algorithms. He designs and builds tools backed by theoretically well-founded data structures for large-scale data management problems across computational biology, stream processing, and storage. He is also the main contributor and maintainer of multiple open-source software tools that are used by hundreds of users across academia and industry.

  • (1/28, 12:00pm): Paolo Boldi (University of Milan): Centralities in Network Analysis details)

    ABSTRACT
    Given a social network, which of its nodes are more central? This question was asked many times in sociology, psychology and computer science, and a whole plethora of _centrality measures_ were proposed to account for the importance of the nodes of a network. Also, many modern IR problems call for a mixture of classical text-retrieval methods with graph mining techniques, especially within the area of social-network search, where node centrality can play a crucial role. But what do existing centrality measures actually measure? To what extent do they agree? What is their meaning when they disagree? And further: which of them can be computed or approximated reasonably on the large (huge) graphs under observation? In my talk, I will try to give a comprehensive, historically and mathematically coherent account of the most important centrality measures from the literature, and provide the audience with a tentative taxonomy that can be of help in trying to understand what these indices attempt at measuring. Then, I will propose to assume an axiomatic approach, wherein I suggest some simple, basic properties a centrality measure should have, and again I will propose a new classification of such axioms. Finally, I will provide a sketch of a result we recently obtained about rank monotonicity of damped spectral indices, like PageRank or Katz’s index.

    RELATED PAPERS
    • Paolo Boldi, Sebastiano Vigna: Axioms for Centrality. Internet Math. 10(3-4): 222-262 (2014). https://arxiv.org/abs/1308.2140
    • Paolo Boldi, Alessandro Luongo, Sebastiano Vigna: Rank monotonicity in centrality measures. Netw. Sci. 5(4): 529-550 (2017). https://air.unimi.it/retrieve/handle/2434/527940/947157/rank.pdf
    • Steve Chien, Cynthia Dwork, Ravi Kumar, Daniel R. Simon, & D. Sivakumar (2004). Link evolution: Analysis and algorithms. Internet math., 1(3), 277–304 https://www.tandfonline.com/doi/abs/10.1080/15427951.2004.10129090

    BIO
    Paolo Boldi is full Professor at the Università degli Studi di Milano since 2015, where he is currently the co-ordinator of the PhD Program in Computer Science and of the Computer Science Degree. His main research topics are algorithms and data structures for big data, web crawling and indexing, graph compression, succinct and quasi-succinct data structures, distributed systems, anonymity and alternative models of computation. Recently, his works focused on problems related to complex networks (especially, the World-Wide Web, social networks and biological networks), a field where his research has also produced software tools used by many people working in the same area. He chaired many important conferences in this sector (e.g., WSDM, WWW, ACM WebScience), and published over one hundred papers; he was also recipient of three Yahoo! Faculty Awards and co-recipient of a Google Focused Award, and member of many EU research projects. He was keynote speaker at many conferences such as ECIR, SPIRE, MFCS, IIR and invited scholar at the Institut des Hautes Études Scientifiques.

  • (2/4, 12:00pm): Pierre Bourhis (CNRS): How to extract subwords efficiently? An enumeration approach details)

    TITLE:
    How to extract subwords efficiently? An enumeration approach

    ABSTRACT
    The problem of extracting subwords through a query has been an important task in different fields: Information Extraction, Complex Event Processing, Bio-computer Sciences, NLP. It has also been studied formally in Theory of Computer Science through the notion of transducer. With a new formalism introduced in the Information Extraction setting: the spanners and its core formalized by a type of automata, VA automaton, there has been a new interest on extracting subwords, in particular on efficient evaluation of such formalism. In this context, the main question that has been studied is the problem of enumerating the solutions efficiently, i.e computing in a compact manner the solutions through a preprocessing and then finding the next solution through this structure.
    In this talk, we present some recent results on enumeration for different formalisms where automata are non-deterministic [1] or when the answers are ranked via a cost function [2].

    BIBLIOGRAPHY
    [1] Antoine Amarilli, Pierre Bourhis, Stefan Mengel, Matthias Niewerth: Constant-Delay Enumeration for Nondeterministic Document Spanners. SIGMOD Rec. 49(1). https://arxiv.org/pdf/2003.02576
    [2] Pierre Bourhis, Alejandro Grez, Louis Jachiet and Cristian Riveros. Ranked enumeration of MSO logic on words. ICDT 2021. https://arxiv.org/pdf/2010.08042

    BIO
    Pierre Bourhis is a tenured CNRS researcher at UMR 9189 CRIStAL at Lille and teaches Data Sciences at Ecole Polytechnique. He works at the team SPIRALS, a joined team of INRIA and the University of Lille. His main interests are data and knowledge management, reasoning on logical rules, query optimization, explanation in knowledge processing and security of data. He obtained his PhD in Computer Sciences in 2011 under the supervision of Serge Abiteboul. Before coming to CRIStAL in 2013, he did a post-doctorat at University of Oxford with Prof. Michael Benedikt.

  • (2/18, 12:00pm): Aristotelis Leventidis (Northeastern University): DomainNet: Homograph Detection for Data Lake Disambiguation details)

    TITLE
    DomainNet: Homograph Detection for Data LakeDisambiguation

    ABSTRACT
    Modern data lakes are deeply heterogeneous in the vocabulary that is used to describe data. We study a problem of disambiguation in data lakes: how can we determine if a data value occurring more than once in the lake has different meanings and is therefore a homograph? While word and entity disambiguation have been well studied in computational linguistics, data management and data science, we show that data lakes provide a new opportunity for disambiguation of data values since they represent a massive network of interconnected values. We investigate to what extent this network can be used to disambiguate values. DomainNet uses network-centrality measures on a bipartite graph whose nodes represent values and attributes to determine, without supervision, if a value is a homograph. A thorough experimental evaluation demonstrates that state-of-the-art techniques in domain discovery cannot be re-purposed to compete with our method. Specifically, using a domain discovery method to identify homographs has a precision and a recall of 38% versus 69% with our method on a synthetic benchmark. By applying a network-centrality measure to our graph representation, DomainNet achieves a good separation between homographs and data values with a unique meaning. On a real data lake our top-200 precision is 89%.

    RELATED PAPER
    • DomainNet: Homograph Detection for Data LakeDisambiguation. Aristotelis Leventidis, Laura Di Rocco, Wolfgang Gatterbauer, Renée J. Miller, Mirek Riedewald. EDBT 2021. https://northeastern-datalab.github.io/table-as-query/download/EDBT21-DomainNet-Homograph-Detection.pdf
    https://northeastern-datalab.github.io/table-as-query/

  • (2/25, 12:00pm): Nikos Tziavelis (Northeastern): Beyond Equi-joins: Ranking, Enumeration and Factorization details)

    TITLE
    Beyond Equi-joins: Ranking, Enumeration and Factorization

    ABSTRACT
    How can we efficiently represent and enumerate the results of theta-join queries? Factorization techniques have found notable success as a compact representation scheme that allows for enumeration algorithms where the results are produced incrementally after a preprocessing phase. However, these factorized representations have mainly been limited so far to equi-joins. This talk will present factorization techniques for general theta-join queries where the join conditions between relations go beyond equality. These are particularly useful for the task of ranked enumeration, where the query results are returned in the order specified by a ranking function. Perhaps surprisingly, for acyclic queries with general DNFs of inequality predicates we give asymptotic guarantees that are within a polylogarithmic factor of the best-known guarantee for equi-joins. We also target cases that occur often in practice such as a single predicate or a band-join and further improve the succinctness of our representation.

    PREPRINT
    https://arxiv.org/abs/2101.12158
    https://northeastern-datalab.github.io/anyk/

    BIO
    Nikolaos Tziavelis is a 3rd year PhD student at the Khoury College of Computer Sciences of Northeastern University. His research interests lie in algorithms for database query processing and distributed computing. He holds a MEng in Electrical and Computer Engineering from the National Technical University of Athens.

  • (3/4, 12:00pm): Ariful Azad (Indiana University Bloomington): Computational Building Blocks for Machine Learning on Graphs details)

    TITLE
    Computational Building Blocks for Machine Learning on Graphs

    ABSTRACT
    A graph is a beautiful mathematical concept that can concisely model any interacting system such as protein interactions in organisms, chemical bonds in compounds, and friendships on social networks. Thus, machine learning on graphs for predicting edges, node features, and community structures plays an important role in computational biology, social science, neurology, and computational chemistry.
    In this talk, I will discuss computational building blocks needed to design graph machine learning algorithms including graph embedding, graph neural networks (GNNs) and graph visualization. Computationally, major steps in these algorithms can be mapped to sampled dense-dense matrix multiplication (SDDMM) and sparse-dense matrix multiplication (SpMM). We can even fuse these two operations into a single kernel called FusedMM that captures almost all computational patterns needed by popular graph embedding and GNN approaches. I will then discuss parallel algorithms for these kernels. As the performances of these linear-algebraic operations are bound by the memory bandwidth, we develop algorithms that minimize data movements from the main memory and utilize cache and registers as much as possible. We also auto tune these operations so that the same code can perform equally well on Intel, AMD, IBM Power9 and ARM Processors. The kernel-based design and tuned parallel implementations speed up end-to-end graph embedding and GNN algorithms by up to 28x over existing approaches based on the message passing paradigm. The take home message of this talk is that developing graph machine learning algorithms using efficient building blocks provides a cleaner interface to application developers and boosts the performance of end-to-end graph learning applications.

    RELATED WORK
    * FusedMM: A Unified SDDMM-SpMM Kernel for Graph Embedding and Graph Neural Networks
    https://arxiv.org/abs/2011.06391
    * Force2Vec: Parallel force-directed graph embedding
    https://arxiv.org/abs/2009.10035

    BIO
    Dr. Ariful Azad is an Assistant Professor of Intelligent Systems Engineering at Luddy School of Informatics, Computing, and Engineering in Indiana University (IU). He was a Research Scientist in the Computational Research Division at Lawrence Berkeley National Laboratory. Dr. Azad obtained Ph.D. from Purdue University and B.S. from Bangladesh University of Engineering and Technology. His research interests are in graph machine learning, sparse matrix algorithms, high-performance computing, and bioinformatics. His interdisciplinary research group strives to solve large-scale problems in genomics, earth science, and scientific computing.

  • (3/11, 12:00pm): Bertram Ludaescher (University of Illinois Urbana-Champaign): From Provenance Polynomials to Provenance Patterns details)

    TITLE
    From Provenance Polynomials to Provenance Patterns (work-in-progress with Sahil Gupta)

    ABSTRACT
    It is well known that the commutative semiring of integer polynomials N[X] provides a unifying framework (so-called how-provenance) for many earlier provenance approaches in databases. A provenance-annotated answer A′ reveals how output tuples depend on input tuples and thus can be used to answer strictly more questions than the original answer A without provenance. We propose to use provenance patterns, a closely related notion implicit in earlier work, to support answering additional questions, not answerable by A′ alone. We show that the pattern-enhanced answer A* can be easily obtained as a by-product of computing A or A’. Using a detailed running example, we illustrate questions that can and cannot be answered using A, A′, and A*, respectively.

    RELATED PAPERS
    * Karvounarakis, G., & Green, T. J. (2012). Semiring-annotated data: queries and provenance. ACM SIGMOD Record, 41(3), 5-14. https://dl.acm.org/doi/10.1145/2380776.2380778
    * Köhler, S., Ludäscher, B., & Smaragdakis, Y. (2012, September). Declarative Datalog Debugging for Mere Mortals. In International Datalog 2.0 Workshop (pp. 111-122). LNCS 7494, Springer, Berlin, Heidelberg. https://dl.acm.org/doi/abs/10.1007/978-3-642-32925-8_12

    BIO
    Bertram Ludäscher is a professor at the School of Information Sciences at the University of Illinois, Urbana-Champaign and a faculty affiliate with the National Center for Supercomputing Applications (NCSA) and the CS department at UIUC. Until 2014 he was a professor at the CS department at the University of California, Davis and at the UC Davis Genome Center. His research interests range from practical questions in scientific data and workflow management, to database theory, and knowledge representation & reasoning. Prior to his faculty appointments, he was a research scientist at the San Diego Supercomputer Center (SDSC) and an adjunct faculty at the CSE department at UCSD. He received his M.S. (Dipl.-Inform.) in CS from the University of Karlsruhe (K.I.T.), and his PhD (Dr. rer. nat.) from the University of Freiburg, both in Germany.

  • (3/18, 12:00pm): Roee Shraga (Technion): Cross-Domain Schema Matching using Deep Similarity Matrix Adjustment and Evaluation details)

    TITLE
    Cross-Domain Schema Matching using Deep Similarity Matrix Adjustment and Evaluation

    ABSTRACT
    Schema matching is a process that serves in integrating structured and semi-structured data. Being a handy tool in multiple contemporary business and commerce applications, it has been investigated in the fields of databases, AI, Semantic Web, and data mining for many years. The core challenge still remains the ability to create quality algorithmic matchers, automatic tools for identifying correspondences among data concepts (e.g., database attributes). In this work, we offer a novel post processing step to schema matching (dubbed ADnEV) that improves the final matching outcome without human intervention. ADnEV demonstrates the shift in the paradigm that dictates the roles of humans in data integration. We present a new mechanism, similarity matrix adjustment, to calibrate a matching result and propose an algorithm that manipulates, using deep neural networks, similarity matrices, created by state-of-the-art algorithmic matchers. ADnEV learns two models that iteratively adjust and evaluate the original similarity matrix. We empirically demonstrate the effectiveness of the proposed algorithmic solution for improving matching results, using real-world benchmark ontology and schema sets. We show that ADnEV can generalize into new domains without the need to learn the domain terminology, thus allowing cross-domain learning. We also show ADnEV to be a powerful tool in handling schemata which matching is particularly challenging. Finally, we show the benefit of using ADnEV in a related integration task of ontology alignment.

    RELATED WORK
    * Shraga, Gal, Roitman. ADnEV: cross-domain schema matching using deep similarity matrix adjustment and evaluation. PVLDB 2020. http://www.vldb.org/pvldb/vol13/p1401-shraga.pdf
    * Gal, Roitman, Shraga. Learning to rerank schema matches. TKDE 2019. https://ieeexplore.ieee.org/abstract/document/8944172
    * Gal, Shraga. Humans' role in-the-loop. ACM SIGMOD Blog. 2020. http://wp.sigmod.org/?p=3138

    BIO
    Roee Shraga is a Postdoctoral fellow at the Technion – Israel Institute of Technology, from which he received a PhD degree in 2020 in the area of Data Science. Roee has published more than a dozen papers in leading journals and conferences on the topics of data integration, human-in-the-loop, machine learning, process mining, and information retrieval. He is also a recipient of several PhD fellowships including the Leonard and Diane Sherman Interdisciplinary Fellowship (2017), the Daniel Excellence Scholarship (2019), and the Miriam and Aaron Gutwirth Memorial Fellowship (2020).

  • (4/8, 12:00pm): Christoph Riedl (Northeastern University): Spite is contagious in dynamic networks details)

    TITLE
    Spite is contagious in dynamic networks

    ABSTRACT
    Spite, costly behavior that harms others, presents an evolutionary puzzle: given that both the actor and recipient do worse, how could it emerge? We show that dynamically evolving interaction networks provide a novel explanation for the evolution of costly harm. Previous work has shown that anti-correlated interaction (e.g., negative assortment or negative relatedness) among behavioral strategies in populations can lead to the evolution of costly harm. We show that these approaches are blind to important features of interaction brought about by a co-evolution of network and behavior and that these features enable the emergence of spite. We analyze a new model in which agents can inflict harm on others at a cost to themselves, and simultaneously learn how to behave and with whom to interact. We find spite emerges reliably under a wide range of conditions. Our model reveals that when interactions occur in dynamic networks the population can exhibit correlated and anti-correlated behavioral interactions simultaneously, something not possible in standard models. In dynamic networks spite evolves due to transient and partial anti-correlated interaction, even when other behaviors are positively correlated and average degree of correlated interaction in the population is low.

    RELATED PAPERS
    1) Spite is Contagious - https://www.nature.com/articles/s41467-020-20436-1
    2) Avoiding the bullies: Resilience of cooperation among unequals - https://www.dropbox.com/s/3rto9cinnq3m3ia/FoleyEtAl2020-WP-Bullies-2020-09-07.pdf?dl=0
    3) Conflict and Convention in Dynamic Networks - https://royalsocietypublishing.org/doi/10.1098/rsif.2017.0835

    BIO
    Christoph Riedl is associate professor for Information Systems and Network Science at the D’Amore-McKim School of Business at Northeastern University. He hold a joint appointment with the Khoury College of Computer Sciences and is a core faculty member at the Network Science Institute. He is a fellow at the Institute for Quantitative Social Science (IQSS) at Harvard and the Center for Collective Intelligence at MIT. His work has been funded by NSF, ARO, ONR, and DARPA, and has been published in leading journals including Science, Nature Communications, Organization Science, Management Science, Information Systems Research, and Academy of Management Discoveries.

  • (4/15, 12:00pm): Arash Termehchy (Oregon State University): The Data Interaction Game details)

    TITLE
    The Data Interaction Game

    ABSTRACT
    As most users do not precisely know the structure and/or the content of databases, their queries do not exactly reflect their information needs. The database systems may interact with users and use their feedback on the returned results to learn the information needs behind their queries. Current query interfaces assume that users do not learn and modify the way they express their information needs during their interaction with the system. Using a real-world interaction workload, we show that users learn and modify how to express their information needs and their learning is accurately modeled by a well-known online learning scheme. We model the interaction between a user and data system as a game with identical interest between two rational agents whose goal is to establish a common language for representing information needs in form of queries. We propose an online query answering method that adapts to users' learning and prove that it improves the effectiveness of answering queries stochastically speaking. We provide two efficient implementations of this method over large relational databases and show that our algorithms are more effective than the state-of-the-art query answering method using extensive empirical studies.

    RELATED WORK
    * How Do Users and Data Systems Establish a Common Query Language? Ben McCamish, Vahid Ghadakchi, Arash Termehchy, Liang Huang and Behrouz Touri. SIGMOD Record on ACM SIGMOD Research Highlights 48 (1), ,2019
    https://sigmodrecord.org/?smd_process_download=1&download_id=3077
    * The Data Interaction Game Ben McCamish, Vahid Ghadakchi, Arash Termehchy, Behrouz Touri and Liang Huang. The Proceedings of SIGMOD, June 2018.
    http://web.engr.oregonstate.edu/~ghadakcv/assets/pdfs/data-game.pdf

    BIO
    Arash Termehchy is an Associate Professor at the School of Electrical Engineering & Computer Science in Oregon State University. He received his PhD from University Of Illinois at Urbana Champaign. His research interests are human-centric data systems, querying and learning on large & heterogeneous data, and using ML in complex systems. His work is recognized by the ACM SIGMOD Research Highlight Award, Best of Conference Selections of SIGMOD and ICDE, Best Student Paper Award of ICDE, and Yahoo! Key Scientific Challenges Award.

  • (4/22, 12:00pm): Saravanan Thirumuruganathan (Qatar Computing Research): Graph Embeddings for Data Integration and Cardinality Estimation details)

    TITLE
    Graph Embeddings for Data Integration and Cardinality Estimation

    ABSTRACT
    Embeddings are a learned representation that characterizes an entity as a dense high-dimensional vector. There has been extensive work on learning embeddings for diverse types of entities such as words, documents, images, videos, graphs, sets, proteins, and many more. Despite the seeming diversity, most of the database applications have limited themselves to word embeddings.
    In this talk, I will discuss two applications of graph embeddings in the disparate areas of data integration and cardinality estimation. The common idea is to construct an application-specific graph data structure to represent the underlying relational data and learn node embeddings using customized random walks. The learned embeddings have several appealing properties and can be used for various downstream tasks with excellent performance.

    RELATED WORK
    1. Riccardo Cappuzzo, Paolo Papotti, Saravanan Thirumuruganathan. Creating Embeddings of Heterogeneous Relational Datasets for Data Integration Tasks. SIGMOD 2020. Link: https://www.eurecom.fr/en/publication/6231/download/data-publi-6231.pdf
    2. Suraj Shetiya, Saravanan Thirumuruganathan, Nick Koudas, Gautam Das. Astrid: Accurate Selectivity Estimation for String Predicates using Deep Learning. VLDB 2021. Link: http://www.vldb.org/pvldb/vol14/p471-shetiya.pdf

    BIO
    Dr. Saravanan (Sara) Thirumuruganathan is a Scientist at the Qatar Computing Research Institute (QCRI). Recently, he is working on applying deep learning techniques for problems in data integration, query optimization, and fake news. He received his Ph.D. in 2015 from the University of Texas at Arlington. His work has received several awards including Best demo at ICDE 2019, Best of VLDB 2018 and 2012, 2019 SIGMOD Research highlights and a CACM research highlight.

  • (4/29, 12:00pm): Angela Bonifati (Université Lyon 1): Evaluating Top-k Queries with Inconsistency Degrees details)

    TITLE
    Evaluating Top-k Queries with Inconsistency Degrees

    ABSTRACT
    We study the problem of augmenting relational tuples with degrees of inconsistency, that reflect the satisfaction of denial constraints (DCs). We define a notion of inconsistent tuples with respect to a set of DCs and define two measures of inconsistency degrees, which consider single and multiple violations of constraints. In order to compute these measures, we leverage two models of provenance, namely why-provenance and provenance polynomials. We then investigate top-k queries that allow to rank the answer tuples by their inconsistency degrees. Since one of our measures is monotonic and the other non-monotonic, we design an integrated top-k algorithm to compute the top-k results of a query w.r.t. both inconsistency measures.
    By means of an extensive experimental study, we gauge the effectiveness of inconsistency-aware query answering and the efficiency of our algorithm with respect to a baseline, where query results are fully computed and ranked afterwards

    LINKS
    Paper (PVLDB 2020) http://www.vldb.org/pvldb/vol13/p2146-issa.pdf
    Video Link (YouTube): https://www.youtube.com/watch?v=lLzUo4k-gIs
    Video Link (bilibili): https://www.bilibili.com/video/av329339128?p=26

    BIO
    Angela Bonifati is a Professor of Computer Science at Lyon 1 University and affiliated with the CNRS Liris research lab since 2015. She received her Ph.D. from Politecnico di Milano in 2002 and right after she was a postdoctoral researcher at INRIA Rocquencourt in Paris for one year. Her current research interests are on the interplay of relational and graph-oriented data paradigms, particularly on data integration and Big Data curation for life sciences, query processing and learning for structured and unstructured data models. She has co-authored several publications in first-rate venues of the data management field along with two books (edited by Springer in 2011 and Morgan&Claypool in 2018) and an invited paper in ACM Sigmod Record in 2018. She is the Program Chair of ACM Sigmod 2022 and was the Program Chair of EDBT 2020. She is Associate Editor in several database conferences including VLDB 2021, ICDE 2021 and ICDE 2018. She is also Associate Editor of the VLDB Journal, ACM TODS, Distributed and Parallel Databases and Frontiers in Big Data. She is the President of the EDBT Executive Committee and a member of the ICDT council. She holds many visiting scholar positions in foreign universities in both Europe and North America, the latest of which at the University of Waterloo (Canada) in 2019. She is currently Adjunct Professor in this university.

Fall 2020 (Wolfgang)

  • (9/10, 12:00pm): Bert Huang (Tufts University): Speeding up Markov Random Field Inference and Learning details)

    TITLE
    Speeding up Markov Random Field Inference and Learning

    ABSTRACT
    Markov random fields (MRFs) are powerful statistical models that can encode global distributions with local relationships over variables. Unfortunately, algorithms for learning and inference have historically suffered poor scalability and difficult software engineering. In this talk, I’ll cover two strategies we’ve developed to improve the scalability of MRF algorithms on modern hardware, both that build on aspects that have made deep learning successful. First, I’ll discuss an abstraction that enables multiple approximate inference algorithms to be computed via sparse matrix operations. Like the dense matrix backends that have made deep learning software highly portable, sparse matrix operations are low-level, general computations, they are often efficiently optimized on a variety of computing environments. This abstraction allows seamless implementation on many hardware backends, making MRF algorithms fast on current and future computers. Second, I’ll discuss provably convergent stochastic learning algorithms that exploit repeated structure in MRF models. Finally, I will discuss the big picture surrounding these modeling approaches and their role in the modern era.

    RELATED WORK
    * Sparse-Matrix Belief Propagation Reid Bixler, Bert Huang. Conference on Uncertainty in Artificial Intelligence, 2018.
    http://berthuang.com/papers/bixler-uai18.pdf
    * Block Belief Propagation for Parameter Learning in Markov Random Fields You Lu, Zhiyuan Liu, Bert Huang. AAAI Conference on Artificial Intelligence, 2019.
    https://arxiv.org/abs/1811.04064

    BIO
    Bert Huang is an assistant professor in the Department of Computer Science and the Data Intensive Studies Center at Tufts University. He earned his Ph.D. from Columbia University in 2011, was a postdoctoral research associate at the University of Maryland, and previously was an assistant professor at Virginia Tech. His research addresses topics surrounding machine learning, including structured prediction, weakly supervised learning, and algorithmic fairness. His papers have been published at conferences including NeurIPS, ICML, UAI, and AISTATS, and he is an action editor for the Journal of Machine Learning Research.

  • (9/16, 12:00pm): Andreas Loukas (EPFL Lausanne): Erdoes Goes Neural: an Unsupervised Learning Framework for Combinatorial Optimization on Graphs details)

    TITLE
    Erdoes Goes Neural: an Unsupervised Learning Framework for Combinatorial Optimization on Graphs

    ABSTRACT
    Combinatorial optimization (CO) problems are notoriously challenging for neural networks, especially in the absence of labeled instances. This work proposes an unsupervised learning framework for CO problems on graphs that can provide integral solutions of certified quality. Inspired by Erd˝os’ probabilistic method, we use a neural network to parametrize a probability distribution over sets. Crucially, we show that when the network is optimized w.r.t. a suitably chosen loss, the learned distribution contains, with controlled probability, a low-cost integral solution that obeys the constraints of the combinatorial problem. The probabilistic proof of existence is then derandomized to decode the desired solutions. We demonstrate the efficacy of this approach to obtain valid solutions to the maximum clique problem and to perform local graph clustering. Our method achieves competitive results on both real datasets and synthetic hard instances.

    RELATED WORK
    https://arxiv.org/abs/2006.10643

    BIO
    I am a research scientist (Ambizione fellow) at the LTS2 lab in EPFL, Switzerland. Previously, I spent time at TU Berlin, Ben Gurion, and TU Delft. My research focuses on the foundations and applications of graph methods in machine learning and data science. I aim to find elegant explanations for phenomena associated with learning and to exploit them in order to design specialized learning machines. I am also interested in graph problems in signal processing and theoretical computer science, as well as in the analysis of neural networks.
    https://andreasloukas.blog/

  • (9/24, 12:00pm): Laurent Lessard (Northeastern): Automating the analysis and design of large-scale optimization algorithms details)

    TITLE
    Automating the analysis and design of large-scale optimization algorithms

    ABSTRACT
    Most complicated optimization problems, in particular those involving a large number of variables, are solved in practice using iterative algorithms. The problem of selecting a suitable algorithm is currently more of an art than a science; a great deal of expertise is required to know which algorithms to try and how to properly tune them. Moreover, there are seldom performance guarantees. In this talk, I will show how the problem of algorithm selection can be approached using tools from robust control theory. By solving simple semidefinite programs (that do not scale with problem size), we can derive robust bounds on convergence rates for popular algorithms such as the gradient method, proximal methods, fast/accelerated methods, and operator-splitting methods such as ADMM. The bounds derived in this manner either match or improve upon the best known bounds from the literature. The bounds also lead to a natural energy dissipation interpretation and an associated Lyapunov function. Finally, our framework can be used to search for algorithms that meet desired performance specifications, thus establishing a principled methodology for designing new algorithms. The talk will be broadly accessible; no prior control theory knowledge needed!

    RELATED WORK
    https://laurentlessard.com/public/siopt16_iqcopt.pdf

    BIO
    Laurent Lessard is a new Associate Professor of Mechanical and Industrial Engineering at Northeastern University (joined in Fall 2020), with courtesy appointments in ECE and Khoury. Before coming to Northeastern, Laurent was a Charles Ringrose Assistant Professor of Electrical and Computer Engineering at the UW–Madison and Faculty Member at the Wisconsin Institute for Discovery. He received the B.A.Sc. in Engineering Science from the University of Toronto, and received the M.S. and Ph.D. in Aeronautics and Astronautics at Stanford University. After completing his doctoral work, he was an LCCC Postdoc at Lund University in Sweden, and a postdoctoral researcher at the University of California, Berkeley. Dr. Lessard is a recipient of the Hugo Schuck best paper award and the NSF CAREER award. His research interests include: decentralized control, robust control, optimization, and machine learning. For more information, see https://laurentlessard.com/

  • (9/30, 12:00pm): Xin Luna Dong (Amazon): Self-driving product understanding for thousands of categories details)

    TITLE
    Self-driving product understanding for thousands of categories

    ABSTRACT
    Knowledge graphs have been used to support a wide range of applications and enhance search results for multiple major search engines, such as Google and Bing. At Amazon we are building a Product Graph, an authoritative knowledge graph for all products in the world. The thousands of product verticals we need to model, the vast number of data sources we need to extract knowledge from, the huge volume of new products we need to handle every day, and the various applications in Search, Discovery, Personalization, Voice, that we wish to support, all present big challenges in constructing such a graph.
    In this talk we describe our efforts for self-driving knowledge collection for products of thousands of types. The system includes a suite of novel techniques for taxonomy construction, product property identification, knowledge extraction, anomaly detection, and synonym discovery. Our system is a) automatic, requiring little human intervention, b) multi-scalable, scalable in multiple dimensions including many domains, products, and attributes, and c) integrative, exploiting rich customer behavior logs. We describe what we learned in building this product graph and applying it to support customer-facing applications.

    PAPERS
    - Xin Luna Dong, Xiang He, Andrey Kan, Xian Li, Yan Liang, Jun Ma, Yifan Ethan Xu, Chenwei Zhang, Tong Zhao, Gabriel Blanco Saldana, Saurabh Deshpande, Alexandre Michetti Manduca, Jay Ren, Surender Pal Singh, Fan Xiao, Haw-Shiuan Chang, Giannis Karamanolakis, Yuning Mao, Yaqing Wang, Christos Faloutsos, Andrew McCallum, Jiawei Han. AutoKnow: Self-Driving Knowledge Collection for Products of Thousands of Types. In KDD, 2020.
    https://arxiv.org/abs/2006.13473
    - Giannis Karamanolakis, Jun Ma, Xin Luna Dong. TXtract: Taxonomy-aware knowledge extraction for thousands of product categories. In ACL, 2020.
    https://arxiv.org/abs/2004.13852
    - Yuning Mao, Tong Zhao, Andrey Kan, Chenwei Zhang, Xin Luna Dong, Christos Faloutsos, Jiawei Han. Octet: Online catalog taxonomy enrichment with self-supervision. In SigKDD, 2020.
    https://arxiv.org/abs/2006.10276

    BIO
    Xin Luna Dong is a Senior Principal Scientist at Amazon, leading the efforts of constructing Amazon Product Knowledge Graph. She was one of the major contributors to the Google Knowledge Vault project, and has led the Knowledge-based Trust project, which is called the “Google Truth Machine” by Washington’s Post. She has co-authored book “Big Data Integration”, was awarded ACM Distinguished Member, VLDB Early Career Research Contribution Award for “advancing the state of the art of knowledge fusion”, and Best Demo award in Sigmod 2005. She serves in VLDB endowment and PVLDB advisory committee, and is a PC co-chair for VLDB 2021, KDD'2020 ADS Invited Talk Series, ICDE Industry 2019, VLDB Tutorial 2019, and Sigmod 2018.​

  • (12/2, 12:00pm): Yinjun Wu (UPenn): Rapid retraining of machine learning models details)

    TITLE
    Provenance-aided approaches to incrementally updating machine learning models

    ABSTRACT
    The ubiquitous use of machine learning algorithms brings new challenges to traditional database problems such as incremental view update on machine learning models. The problem of incrementally maintaining machine learning models as views occurs in many applications, e.g. better understanding the importance of training samples to the machine learning models, as well as identifying the effect of outliers in training datasets. However, this problem has not been thoroughly studied yet, especially when the updates of the models are triggered by the deletions of training samples. In this talk, first of all, I will present an efficient provenance-based approach, PrIU, for incrementally updating linear regression and logistic regression models. Plus, I will also introduce one extended solution for handling a more broader model class, i.e. strongly convex machine learning models. We prove the correctness and convergence of the incrementally updated model parameters by using those two approaches and validate this experimentally. Experimental results also show that significant speed-ups can be achieved by using those two approaches compared to simply retraining the model from scratch. More importantly, the incrementally updated models by using our approaches do not hurt the prediction performance on the test dataset.

    RELATED WORK
    * SIGMOD 2020: PrIU: A Provenance-Based Approach for Incrementally Updating Regression Models
    https://dl.acm.org/doi/abs/10.1145/3318464.3380571
    * ICML 2020: DeltaGrad: Rapid retraining of machine learning models
    https://arxiv.org/abs/2006.14755

    BIO
    Yinjun Wu is a fifth-year PhD candidate at the department of computer and information science at University of Pennsylvania. His research interests lie at the intersection of data science and machine learning.

  • (12/11, 12:00pm): Chris Jermaine (Rice University): The Tensor-Relational Algebra, and Other Ideas in Machine Learning System Design details)

    TITLE
    The Tensor-Relational Algebra, and Other Ideas in Machine Learning System Design

    ABSTRACT
    One area in which systems for machine learning are wanting (TensorFlow, PyTorch) is in their support for Big Models and Big Data. In contrast to modern relational systems which scale to large data sizes and multiple machines quite well out of the box, getting machine learning computations to work in a distributed setting or with large models is often very challenging. In this talk, I argue that the fundamental problem is lack of abstraction in these systems. I argue that it makes sense to re-design these systems from the ground up, applying many of the lessons from the heyday of relational database system design in the 1970's and 80's.

    RELATED WORK
    https://arxiv.org/abs/2009.00524

    BIO
    Chris Jermaine is a Professor of Computer Science at Rice University, and directs Rice's Data Science Initiative. He is the recipient of an Alfred P. Sloan Foundation Research Fellowship, a National Science Foundation CAREER award, and the George R. Brown School's Teaching & Research Excellence Award. He has received best paper/best paper runner-up awards from top journals/conferences in data mining and data management, including IEEE ICDE, ACM SIGMOD, ACM SIGKDD, and VLDB, as well as the IBM Pat Goldberg Award for the best IBM paper of 2008. He currently serves as the editor-in-chief of ACM Transactions on Database Systems, the ACM’s flagship journal for data management research.

  • (12/16, 1:30pm): Sainyam Galhotra (UMass Amherst): Clustering with Applications to Data Integration details)

    TITLE
    Clustering with Applications to Data Integration

    ABSTRACT
    A growing number of data-based applications are used for decision-making that have far-reaching consequences and significant societal impact. Entity resolution, community detection and taxonomy construction are some of the building blocks of these applications and for these methods, clustering is the fundamental underlying concept. Therefore, the use of accurate, robust and fair methods for clustering cannot be overstated. We tackle the various facets of clustering with a multi-pronged approach described below.
    (i) While identification of clusters that refer to different entities is challenging for automated strategies, it is relatively easy for humans. We study the robustness and scalability of these methods that leverage supervision through an oracle i.e an abstraction of crowdsourcing. We further design techniques to construct taxonomies capturing type-subtype relation over the identified entities.
    (ii) In community detection applications, a common setback in evaluation of the quality of clustering techniques is the lack of ground truth data. We propose a generative model to capture interactions between records that belong to different clusters and devise techniques for efficient cluster recovery.
    (iii) The manifestation of bias in data could arise due to discriminatory treatment of marginalized groups, sampling methods or even measurement errors in the data. We study the impact of this bias on generated clusters and develop techniques that guarantee fair representation from different groups. We prove the noise tolerance of our algorithms and back the theory by demonstrating the efficacy and efficiency on various real world datasets for these applications.

    RELATED PAPERS
    * Balancing the Tradeoff Between Clustering Value and Interpretability
    https://doi.org/10.1145/3375627.3375843
    * Fair Correlation Clustering
    https://arxiv.org/abs/2002.03508

    BIO
    Sainyam Galhotra is a final year PhD student at University of Massachusetts Amherst. Before joining PhD, he was a researcher at Xerox Research and did undergraduate in computer science from IIT Delhi. His research is broadly in the area of data management with a specific focus on designing algorithms to not only be efficient but also transparent and equitable in their decision-making capabilities. He is a recipient of the Best Paper Award in FSE 2017, Most Reproducible Paper Award in SIGMOD 2017 and 2018. He is the first recipient of the Krithi Ramamritham Award at UMass for contribution to database research.

Summer 2020 (Wolfgang)

  • (5/8, 12:00pm): Panagiotis Mandros (Max Planck Institute): Functional dependencies and how to find them details)

    TITLE
    Functional Dependencies and how to find them

    ABSTRACT
    Functional dependencies come in many forms and shapes since they answer the fundamental question "what are the relevant features for a given target" that naturally occurs in many applications. These applications can range from database normalization, all the way to answering questions such as "does this bad habit cause this bad disease". In my talk, I will give a brief overview of these different FDs, and demonstrate how one can go from one end of the spectrum, i.e., database keys, to the other, i.e., causes and effects. By doing so, I hope to clarify a few things regarding which FD approach is appropriate for which application. At some point during the talk, I will present our solution to the FD discovery problem for exploratory data analysis. Our solution, DORA, finds the top FDs for a given target, is interpretable, i.e., facilitates the analysis of the results, is robust, i.e., the results can be trusted, and is very efficient in practice. Finally, I will end my talk by showing that DORA tackles the FD problem in a principled manner, contrary to a certain method that claims to have solved the FD problem once and for all.

    RELATED PAPERS
    * ICDM 2018: Discovering Reliable Dependencies from Data: Hardness and Improved Algorithms (best paper award)
    Mandros, P, Boley, M, & Vreeken, J
    http://eda.mmci.uni-saarland.de/pubs/2018/fedora-mandros,boley,vreeken.pdf
    * KDD 2017: Discovering Reliable Approximate Functional Dependencies
    Mandros, P, Boley, M, & Vreeken, J
    http://eda.mmci.uni-saarland.de/pubs/2017/dora-mandros,boley,vreeken.pdf

    BIO
    Panagiotis Mandros is a PhD candidate at the Max Planck Institute for Informatics and the University of Saarland, Germany. He works on the intersection of Information Theory and Optimization, developing theory and algorithms for discovering associations in high-dimensional data. He is particularly interested in Exploratory Data Analysis, i.e., the principle of exploring data without assumptions on the underlying data generating process, and has contributed methods for finding "patterns" in both unsupervised and supervised scenarios (e.g., functional dependencies, Markov blankets). His work is published at KDD, ICDM, and is the recipient of the 2018 IEEE ICDM best paper award.

  • (5/29, 12:30pm): Hung Ngo (Relational AI): Answering (Functional Aggregate) Queries via Tensor Decomposition details)

    TITLE
    Answering (Functional Aggregate) Queries via Tensor Decomposition

    ABSTRACT
    In this talk I will present a framework to think about and to design efficient algorithms for answering a generic class of queries called Functional Aggregate Queries (or FAQs). FAQs can be used to capture queries from a while range of application areas: databases, machine learning, CSPs, for instance. The tensor decomposition framework can be used to explain recent advances in CSP and database query processing such as the notions of fractional hypertree widths, submodular widths, worst-case optimal join algorithms, or the junction tree algorithm in graphical models. 

    The talk is based on several joint works with Mahmoud Abo Khamis, Ryan Curtin, Ben Moseley, Dan Olteanu, Atri Rudra, Maximilian Schleich, and Dan Suciu.

    RELATED SLIDES
    * FANDA: https://cse.buffalo.edu/~hungngo/papers/long-panda-slides.pdf
    * FAQ: https://cse.buffalo.edu/~hungngo/papers/faq-insideout.pdf

    BIO
    Hung Ngo is currently VP of Algorithms at relationalAI. Before that, he was a professor at SUNY Buffalo, and a Computer Scientist with LogicBlox. His current research interests are in designing and implementing algorithms for solving computational problems defined declaratively.

  • (6/5, 12:30pm): Nikos Tizavelis (Khoury): Optimal join algorithms meet top-k details)

    TITLE
    Optimal Join Algorithms Meet Top-k

    ABSTRACT
    Top-k queries have been studied intensively in the database community and they are an important means to reduce query cost when only the “best” or “most interesting” results are needed instead of the full output. While some optimality results exist, e.g., the famous Threshold Algorithm, they hold only in a fairly limited model of computation that does not account for the cost incurred by large intermediate results and hence is not aligned with typical database-optimizer cost models. On the other hand, the idea of avoiding large intermediate results is arguably the main goal of recent work on optimal join algorithms, which uses the standard RAM model of computation to determine algorithm complexity. This research has created a lot of excitement due to its promise of reducing the time complexity of join queries with cycles, but it has mostly focused on full-output computation. We argue that the two areas can and should be studied from a unified point of view in order to achieve optimality in the common model of computation for a very general class of top-k-style join queries. This tutorial has two main objectives. First, we will explore and contrast the main assumptions, concepts, and algorithmic achievements of the two research areas. Second, we will cover recent, as well as some older, approaches that emerged at the intersection to support efficient ranked enumeration of join-query results. These are related to classic work on k-shortest path algorithms and more general optimization problems, some of which dates back to the 1950s. We demonstrate that this line of research warrants renewed attention in the challenging context of ranked enumeration for general join queries.

    RELATED WORK
    Based on an upcoming SIGMOD 2020 tutorial and VLDB 2020 paper:
    https://arxiv.org/pdf/2005.00448
    https://arxiv.org/pdf/1911.05582
    https://northeastern-datalab.github.io/anyk/
    https://northeastern-datalab.github.io/topk-join-tutorial/

  • (6/10, 12:00pm): Deepak Ajwani (UC Dublin): Average-case behavior of k-shortest path algorithms details)

    TITLE
    Average-case behavior of k-shortest path algorithms

    ABSTRACT
    The k-shortest path problem is a generalization of the fundamental shortest path problem, where the goal is to compute k simple paths from a given source to a target node, in non-decreasing order of their weight. With numerous applications modeling various optimization problems and as a feature in some learning systems, there is a need for efficient algorithms for this problem. Unfortunately, despite many decades of research, the best directed graph algorithm still has a worst-case asymptotic complexity of ˜O(kn(n+m)). In contrast to the worst-case complexity, many algorithms have been shown to perform well on small diameter directed graphs in practice. In this paper, we prove that the average-case complexity of the popular Yen’s algorithm on directed random graphs with edge probability p = Ω(logn)/n in the unweighted and uniformly distributed weight setting is O(kmlogn), thus explaining the gap between the worst-case complexity and observed empirical performance. While we also provide a weaker bound of O(kmlog 4 n) for sparser graphs with p ≥ 4/n, we show empirical evidence that the stronger bound should also hold in the sparser setting. We then prove that Feng’s directed k-shortest path algorithm computes the second shortest path in expected O(m) time on random graphs with edge probability p = Ω(logn)/n. Empirical evidence suggests that the average-case result for the Feng’s algorithm holds even for k > 2 and sparser graphs.

    RELATED PAPER
    Complex networks 2018: https://researchrepository.ucd.ie/bitstream/10197/9887/2/ajwani_complex_networks18.pdf

    BIO
    Dr. Deepak Ajwani is an Assistant Professor at the School of Computer Science, University College Dublin. His research is at the confluence of diverse areas such as algorithms and data structures (with a focus on scalable graph algorithms), algorithm engineering, combinatorial optimization and machine learning.
    He received his Ph.D. from Max Planck Institute for Informatics in 2008, for his work on I/O-efficient graph traversal algorithms. From 2008-2010, he worked as a Postdoctoral Researcher at MADALGO - Centre for Massive Data Algorithms - where he developed shared memory multicore algorithms for discrete optimization problems. From 2010-2012, he worked on an IRC funded project at University College Cork, in collaboration with IBM Research, for a project on designing graph partitioning and repartitioning techniques in the context of Exascale stream computing systems. From 2012-2018, he worked as a research scientist at Nokia Bell Labs, on the design and development of a cognitive computing tool for interpreting, organizing and navigating unstructured content.

  • (6/26, 12:00pm): Pablo Barcelo (PUC Chile): The Logical Expressiveness of Graph Neural Networks details)

    TITLE
    The Logical Expressiveness of Graph Neural Networks

    ABSTRACT
    The ability of graph neural networks (GNNs) for distinguishing nodes in graphs has been recently characterized in terms of the Weisfeiler-Lehman (WL) test for checking graph isomorphism. This characterization, however, does not settle the issue of which Boolean node classifiers (i.e., functions classifying nodes in graphs as true or false) can be expressed by GNNs. We tackle this problem by focusing on Boolean classifiers expressible as formulas in the logic FOC2, a well-studied fragment of first order logic. FOC2 is tightly related to the WL test, and hence to GNNs. We start by studying a popular class of GNNs, which we call AC-GNNs, in which the features of each node in the graph are updated, in successive layers, only in terms of the features of its neighbors. We show that this class of GNNs is too weak to capture all FOC2 classifiers, and provide a syntactic characterization of the largest subclass of FOC2 classifiers that can be captured by AC-GNNs. This subclass coincides with a logic heavily used by the knowledge representation community. We then look at what needs to be added to AC-GNNs for capturing all FOC2 classifiers. We show that it suffices to add readout functions, which allow to update the features of a node not only in terms of its neighbors, but also in terms of a global attribute vector. We call GNNs of this kind ACR-GNNs.

    RELEATED WORK
    ICLR 2020 paper: https://openreview.net/forum?id=r1lZ7AEKvB
    Previous work on the relationship between WL and GNNs:
    a) Weisfeiler and Leman Go Neural: Higher-Order Graph Neural Networks https://aaai.org/ojs/index.php/AAAI/article/view/4384
    b) How Powerful are Graph Neural Networks? https://openreview.net/forum?id=ryGs6iA5Km

    BIO
    Pablo Barcelo is Full Professor at Pontificia Universidad Católica de Chile, where he also acts as Director of the Institute for Mathematical and Computational Engineering, and Deputy Director of Millennium Institute for Foundational Research on Data. He is the author of more than 70 technical papers, has chaired ICDT 2019, will be chairing ACM PODS 2022, and is currently a member of the editorial committee of Logical Methods in Computer Science. From 2011 to 2014 he was the editor of the database theory column of SIGMOD Record. His areas of interest are database theory, logic in computer science, and the emerging relationship between these areas and machine learning.

  • (7/1, 12:00pm): Jan Tönshoff (RWTH Aachen): Unsupervised Machine Learning for Constraint Satisfaction Problems details)

    TITLE
    Unsupervised Machine Learning for Constraint Satisfaction Problems

    ABSTRACT
    Constraint Satisfaction Problems (CSPs) are a powerful theoretical framework for modeling a wide range of combinatorial problems. I will present a novel neural network architecture for solving CSPs, which was developed by our research group. The architecture is generic and can be applied to any binary CSP. Training is unsupervised and it suffices to train on small random instances. The trained networks generalize well to much larger and structurally complex instances. Experiments on Max-2-Sat, Max-Cut and Max-IS demonstrate that our approach matches or surpasses conventional algorithms on several NP-hard problems, while being simple, generic and efficient.

    RELATED PAPER
    Graph Neural Networks for Maximum Constraint Satisfaction: https://arxiv.org/abs/1909.08387

    BIO
    I am a PhD student at RWTH Aachen University, where I work for Prof. Martin Grohe at the Chair of Logic and Theory of Discrete Systems. My research mainly focuses on applying machine learning to relational data structures, such as graphs, CSPs and databases.

  • (7/3, 12:00pm): Leopoldo Bertossi (Adolfo Ibanez University, Santiago, Chile): Score-Based Explanations in Data and Machine Learning details)

    TITLE
    Score-Based Explanations in Data and Machine Learning

    ABSTRACT
    Explanations have become important in data management and machine learning. In this presentation we review some recent work on the use of counterfactual interventions, as they appear in actual causality, to provide scores that measure the contribution of database tuples to query answering, and that of feature values to results of classification models.

    RELATED PAPERS
    * Livshits, Bertossi, Kimelfeld, Sebag. The Shapley Value of Tuples in Query Answering. ICDT 2020.
    http://arxiv.org/abs/1904.08679
    * Bertossi, Li, Schleich, Suciu, Vagena. Causality-based Explanation of Classification Outcomes. DEEM@SIGMOD 2020.
    http://arxiv.org/abs/2003.06868
    * Bertossi. An ASP-Based Approach to Counterfactual Explanations for Classification. RuleML-RR 2020.
    https://arxiv.org/abs/2004.13237

    BIO
    Leopoldo Bertossi is a Full Professor at the Faculty of Engineering and Sciences, "Universidad Adolfo Ibáñez" (Santiago, Chile), where he is the Director of the Graduate Program in Data Science. He is a Bachelor, Master and PhD in Mathematics by the "Pontificia Universidad Catolica de Chile" (PUC). His research has been concentrated in the areas of computational logic, knowledge representation, and data management. His research interests are related to Data Science and Artificial Intelligence in general. His is an Emeritus Professor of the School of Computer Science at Carleton University (Ottawa, Canada), Senior Computer Scientist at RelationalAI Inc., and Senior Researcher at the "Millenium Institute on Foundations of Data" (IMFD, Chile).

  • (7/8, 12:00pm): Ji-Yong Shin (Yale / Northeastern): Isolation in cloud storage systems details)

    TITLE
    Isolation in cloud storage systems

    ABSTRACT
    Guaranteeing isolation in cloud storage systems is challenging due to heterogeneous user workloads and concurrent I/O requests. This talk focuses on improving two types of isolation support -- transactional isolation and fine-grained client-centric consistency control -- in cloud storage systems.
    In the first part of the talk, the design and implementation of a block-level ACID transactional storage system called Isotope will be discussed. While transactional isolation has traditionally been implemented toward high levels of the storage stack, Isotope implements ACID transactions at the lowest common layer of the storage stack to fulfill the general need for concurrency control. Isotope demonstrates that the block-level transaction support trivializes the construction of transactional systems and even cross-application transactions.
    The second part of the talk presents a new class of storage systems called StaleStore, which trades off consistency and performance within a storage server. StaleStore is designed based on the observation that modern servers are as powerful and parallel as distributed systems in the past and employs client-centric consistency (also known as distributed weak consistency) within a server. StaleStore accelerates server and distributed applications by returning an older version of data within a server if accessing the latest version is expected to be slower.

    RELEATED PAPERS
    - Isotope conference version: https://www.usenix.org/system/files/conference/fast16/fast16-papers-shin.pdf
    - Isotope journal version: https://dl.acm.org/doi/pdf/10.1145/3032967
    - StaleStore conference version: https://dl.acm.org/doi/pdf/10.1145/2987550.2987579

    BIO
    Ji-Yong Shin is an associate research scientist in the Department of Computer Science at Yale University and will be joining Khoury College of Computer Science in the coming fall. His research interests are in designing novel distributed systems and exploring practical formal verification methods that can be applied to system designs. He received his Ph.D. from Cornell University, where he designed cloud storage systems with enhanced isolation support and a completely wireless datacenter.

  • (7/10, 12:00pm): Hongyang Zhang (Stanford / Wharton/ Northeastern): On the Generalization Effects of Linear Transformations in Data Augmentation details)

    TITLE
    On the Generalization Effects of Linear Transformations in Data Augmentation

    ABSTRACT
    Data augmentation is a powerful technique to improve performance in applications such as image and text classification tasks. Yet, there is little rigorous understanding of why and how various augmentations work. In this work, we consider a family of linear transformations and study their effects on the ridge estimator in an over-parametrized linear regression setting. First, we show that transformations which preserve the labels of the data can improve estimation by enlarging the span of the training data. Second, we show that transformations which mix data can improve estimation by playing a regularization effect. Finally, we validate our theoretical insights on MNIST. Based on the insights, we propose an augmentation scheme that searches over the space of transformations by how uncertain the model is about the transformed data. We validate our proposed scheme on image and text datasets. For example, our method outperforms RandAugment by 1.24% on CIFAR-100 using Wide-ResNet-28-10. Furthermore, we achieve comparable accuracy to the SoTA Adversarial AutoAugment on CIFAR datasets.

    RELATED WORK
    Based on joint work with Sen Wu, Greg Valiant, and Chris Re that appeared in ICML’20: https://arxiv.org/abs/2005.00695.

    BIO
    Hongyang Zhang is a postdoctoral researcher at the Statistics Department at Wharton School, University of Pennsylvania. He received a Ph.D. from Stanford University and a B.E. from Shanghai Jiao Tong University, both in computer science. His research interests lie in the intersection of machine learning, algorithms and statistics, including topics such as non-convex optimization, multi-task and transfer learning, weakly supervised learning, neural networks and its theory. He is a recipient of the best paper award at COLT 2018.

  • (7/17, 12:00pm): Zach Ives (UPenn): Juneau: Managing and Guiding Data Science details)

    TITLE
    Juneau: Managing and Guiding Data Science

    ABSTRACT
    Many modern data science applications build on data lakes, schema-agnostic repositories of data files and data products that offer limited organization and management capabilities. There is a need to build data lake search capabilities into data science environments, so scientists and analysts can find tables, schemas, workflows, and datasets useful to their task at hand. We develop search and management solutions for the Jupyter Notebook data science platform, to enable scientists to augment training data, find potential features to extract, clean data, and find joinable or linkable tables. Our core methods also generalize to other settings where computational tasks involve execution of programs or scripts.

    RELATED PAPERS
    CIDR 2019: http://cidrdb.org/cidr2019/papers/p55-ives-cidr19.pdf
    SIGMOD 2020: https://dl.acm.org/doi/pdf/10.1145/3318464.3389726

    BIO
    Zachary Ives is the Department Chair and Adani President's Distinguished Professor of Computer and Information Science at the University of Pennsylvania. He is a co-founder of Blackfynn, Inc., a company focused on enabling life sciences research and discovery through data integration. Zack's research interests include data integration and sharing, managing "big data," sensor networks, and data provenance and authoritativeness. He is a recipient of the NSF CAREER award, and an alumnus of the DARPA Computer Science Study Panel and Information Science and Technology advisory panel. He has also been awarded the Christian R. and Mary F. Lindback Foundation Award for Distinguished Teaching. He is a co-author of the textbook Principles of Data Integration, and has received an ICDE 2013 ten-year Most Influential Paper award as well as the 2017 SWSA Ten-Year Award at the International Semantic Web Conference. He has been an Associate Editor for Proceedings of the VLDB Endowment (2014) and a Program Co-Chair (2015) and Group Leader (2018) for the ACM SIGMOD conference.

  • (7/22, 9:00am): Nofar Carmeli (Technion): Answering (Unions of) Conjunctive Queries using Random Access and Random-Order Enumeration details)

    TITLE
    Answering (Unions of) Conjunctive Queries using Random Access and Random-Order Enumeration

    ABSTRACT
    As data analytics becomes more crucial to digital systems, so grows the importance of characterizing the database queries that admit a more efficient evaluation. We consider the tractability yardstick of answer enumeration with a logarithmic delay after a lineartime preprocessing phase. Such an evaluation is obtained by constructing, in the preprocessing phase, a data structure that supports logarithmic-delay enumeration. In this talk, we seek a structure that supports the more demanding task of a “random permutation”: logarithmic-delay enumeration in truly random order. Enumeration of this kind is required if downstream applications assume that the intermediate results are representative of the whole result set in a statistically valuable manner. An even more demanding task is that of a “random access”: logarithmic-time retrieval of an answer whose position is given. We establish that the free-connex acyclic CQs are tractable in all three senses: enumeration, random-order enumeration, and random access; and in the absence of self-joins, it follows from past results that every other CQ is intractable by each of the three (under some fine-grained complexity assumptions). However, the three yardsticks are separated in the case of a union of CQs (UCQ): while a union of free-connex acyclic CQs has a tractable enumeration, it may (provably) admit no random access. For such UCQs we devise a random-order enumeration whose delay is logarithmic in expectation.

    RELATED WORK
    PODS 2020: https://dl.acm.org/doi/pdf/10.1145/3375395.3387662

    BIO
    Nofar Carmeli is a Ph.D. student in the Data and Knowledge group at Technion, Israel Institute of Technology, advised by Prof. Benny Kimelfeld. Her research focuses on query optimization with guarantees using enumeration techniques. Nofar completed her BSc in 2015 in the Lapidim excellence program of the Computer Science Department of Technion.

  • (7/24, 12:00pm): Jorge Perez (Universidad de Chile): On the Turing Completeness of Modern Neural Network Architectures details)

    TITLE
    On the Turing Completeness of Modern Neural Network Architectures

    ABSTRACT
    Alternatives to recurrent neural networks, in particular, architectures based on attention or convolutions, have been gaining momentum for processing input sequences. In spite of their relevance, the computational properties of these alternatives have not yet been fully explored. We study the computational power of two of the most paradigmatic architectures exemplifying these mechanisms: the Transformer (Vaswani et al., 2017) and the Neural GPU (Kaiser & Sutskever, 2016). We show both models to be Turing complete exclusively based on their capacity to compute and access internal dense representations of the data. In particular, neither the Transformer nor the Neural GPU requires access to an external memory to become Turing complete. Our study also reveals some minimal sets of elements needed to obtain these completeness results.

    RELATED WORK
    ICLR 2019: https://openreview.net/forum?id=HyGBdo0qFm

    BIO
    Jorge Pérez is Associate Professor in the Department of Computer Science at Universidad de Chile, and Associate Researcher at the Millennium Institute for Foundational Research on Data. His research interests include data exchange and integration, Web data, and the theory of modern neural network architectures. He has received several awards for his research, including the best paper award in 5 international conferences, the Microsoft Research PhD Fellowship, and the SWSA Ten-Years Award for his work on query languages for Web data. His interests also include the analysis of social, medical and political text data, particularly in Spanish.

  • (7/29, 12:00pm): Nima Dehmamy (Kellog @ Northwestern University): Understanding the representation power of graph neural networks in learning graph topology details)

    TITLE
    Understanding the representation power of graph neural networks in learning graph topology

    ABSTRACT
    To deepen our understanding of graph neural networks, we investigate the representation power of Graph Convolutional Networks (GCN) through the looking glass of graph moments, a key property of graph topology encoding path of various lengths. We find that GCNs are rather restrictive in learning graph moments. Without careful design, GCNs can fail miserably even with multiple layers and nonlinear activation functions. We analyze theoretically the expressiveness of GCNs, concluding a modular GCN design, using different propagation rules with residual connections could significantly improve the performance of GCN. We demonstrate that such modular designs are capable of distinguishing graphs from different graph generation models for surprisingly small graphs, a notoriously difficult problem in network science. Our investigation suggests that, depth is much more influential than width, with deeper GCNs being more capable of learning higher order graph moments. Additionally, combining GCN modules with different propagation rules is critical to the representation power of GCNs.

    RELATED WORK
    NeurIPS paper 2019: https://arxiv.org/pdf/1907.05008

    BIO
    Nima Dehmamy is a research assistant professor at Northwestern Institute for Complex Systems in Kellogg School of Management at Northwestern University, Evanston IL, USA. He is a physicist working on complex systems and theory of machine learning. His research involves AI in graph learning, using physics to understand optimization landscapes and neuroscience. He earned his PhD in physics with Eugene Stanley at Boston University in 2016 and he was a postdoctoral fellow at the Barabasi Lab at Northeastern University, Boston, MA.

  • (7/31, 12:00pm): Donghui Zhang (Google): Spanner's SQL evolution details)

    TITLE
    Spanner's SQL Evolution

    ABSTRACT
    This talk introduces Google's distributed RDBMS called Spanner. It will cover Spanner's SQL interface, distributed query processing, and some lessons learned, as well as blockwise columnar storage.

    Related publications
    https://www.usenix.org/system/files/conference/osdi12/osdi12-final-16.pdf
    https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/46103.pdf

    BIO
    Donghui Zhang has been working in the database field since he got a Ph.D. from University of California, Riverside in 2002. His first job was an Assistant Professor at Northeastern University. In 2009 he left academia and joined the Microsoft Jim Gray Systems Lab where he researched on the best usage of Flash SSDs inside Microsoft SQL Server, and on lock-free hash table. Between 2012 and 2017 he led the engineering in Paradigm4, a Stonebraker startup building SciDB, a distributed computational database system. After that, he worked in Facebook's data warehouse for two years before joining Google's Spanner team.

  • (8/5, 12:00pm): Juan Sequeda (data.world): The Socio-Technical Phenomena of Data Integration details)

    TITLE
    The Socio-Technical Phenomena of Data Integration

    ABSTRACT
    Data Integration has been an active area of computer science research for over two decades. A modern manifestation is as Knowledge Graphs which integrates not just data but also knowledge at scale. Tasks such as Domain modeling and Schema/Ontology Matching are fundamental in the data integration process. Research focus has been on studying the data integration phenomena from a technical point of view (algorithms and systems) with the ultimate goal of automating this task.
    In the process of applying scientific results to real world enterprise data integration scenarios to design and build Knowledge Graphs, we have experienced numerous obstacles. In this talk, I will share insights about these obstacles. I will argue that we need to think outside of a technical box and further study the phenomena of data integration with a human-centric lens: from a socio-technical point of view.

    RELATED WORK
    A Pay-as-you-go Methodology to Design and Build Enterprise Knowledge Graphs from Relational Databases, ISWC 2019: https://github.com/juansequeda/papers/raw/master/iswc2019.pdf

    BIO
    Juan F. Sequeda is the Principal Scientist at data.world. He joined through the acquisition of Capsenta, a company he founded as a spin-off from his research. He holds a PhD in Computer Science from The University of Texas at Austin. Juan is the recipient of the NSF Graduate Research Fellowship, received 2nd Place in the 2013 Semantic Web Challenge for his work on ConstituteProject.org, Best Student Research Paper at the 2014 International Semantic Web Conference and the 2015 Best Transfer and Innovation Project awarded by the Institute for Applied Informatics. Juan is on the Editorial Board of the Journal of Web Semantics, member of multiple program committees (ISWC, ESWC, WWW, AAAI, IJCAI). He was the General Chair of AMW2018, PC chair of ISWC 2017 In-Use track, co-creator of COLD workshop (7 years co-located at ISWC). He has served as a bridge between academia and industry as the current chair of the Property Graph Schema Working Group, member of the Graph Query Languages task force of the Linked Data Benchmark Council (LDBC) and past invited expert member and standards editor at the World Wide Web Consortium (W3C). Wearing his scientific hat, Juan's goal is to reliably create knowledge from inscrutable data. His research interests are on the intersection of Logic and Data for (ontology-based) data integration and semantic/graph data management, and what now is called Knowledge Graphs. Wearing his business hat, Juan is a product manager, does business development and strategy, technical sales and works with customers to understand their problems to translated back to R&D.

  • (8/7, 12:00pm): Vikas Garg (MIT): Representation Limits and Generalization of Graph Neural Networks details)

    TITLE
    Representation Limits and Generalization of Graph Neural Networks

    ABSTRACT
    Graphs provide a natural abstraction to model relational and strategic data in domains as diverse as biology (e.g., molecules), multiagent settings (e.g., online vendors on ecommerce platforms), and distributed systems (e.g., Internet). Graphs also find much use as theoretical objects (e.g., probabilistic graphical models), and several important algorithms (e.g., max-flow for image segmentation) can be invoked when tasks are formulated in terms of graphs. Graph neural networks are naturally suited for making predictions based on graphs but they remain poorly understood in terms of what they can and cannot do. We analyze whether GNNs can distinguish graphs that differ in properties such as cycles, but have similar local structure. We also provide the first data dependent generalization bounds for message passing GNNs.

    The talk will be based on recent work with Stefanie Jegelka and Tommi Jaakkola.

    RELATED WORK
    ICML 2020: https://people.csail.mit.edu/tommi/papers/GJJ_ICML2020.pdf

    BIO
    Vikas Garg is a PhD student in Computer Science at MIT. His research interests in machine learning include generative models, graphical models, theory of deep learning, and learning under uncertainty or resource constraints along with their intersections with optimization and game theory.

  • (8/14, 12:00pm): Wang-Chiew Tan (Megagon Labs): Unleashing the Power of Subjective Data: Managing Experiences as First-Class Citizens details)

    TITLE
    Unleashing the Power of Subjective Data: Managing Experiences as First-Class Citizens

    ABSTRACT
    Subjective data refers to data that contains opinions and experiences. Such data is ubiquitous in product reviews, tweets, and discussion forums in social media. Consumers today spend considerable time sifting through subjective data to make informed decisions about purchases. At Megagon Labs, we are building technologies to synthesize knowledge from subjective data and to facilitate searching over them.
    In this talk, I will overview some of the technologies we have developed in the directions of subjective data preparation and integration, experience discovery and search, subjective knowledge representation, and data analytics and exploration.

    RELEATED PAPERS
    * OpinionDigest: A Simple Framework for Opinion Summarization. https://arxiv.org/pdf/2005.01901.pdf. Yoshihiko Suhara, Xiaolan Wang, Stefanos Angelidis, Wang-Chiew Tan – ACL 2020
    * ExtremeReader: An Interactive Explorer For Customizable And Explainable Review Summarization. https://megagon.ai/wp-content/uploads/2020/05/ExtremeReader.pdf. Xiaolan Wang, Yoshihiko Suhara, Natalie Nuno, Yuliang Li, Jinfeng Li, Nofar Carmeli, Stefanos Angelidis, Eser Kindogan, Wang-Chiew Tan – WWW 2020
    * Deep Entity Matching with Pre-Trained Language Models. https://arxiv.org/pdf/2004.00584 Yuliang Li, Jinfeng Li, Yoshihiko Suhara, AnHai Doan, Wang-Chiew Tan

    BIO
    Wang-Chiew leads the research efforts at Megagon Labs with the goal of building advanced technologies to enhance search by experience. Her team conducts research on data integration, information extraction, text mining and summarization, knowledge base construction and commonsense reasoning, and data visualization. Prior to that, she was a Professor of Computer Science at University of California, Santa Cruz. She also spent two years at IBM Research - Almaden. She received her B.Sc. (First Class) in Computer Science from the National University of Singapore and her Ph.D. in Computer Science from the University of Pennsylvania.
    https://wangchiew.github.io

  • (8/21, 12:00pm): Predrag Radivojac (Khoury): Graph kernels details)

    TITLE
    Graph kernels

    ABSTRACT
    We will discuss basics of learning on graphs via graph kernels. We will first cover basics of kernel methods, as well as graph classification and node classification with graph kernels. We will then extend graph kernel methods to edge classification and link prediction problems in an attempt to unify these problems as vertex classification on hypergraphs, a problem we will solve by developing edit-distance hypergraphlet kernels. We will show results on both directed and undirected graphs from social sciences and computational biology, with primary focus on undirected graphs and biological applications.

    BIO
    Predrag Radivojac is a Professor of Computer Science at Northeastern University, where he recently moved from Indiana University. Prof. Radivojac received his Bachelor's and Master's degrees in Electrical Engineering from the University of Novi Sad and University of Belgrade, Serbia. His Ph.D. degree is in Computer Science from Temple University (2003) under the direction of Prof. Zoran Obradovic and co-direction of Prof. Keith Dunker. In 2004 he held a post-doctoral position in Keith Dunker's lab at Indiana University School of Medicine, after which he joined Indiana University Bloomington. Prof. Radivojac's research is in the areas of computational biology and machine learning with specific interests in protein function, MS/MS proteomics, genome interpretation, and precision health. He received the National Science Foundation (NSF) CAREER Award in 2007 and is an August-Wilhelm Scheer Visiting Professor at Technical University of Munich (TUM) as well as an honorary member of the Institute for Advanced Study at TUM. At Indiana University, he was Associate Chair of the Department of Computer Science and a co-Director of all of Informatics and Data Science for the multi-campus Precision Health Initiative. Prof. Radivojac's projects have been regularly supported by NSF and National Institutes of Health (NIH). He is currently an Editorial Board member for the journal Bioinformatics, Associate Editor for PLoS Computational Biology, and serves his third term (elected) on the Board of Directors of the International Society for Computational Biology (ISCB).

  • (8/24, 12:00pm): Shantanu Jain (Khoury): Approaches and Algorithms for Positive Unlabeled Learning details)

    TITLE
    Approaches and Algorithms for Positive Unlabeled Learning

    ABSTRACT
    Positive Unlabeled (PU) learning is an extreme case of class imbalance in binary classification, where labeled data from the negative class is not available. Typically, unlabeled examples are used as a substitute for the negatives to train a non-traditional classifier that separates them from the positives. Surprisingly, the score function of a non-traditional classifier provides an optimal ranking of the inputs. However, for many other tasks such as 1) thresholding the score function to predict the hard class labels, 2) estimating the classifiers performance, and 3) obtaining well calibrated scores, treating the unlabeled examples as negatives, leads to significant bias. The bias can be corrected by the knowledge of the class proportion, making class proportion estimation the central task in PU learning. I will describe the theoretical issue of non-uniqueness (unidentifiability) in the estimation of the class proportion and identify key assumptions that make the estimation feasible. I will then present a nonparametric algorithm AlphaMax for practical estimation. Next, I’ll present an empirical risk minimization based criteria for PU classification. Finally, I will have an open ended discussion on deriving label propagation type algorithms for PU learning and extensions of PU learning to DAG structured multilabel classification.

    RELATED PAPERS
    - Nonparametric semi-supervised learning of class proportions: https://arxiv.org/abs/1601.01944
    - Estimating the class prior and posterior from noisy positives and unlabeled data, NIPS 2016: https://papers.nips.cc/paper/6168-estimating-the-class-prior-and-posterior-from-noisy-positives-and-unlabeled-data.pdf
    - Recovering True Classifier Performance in Positive-Unlabeled Learning, AAAI 2017: https://aaai.org/ocs/index.php/AAAI/AAAI17/paper/view/14990

    BIO
    Shantanu Jain is an associate research scientist in the Khoury College of Computer Sciences at Northeastern University. He is interested in the field of statistical modeling and machine learning. His research focuses on developing semi-supervised methods under data constraints for which standard approaches lead to biased estimates. His recent work has addressed issues in binary classification and its evaluation that arise due to the absence of labeled examples from one of the classes (positive-unlabeled learning), incorrectly labeled examples and bias in the labeled examples. His research has been applied to many bioinformatics problems and mass spectrometry data.

Spring 2020 (Wolfgang)

  • (1/24, 12:00pm): Fatemeh Nargesian (University of Rochester): Organizing Data Lakes for Navigation details)

    TITLE
    Organizing Data Lakes for Navigation

    ABSTRACT
    In this talk, I discuss data-driven discovery in data lakes and introduce the data lake organization problem, where the goal is to find a navigation structure that allows a user to most effectively navigate a data lake. An organization is defined as a graph that contains nodes representing sets of attributes within a data lake and edges indicating subset relationships between nodes. I present a new probabilistic model of how users interact with an organization and an approximate algorithm for efficiently constructing a data lake organization. I discuss quantitative results on both real data lakes containing data from open data portals and on benchmarks that emulate the observed characteristics of real data lakes. I also present the results of a user study showing that 42% of users preferred the use of navigation over keyword search, suggesting these are complementary and both useful modalities for data discovery in data lakes.

    BIO
    Fatemeh Nargesian is an assistant professor of computer science at the University of Rochester. Her primary research interests are in data discovery and data integration. She received her PhD from the University of Toronto.

  • (2/6, 1:35pm): BHS 030: Stratis Ioannidis (Northeastern): Learning from Comparisons details)

    TITLE
    Learning from Comparisons

    ABSTRACT
    We consider learning from comparisons, i.e., learning from a dataset containing pairs of samples and labels indicating their relative order. For example, a medical expert presented with a pair of patient files can order the pair w.r.t. the relative severity of a disease. Comparisons are often less noisy than class labels: human labelers disagreeing when generating class judgments often exhibit reduced variability when asked to compare pairs of samples instead. Comparisons are also more informative, as they capture both inter and intra-class relationships; the latter are not revealed via class labels alone. We discuss inference algorithms in this setting, as well as means to identify which comparison labels to collect.

    BIO
    Stratis Ioannidis is an assistant professor in the Electrical and Computer Engineering Department of Northeastern University, in Boston, MA, where he also holds a courtesy appointment with the College of Computer and Information Science. He received his B.Sc. (2002) in Electrical and Computer Engineering from the National Technical University of Athens, Greece, and his M.Sc. (2004) and Ph.D. (2009) in Computer Science from the University of Toronto, Canada. Prior to joining Northeastern, he was a research scientist at the Technicolor research centers in Paris, France, and Palo Alto, CA, as well as at Yahoo Labs in Sunnyvale, CA. He is the recipient of an NSF CAREER Award, a Google Faculty Research Award, and a best paper award at ACM ICN 2017. His research interests span machine learning, distributed systems, networking, optimization, and privacy.

    RELATED PAPERS:
    - A Severity Score for Retinopathy of Prematurity Peng Tian, Yuan Guo, Jayashree Kalpathy-Cramer, Susan Ostmo, J. Peter Campbell, Michael F. Chiang, Jennifer Dy, Deniz Erdogmus, and Stratis Ioannidis. *Knowledge Discovery and Data Mining (KDD) *, Anchorage, AK, 2019.
    - Classification and Comparison via Neural Networks Peng Tian, Jennifer Dy, Deniz Erdogmus, James Brown, Jayashree Kalpathy-Cramer, Susan Ostmo, J. Peter Campbell, Michael F. Chiang, and Stratis Ioannidis. *Elsevier Journal of Neural Networks *, 2019.
    - Experimental Design Under the Bradley-Terry Model Yuan Guo, Peng Tian, Jayashree Kalpathy-Cramer, Susan Ostmo, J. Peter Campbell, Michael F. Chiang, Deniz Erdogmus, Jennifer Dy, and Stratis Ioannidis. *International Joint Conference on Artificial Intelligence (IJCAI) *, Stockholm, Sweden, 2018.
    - Accelerated Experimental Design for Pairwise Comparisons Yuan Guo, Jennifer Dy, Deniz Erdogmus, Jayashree Kalpathy-Cramer, Susan Ostmo, J. Peter Campbell, Michael F. Chiang, and Stratis Ioannidis. *SIAM International Conference on Data Mining (SDM) *, Calgary, Alberta, Canada, 2019.

  • (2/12, 1:35pm): BHS 030: Alina Oprea (Northeastern): Security analytics: the application of machine learning and artificial intelligence in cyber security details)

  • (2/19, 1:35pm): BHS 030: Cristina Nita-Rotaru (Northeastern): Designing (byzantine-)resilient distributed systems details)

    Title: Designing (byzantine-)resilient distributed systems

    Speaker: Cristina Nita-Rotaru

    Abstract: What makes bitcoin work? How do you build a highly-resilient infrastructure for the power-grid? How could a 43 seconds network partition have cascaded into an over a day of service unavailability for github?

    In this talk, I will describe several fundamental concepts from distributed systems and show their applications to blockchains, resilient infrastructure, and cloud services. I will highlight the main challenges when building such systems and describe several ongoing research projects in my group.

    Bio: Cristina Nita-Rotaru is a Professor of Computer Science in the Khoury College of Computer Sciences at Northeastern University. Prior to joining Northeastern she was a faculty in the Department of Computer Science at Purdue University from 2003 to 2015. Her research lies at the intersection of security, distributed systems, and computer networks. The overarching goal of her work is designing and building secure and resilient distributed systems and network protocols, with assurance that their deployed implementations provide their security, resilience, and performance goals. Her work received several best paper awards in IEEE SafeThings 2019, NDSS 2018, ISSRE 2017, DSN 2015 as well as two IETF/IRTF Applied Networking Research Prize in 2018 and 2016 for her work on QUIC. Cristina Nita-Rotaru is a recipient of the NSF Career Award in 2006.

    Network and Distributed Systems Security Laboratory: https://nds2.ccs.neu.edu/

    Paper:https://ieeexplore.ieee.org/document/4624271

  • (2/26, 1:35pm): BHS 030: Predrag Radivojac (Northeastern): Deciphering molecular mechanisms of disease upon mutation via semi-suprvised learning details)

    Title: Deciphering molecular mechanisms of disease upon mutation via semi-suprvised learning

    Speaker: Predrag Radivojac

    Abstract: A major goal in computational biology is the development of algorithms, analysis techniques, and tools towards deep mechanistic understanding of life at a molecular level. In the process, computational biology must take advantage of the new developments in artificial intelligence and machine learning, and then extend beyond pattern analysis to provide testable hypotheses for experimental scientists. This talk will focus on our contributions to this process and relevant related work. We will first discuss the development of machine learning techniques for partially observable domains such as molecular biology; in particular, methods for accurate estimation of frequency of occurrence of hard-to-measure and rare events. We will show some identifiability results in parametric and nonparametric situations as well as how such frequencies can be used to correct estimated model accuracies. We will then show how these methods play key roles in inferring protein cellular roles and phenotypic effects of genomic mutations, with an emphasis on understanding the molecular mechanisms of human genetic disease. We further assessed the value of these methods in the wet lab where we tested the molecular mechanisms behind selected de novo mutations in a cohort of individuals with neurodevelopmental disorders. Finally, we will discuss implications on future research in machine learning, genome interpretation, and precision health.

    Bio: Predrag Radivojac joined Northeastern University as a Professor in the Khoury College of Computer Sciences. Prior to joining Northeastern he was a Professor of Computer Science at Indiana University Bloomington and Associate Chair in the Department of Computer Science.

    Prof. Radivojac’s primary research interests include computational biology and machine learning. He is motivated to improve our understanding of life at a molecular level and how molecular events affect higher level phenotypes. His group addresses such questions through the development of algorithms and analysis techniques related to the function of biological macromolecules, mass spectrometry proteomics, genome interpretation, and precision health; e.g., he is interested in elucidating the molecular mechanisms of disease consequent to genetic variation. In the area of machine learning, Prof. Radivojac’s research addresses foundational and applied problems in semi-supervised learning, structured-output learning, and active learning, and investigates topics such as kernels and distance functions (e.g., metrics) across data types and analysis techniques. He is also interested in performance evaluation of machine learning algorithms, especially in the hierarchical structured-output domains and cases of selection bias that often arise in the open world setting.

    Prof. Radivojac received the National Science Foundation CAREER Award in 2007 and is an August-Wilhelm Scheer Visiting Professor at Technical University of Munich (TUM) as well as an honorary member of the Institute for Advanced Study at TUM. Prof. Radivojac co-directed all of data sciences and informatics within the multi-campus Precision Health Initiative of Indiana University. He is currently an Editorial Board member for the journal Bioinformatics, Associate Editor for PLoS Computational Biology, and serves his third (elected) term on the Board of Directors of the International Society for Computational Biology (ISCB).

    Paper: http://biorxiv.org/content/early/2017/05/09/134981

  • (2/26, 2:50pm): WVH 362: Evimaria Terzi (Boston University): Simple models for optimizing driver earnings in ride-sharing platforms details)

    TITLE
    Simple models for optimizing driver earnings in ride-sharing platforms

    SPEAKER
    Evimaria Terzi

    ABSTRACT
    On-demand ride-hailing platforms like Uber and Lyft are helping reshape urban transportation, by enabling car owners to become drivers for hire with minimal overhead. Such platforms are a multi-sided market and offer a rich space for  studies with socio-economic implications. 

    In this talk I am going to address two questions:

    1. In the absence of coordination, what is the best course of action for a self-interested driver that wants to optimize his earnings?

    2. In the presence of coordination, is it possible to maximize social welfare objectives in an environment where the objectives of the participants (drivers, customers and the platform) are (often) misaligned?

    We will discuss the computational problems behind these problems and describe simple algorithmic solutions that work extremely well in practice. We will demonstrate the practical strength of our approaches with well-designed experiments on novel datasets we collected from such platforms.

    BIO
    Evimaria Terzi is an Associate Professor at the Computer Science Department of Boston University.  She joined the department in 2009 after spending two years as a Research Staff Member at the IBM Almaden Research Center. Evimaria got her PhD from the University of Helsinki in 2007.  Her research interests span a big range of data-centered problems. Currently she is interested in problems related to team formation, ride-sharing platforms as well as recommender systems.  He research is funded by multiple NSF grants and she was a recipient of the NSF CAREER award (2013) and Microsoft Faculty Fellowship (2010).

  • (2/27, 1:35pm): BHC 030: Jon Ullman (Northeastern): (Preventing) Overfitting in Adaptive Data Analysis details)

    Title: (Preventing) Overfitting in Adaptive Data Analysis

    Speaker: Jon Ullman

    Abstract: How can we use a random sample to draw meaningful conclusions about populations, without overfitting to the sample? This problem remains vexing despite decades of study by statisticians. One cause of overfitting is so-called adaptive data analysis---the common practice where datasets are re-used for multiple analyses, and the outcome of one analysis can influence the choice of future analyses. Adaptivity can arise in a range of ways, including data shared across research groups, multi-stage statistical methods, and data science competitions with shared validation data. Unfortunately, adaptivity invalidates traditional methods for preventing overfitting because it breaks a fundamental assumption about the independence of the data and the method.

    In this talk I will introduce a relatively recent approach to understanding and preventing false discovery in adaptive data analysis based on the concept of algorithmic stability. Specifically, I will introduce and motivate the problem of adaptive data analysis, describe a model for studying this problem, and demonstrate how overfitting can occur and, to some extent, be mitigated in this model.

    Related blog:  http://blog.mrtz.org/2015/03/09/competition.html

    Paper:  https://arxiv.org/abs/1511.02513

  • (3/11, 3:45pm): 362 WVH: Stavros Sintos (Duke University): Efficient algorithms for querying uncertain data details)

    TITLE
    Efficient Algorithms for Querying Uncertain Data

    ABSTRACT
    Motivated by applications in sensor network, data cleaning, and data integration, there has been a growing interest in managing uncertain data. Uncertainty is typically captured using stochastic data models, and querying data requires either statistics about the probabilistic behavior of the underlying data, or data cleaning to reduce the uncertainty of the query answer. In the first part of the talk I will present efficient indexes for answering range-max queries on uncertain data. Given a collection of uncertain points in the d-dimensional space, where each point is associated with a value, the goal is to preprocess the points to an index such that given a query rectangle R, compute the expected maximum or the most likely maximum of the values of the points that lie in R. Next, I will present data cleaning problems with applications in Fact-Checking. Given a claim, made by a politician, over uncertain data our goal is to ascertain the quality of the claim, which is measured by a function f. We propose algorithms for choosing a subset of objects to clean with cost at most C to i) minimize the variance of f or ii) maximize the surprise of f. I will conclude by discussing current research and future research directions.

    BIO
    Stavros Sintos is a Ph.D. candidate in the Department of Computer Science at Duke University under supervision of Prof. Pankaj K. Agarwal. His main research interest is in the design of efficient algorithms for problems in databases, data mining, and computational geometry. In particular, he works on designing practical algorithms and indexes with theoretical guarantees for problems in data summarization and data uncertainty. An important aspect of his research has been on combining geometric optimization with query processing. He is also a recipient of the James B. Duke Fellowship. Before joining Duke he obtained his BS in the Department of Computer Science at University of Ioannina in Greece. For more details please see http://www.cs.duke.edu/~ssintos.

    RELATED PAPERS
    * Range-Max Queries on Uncertain Data (PODS 2016)
    https://dl.acm.org/doi/10.1145/2902251.2902281
    *Selecting Data to Clean for Fact Checking: Minimizing Uncertainty vs. Maximizing Surprise (PVLDB 2019)
    http://www.vldb.org/pvldb/vol12/p2408-sintos.pdf
    * Other papers: https://sites.google.com/view/stavros-sintos/publica

  • (3/12, 1:35pm): Zoom: Tina Eliassi-Rad (Northeastern): Network science details)

    Title: Geometric and Topological Graph Analysis for Machine Learning

    Speaker: Tina Eliassi-Rad

    Abstract:This talk has two parts: (1) geometric analysis for graph embedding and (2) topological analysis for graph distances. First, graph embedding seeks to build an accurate low-dimensional representation of a graph. This low-dimensional representation is then used for various downstream tasks such as link prediction. One popular approach is Laplacian Eigenmaps, which constructs a graph embedding based on the spectral properties of the Laplacian matrix of a graph. The intuition behind it, and many other embedding techniques, is that the embedding of a graph must respect node similarity: similar nodes must have embeddings that are close to one another. We dispose of this distance-minimization assumption. In its place, we use the Laplacian matrix to find an embedding with geometric properties (instead of spectral ones) by leveraging the simplex geometry of the graph. We introduce Geometric Laplacian Eigenmap Embedding (or GLEE for short) and demonstrate that it outperforms various other techniques (including Laplacian Eigenmaps) in the tasks of graph reconstruction and link prediction. This work is joint with Leo Torres and Kevin Chan, and is forthcoming in the Journal of Complex Networks. Second, measuring graph distance is a fundamental task in graph mining. For graph distance, determining the structural dissimilarity between networks is an ill-defined problem, as there is no canonical way to compare two networks. Indeed, many of the existing approaches for network comparison differ in their heuristics, efficiency, interpretability, and theoretical soundness. Thus, having a notion of distance that is built on theoretically robust first principles and that is interpretable with respect to features ubiquitous in complex networks would allow for a meaningful comparison between different networks. We rely on the theory of the length spectrum function from algebraic topology, and its relationship to the non-backtracking cycles of a graph, in order to introduce the Non-Backtracking Spectral Distance (NBD) for measuring the distance between undirected, unweighted graphs. NBD is interpretable in terms of features of complex networks such as presence of hubs and triangles. We showcase the ability of NBD to discriminate between networks in both real and synthetic data sets. This work is joint with Leo Torres and Suarez-Serrato, and was published in the Journal of Applied Network Science in June 2019.

    Bio: Tina Eliassi-Rad is an Associate Professor of Computer Science at Northeastern University in Boston, MA. She is also a core faculty member at Northeastern University's Network Science Institute. Prior to joining Northeastern, Tina was an Associate Professor of Computer Science at Rutgers University; and before that she was a Member of Technical Staff and Principal Investigator at Lawrence Livermore National Laboratory. Tina earned her Ph.D. in Computer Sciences (with a minor in Mathematical Statistics) at the University of Wisconsin-Madison. Her research is rooted in data mining and machine learning; and spans theory, algorithms, and applications of big data from networked representations of physical and social phenomena. She has over 100 peer-reviewed publications (including a few best paper and best paper runner-up awardees); and has given over 200 invited talks and 14 tutorials. Tina's work has been applied to personalized search on the World-Wide Web, statistical indices of large-scale scientific simulation data, fraud detection, mobile ad targeting, cyber situational awareness, and ethics in machine learning. Her algorithms have been incorporated into systems used by the government and industry (e.g., IBM System G Graph Analytics) as well as open-source software (e.g., Stanford Network Analysis Project). In 2017, Tina served as the program co-chair for the ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (a.k.a. KDD, which is the premier conference on data mining) and as the program co-chair for the International Conference on Network Science (a.k.a. NetSci, which is the premier conference on network science). In 2020, she is serving as the program co-chair for the International Conference on Computational Social Science (a.k.a. IC2S2, which is the premier conference on computational social science). Tina received an Outstanding Mentor Award from the Office of Science at the US Department of Energy in 2010; and became a Fellow of the ISI Foundation in Turin Italy in 2019.

    Paper 1: http://eliassi.org/papers/appliednetsci19_nbd.pdf
    Paper 2: https://arxiv.org/pdf/1905.09763.pdf
    Slides: http://eliassi.org/tina_fb_feb2020.pdf

  • (4/1, 1:35pm): Zoom: David Lazer: Fake news on Twitter during the 2016 U.S. presidential election details)

    Title: Fake news on Twitter during the 2016 U.S. presidential election

    Speaker:  David Lazer

    Abstract:The spread of fake news on social media became a public concern in the United States after the 2016 presidential election. We examined exposure to and sharing of fake news by registered voters on Twitter and found that engagement with fake news sources was extremely concentrated. Only 1% of individuals accounted for 80% of fake news source exposures, and 0.1% accounted for nearly 80% of fake news sources shared. Individuals most likely to engage with fake news sources were conservative leaning, older, and highly engaged with political news. A cluster of fake news sources shared overlapping audiences on the extreme right, but for people across the political spectrum, most political news exposure still came from mainstream media outlets.

    Paper: https://northeastern-datalab.github.io/cs3950/download/lazer.paper.pdf

  • (4/3, 12:00pm): Zoom: Yanlei Diao (UMass Amherst): AIDEme -- An active learning based system for interactive exploration of large datasets details)

    TITLE
    AIDEme: An active learning based system for interactive exploration of large datasets

    ABSTRACT
    There is an increasing gap between fast growth of data and limited human ability to comprehend data. Consequently, there has been a growing demand for analytics tools that can bridge this gap and help the user retrieve high-value content from data. In this talk, I introduce AIDEme, a scalable interactive data exploration system for efficiently learning a user interest pattern over a large dataset. The system is cast in a principled active learning (AL) framework, which iteratively presents strategically selected records for user labeling, thereby building an increasingly-more-accurate model of the user interest. However, existing AL techniques experience slow convergence when learning the user interest on large datasets. To overcome the problem, AIDEme explores properties of the user labeling process and the class distribution of observed data to design new AL algorithms, which come with provable results on model accuracy and approximation, and have evaluation results showing much improved convergence over existing AL methods while maintaining interactive speed.

    BIO
    Yanlei Diao is Professor of Computer Science at the University of Massachusetts Amherst, USA, and Ecole Polytechnique, France. Her research interests lie in big data analytics and scalable intelligent information systems, with a focus on interactive data exploration, explainable anomaly detection, optimization in cloud analytics, data streams, and uncertain data management. She received her PhD in Computer Science from the University of California, Berkeley in 2005.

    Prof. Diao was a recipient of the 2016 ERC Consolidator Award, 2013 CRA-W Borg Early Career Award (one female computer scientist selected each year for outstanding contributions), IBM Scalable Innovation Faculty Award, and NSF Career Award. She spoke at the Distinguished Faculty Lecture Series at the University of Texas at Austin and Technische Universitaet Darmstadt. She has served as Editor-in-Chief of the ACM SIGMOD Record, Associate Editor of ACM TODS, Chair of the ACM SIGMOD Research Highlight Award Committee, and member of the SIGMOD and PVLDB Executive Committees. She was PC Co-Chair of IEEE ICDE 2017 and ACM SoCC 2016, and served on the organizing committees of SIGMOD, PVLDB, and CIDR, as well as on the program committees of many international conferences and workshops.

    RELEVANT PUBLICATIONS
    * A Factorized Version Space Algorithm for" Human-In-the-Loop" Data Exploration, ICDM 2019
    https://scholar.google.com/scholar?cluster=10867783028515952622
    * AIDEme: An active learning based system for interactive exploration of large datasets, NeurIPS 2019
    https://scholar.google.com/scholar?cluster=9717227470572420687
    * Optimization for Active Learning-based Interactive Database Exploration, VLDB 2018
    https://scholar.google.com/scholar?cluster=10296364795678543901

Summer/Fall 2019 (Wolfgang)

  • (5/10, 12:00pm): Predrag Radivojac (Khoury / Northeastern): A new class of metrics for learning on real-valued and structured data details)

    TITLE
    A new class of metrics for learning on real-valued and structured data

    ABSTRACT
    We propose a new class of metrics on sets, vectors, and functions that can be used in various stages of data mining, including exploratory data analysis, learning, and result interpretation. These new distance functions unify and generalize some of the popular metrics, such as the Jaccard and bag distances on sets, Manhattan distance on vector spaces, and Marczewski-Steinhaus distance on integrable functions. We prove that the new metrics are complete and show useful relationships with $f$-divergences for probability distributions. To further extend our approach to structured objects such as ontologies, we introduce information-theoretic metrics on directed acyclic graphs drawn according to a fixed probability distribution. We conduct empirical investigation to demonstrate the effectiveness on real-valued, high-dimensional, and structured data. Overall, the new metrics compare favorably to multiple similarity and dissimilarity functions traditionally used in data mining, including the Minkowski ($L^p$) family, the fractional $L^p$ family, two $f$-divergences, cosine distance, and two correlation coefficients. We provide evidence that they are particularly appropriate for rapid processing of high-dimensional and structured data in distance-based learning.

    WEB PAGE
    Predrag Radivojac

    RELATED PAPER
    https://doi.org/10.1007/s10618-019-00622-6
    https://www.ccs.neu.edu/home/radivojac/papers/yang_dataminknowldisc_2019.pdf

  • (5/15, 12:00pm): Leilani Battle (University of Maryland): Behavior-Driven Optimizations for Big Data Exploration details)

    TITLE
    Behavior-Driven Optimizations for Big Data Exploration

    ABSTRACT
    With massive amounts of data coming from field sensors, sequencers, mobile devices, etc., data-driven decision making has become increasingly important in industry, government, and the sciences. One of the key issues for analysts and scientists who work with large datasets is efficient visualization of their data to extract patterns, observe anomalies, and debug their analysis workflows. Though a variety of visualization tools exist to help people make sense of their data, these tools often rely on database management systems (or DBMSs) for data processing and storage; and unfortunately, DBMSs fail to process the data fast enough to support a fluid, interactive visualization experience.

    My work blends optimization techniques from databases and methodology from HCI and visualization in order to support interactive exploration of large datasets. In this talk, I will first discuss Sculpin, a visual exploration system that learns user exploration patterns automatically, and exploits these patterns to pre-fetch data ahead of users as they explore. I will then discuss ongoing work to better understand how analysts explore data in more sophisticated analysis systems, such as Tableau Desktop. Finally, I will report on ongoing efforts to standardize the way we evaluate visual data analysis systems in general.

    RELATED PAPERS
    SIGMOD 2016: http://www.cs.umd.edu/~leilani/static/papers/forecache_cr_sigmod2016.pdf
    HILDA 2018: http://www.cs.umd.edu/~leilani/static/papers/battle_visualization-centered-evaluation_hilda_2018.pdf

    BIO
    Leilani Battle is an Assistant Professor at the University of Maryland, College Park, with a joint appointment in the University of Maryland Institute for Advanced Computer Studies (UMIACS). She is also affiliated with the UMD Human-Computer Interaction Laboratory (HCIL). Her research interests focus on developing interactive data-intensive systems that can aid analysts in performing complex data exploration and analysis. Her current research is anchored in the field of databases, but utilizes research methodology and techniques from HCI and visualization to integrate data processing (databases) with interactive interfaces (HCI, visualization). She is an NSF Graduate Research Fellowship Recipient (2012), and her research is currently supported by an NSF CISE CRII Award (2019). In 2017, she completed a postdoc in the UW Interactive Data Lab. She holds an MS (2013) and PhD (2017) in Computer Science from MIT, where she was a member of the MIT Database Group, and a BS in Computer Engineering from UW (2011), where she was a member of the UW database group.

  • (5/31, 12:00pm): Raymond Chi-Wing Wong (The Hong Kong University of Science and Technology): FindYourFavorite: An Interactive System for Finding the User's Favorite Tuple in the Database details)

    TITLE
    FindYourFavorite: An Interactive System for Finding the User's Favorite Tuple in the Database

    ABSTRACT
    When faced with a database containing millions of tuples, an end user might be only interested in finding his/her favorite tuple in the database. In this talk, we study how to help an end user to find such a favorite tuple with a few user interations. In each interaction, a user is presented with a small number of tuples (which can be artificial tuples outside the database or true tuples inside the database) and s/he is asked to indicate the tuple s/he favors the most among them. Different from the previous work which displays artificial tuples to users during the interaction and requires heavy user interactions, we achieve a stronger result. Specifically, we use a concept, called the utility hyperplane, to model the user preference and an effective pruning strategy to locate the favorite tuple for a user in the whole database. Based on these techniques, we developed an interactive system, called FindYourFavorite, and demonstrate that the system could identify the favorite tuple for a user with a few user interactions by always displaying true tuples in the database.

    RELATED PAPER:
    SIGMOD 2018: http://home.cse.ust.hk/~raywong/paper/sigmod18-regret.pdf

    BIOGRAPHY
    Raymond Chi-Wing Wong is an Associate Professor in Computer Science and Engineering (CSE) of The Hong Kong University of Science and Technology (HKUST). He is currently the director of the Risk Management and Business Intelligence (RMBI) program. He was the director of the Computer Engineering (CPEG) program from 2014 to 2016 and was the associate director of the Computer Engineering (CPEG) program from 2012 to 2014. He received the BSc, MPhil and PhD degrees in Computer Science and Engineering in the Chinese University of Hong Kong (CUHK) in 2002, 2004 and 2008, respectively. In 2004-2005, he worked as a research and development assistant under an R&D project funded by ITF and a local industrial company called Lifewood. From May 2006 to Aug 2006, he was a visiting student of Prof. Jian Pei and Prof. Ke Wang, at Simon Fraser University in Canada. From Aug 2007 to Sept 2007, he visited IBM T.J. Watson Research Center as a summer intern under the supervision of Prof. Philip S. Yu. From Jun 2008 to Jul 2008, he visited Prof. Tamer Ozsu at University of Waterloo as a visiting scholar. Some of his collaborators are Prof. Ada Fu (2003-2013), Prof. Ke Wang (2003-2013), Prof. Philip S. Yu (2009-2013), Prof. Jian Pei (2006-2011), Prof. Tamer Ozsu (2009-2011), Prof. Jiuyong Li (2006-2011), Prof. Yufei Tao (2007-2009), Prof. Ihab Ilyas (2009), Prof. Jeffrey Yu (2009),Prof. Lei Chen (2008) and Prof. Eamonn Keogh (2008). He received 28 awards. He published 69 conference papers (e.g., SIGMOD, SIGKDD, VLDB, ICDE and ICDM), 31 journal/chapter papers (e.g., TODS, DAMI, TKDE, VLDB journal and TKDD) and 1 book. He reviewed papers from conferences and journals related to data mining and database, including VLDB conference, SIGMOD, TODS, VLDB Journal, TKDE, TKDD, ICDE, SIGKDD, ICDM, DAMI, DaWaK, PAKDD, EDBT and IJDWM. He is a program committee member of conferences, including SIGMOD, VLDB, ICDE, KDD, ICDM and SDM, and a referee of journals, including TODS, VLDBJ, TKDE, TKDD, DAMI and KAIS.

  • (6/21, 12:00pm): Wolfgang Gatterbauer (Khoury): Anytime Approximation in Probabilistic Databases via Scaled Dissociations details)

    TITLE
    Anytime Approximation in Probabilistic Databases via Scaled Dissociations

    ABSTRACT
    Speeding up probabilistic inference remains a key challenge in probabilistic databases (PDBs) and the related area of statistical relational learning (SRL). Since computing probabilities for query answers is #P-hard, even for fairly simple conjunctive queries, both the PDB and SRL communities have proposed a number of approximation techniques over the years. The two prevalent techniques are either (i) MCMCstyle sampling or (ii) branch-and-bound (B&B) algorithms that iteratively improve model-based bounds using a combination of variable substitution and elimination.
    We propose a new anytime B&B approximation scheme that encompasses all prior model-based approximation schemes proposed in the PDB and SRL literature. Our approach relies on the novel idea of “scaled dissociation” which can improve both the upper and lower bounds of existing model-based algorithms. We apply our approach to the well-studied problem of evaluating self-join-free conjunctive queries over tuple-independent PDBs, and show a consistent reduction in approximation error in our experiments on TPC-H, Yago3, and a synthetic benchmark setting.

    Based on joint work with Maarten Van den Heuvel, Peter Ivanov, Floris Geerts, and Martin Theobald.

    RELATED PAPER:
    SIGMOD 2019: http://gatterbauer.name/download/sigmod2019_anytime_approximations_scaled_dissociations.pdf

  • (9/20, 12:00pm): Oktie Hassanzadeh (IBM): Knowledge Graphs for Data Curation & Data Science details)

    TITLE
    Knowledge Graphs for Data Curation & Data Science

    ABSTRACT
    In this informal talk, I will go over a number of projects at IBM Research that leverage knowledge graphs for large-scale data exploration, data curation, and data science. In particular, I will describe an application of knowledge graphs for large-scale data exploration and curation in the context of the IBM Research Helix project, where large-scale similarity analysis and automated linkage discovery enable the discovery and construction of data sets from a large number of data sets (or a data lake) for a given analytics task. I will then describe our ongoing project and recent progress on using knowledge graphs for automating data science. I will discuss a number of ways knowledge graphs can assist with feature engineering and the challenges we faced in implementing a prototype solution. If time permits, I will briefly go over some of our recent work on knowledge extraction from text.

    RELATED WORK
    * Automated Feature Enhancement for Predictive Modeling using External Knowledge, ICDM 2019 (Demo Track, To Appear)
    * Exploring Big Data with Helix: Finding Needles in a Big Haystack. SIGMOD Record 43(4): 43-54 (2014) https://doi.org/10.1145/2737817.2737829
    * Instance-Based Matching of Large Ontologies Using Locality-Sensitive Hashing. International Semantic Web Conference (1) 2012: 49-64 https://doi.org/10.1007/978-3-642-35176-1_4
    * Inducing Implicit Relations from Text Using Distantly Supervised Deep Nets. International Semantic Web Conference (1) 2018: 38-55 https://doi.org/10.1007/978-3-030-00671-6_3 (ISWC 2017 Semantic Web Challenge Winning Solution, see blog post here: https://www.ibm.com/blogs/research/2017/11/knowledge-base-construction-iswc-2017/)
    * Answering Binary Causal Questions Through Large-Scale Text Mining: An Evaluation Using Cause-Effect Pairs from Human Experts. IJCAI 2019: 5003-5009 https://doi.org/10.24963/ijcai.2019/695

    BIOGRAPHY
    Dr. Oktie Hassanzadeh is a Research Staff Member at IBM Research. He has received several academic and corporate awards for his work in data management and knowledge graphs. He is a recipient of the top prize at the prestigious Semantic Web Challenge at ISWC 2017, as well as two best-paper awards at ESWC 2018 and ESWC 2016 conferences. He has received his M.Sc. and Ph.D. degrees from the University of Toronto, where he received the IBM PhD fellowship and the Yahoo! Key Scientific Challenges awards. He is also a two-time recipient of the first prize at the Triplification Challenge at the SEMANTiCS Conference for his projects in the areas of Semantic Technologies and Linked Data. For more information, refer to his home page: http://researcher.watson.ibm.com/person/us-hassanzadeh

  • (9/27, 12:00pm): Boris Glavic (Illinois Institute of Technology): Relevance-based Data Management details)

    TITLE
    Relevance-based Data Management

    ABSTRACT
    In this talk, I give an overview of my group's work on relevance-based data management, a novel data management paradigm centered around the idea of analyzing what data is needed to answer queries. For the most part, the talk will focus on one particular approach which we refer to as provenance-based data skipping. Since the inception of relational databases, index structures and other physical design techniques have been used to speed-up access to data. In spite of the wide variety of mature techniques that are available, there are important classes of selective queries, e.g., top-k queries, which cannot benefit from these techniques. This issue arises because database systems statically analyze incoming queries to determine which data is needed for answering which query. However, for many queries it is impossible to determine up-front what data is relevant. Provenance-based data skipping overcomes this limitation by generating provenance sketches which are concise encodings of what data is relevant for a query. Similar to query answering with views, a provenance sketch captured for one query is used to speed up the subsequent evaluation of other queries. In contrast to materialized views, provenance sketches are guaranteed to be small (100s of bytes) and can benefit from physical database design artifacts such as indexes, horizontal partitioning of tables, and zone maps. Furthermore, I will highlight open problems in provenance-based data skipping and discuss additional applications of relevance-based data management techniques in general.

    RELATED PAPER:
    IEEE Data Engineering Bulletin: GProM - A Swiss Army Knife for Your Provenance Needs IEEE Data Engineering Bulletin http://sites.computer.org/debull/A18mar/p51.pdf

    BIOGRAPHY
    Boris Glavic is an Associate Professor of Computer Science at the Illinois Institute of Technology where he leads the IIT DBGroup (http://www.cs.iit.edu/~dbgroup/ ). Before coming to IIT, Boris spend to two years as a PostDoc at the University of Toronto working with Renée J. Miller. He received his PhD from the University of Zurich in Switzerland being advised by Michael Böhlen and Gustavo Alonso. Boris is a database researcher building comprehensive systems based on solid theoretical foundations. His main research interests are provenance, data integration and cleaning, and query processing and optimization.

  • (10/11, 12:00pm): Owen O'Malley (Cloudera): Big Data's journey to ACID details)

    TITLE
    Big Data's journey to ACID

    ABSTRACT
    Originally, the Hadoop ecosystem had no ability to handle updates and the user had to completely re-write the data to update it. Over the years, the community built the ability to have point updates in key-value stores (eg HBase) and adding new partitions of data via Hive. Five years ago, we added the first ability for Hive to use standard SQL operations to update your data. Recently, there have been multiple new systems to address similar needs: Apache Iceberg, Apache Hudi, and Databricks Delta. I'll talk about compare the new systems to Hive ACID, discussing their trade offs and approaches.

    BIOGRAPHY
    Owen O’Malley was a founder and technical fellow at Hortonworks. Hortonworks was founded in 2011 with 25 employees spinning out from Yahoo, went public in 2014, and grew to 1,400 employees before merging with Cloudera in 2019. Cloudera’s platform includes Hadoop and the large ecosystem of big data open source tools that enterprises need for their data analytics. Owen has been working on Hadoop since the beginning of 2006, was the first committer added to the project, and used Hadoop to set the Gray sort benchmark in 2008 and 2009. Over the years, he was the technical lead for MapReduce, Hadoop Security, ORC file, and adding ACID transactions to Hive.
    https://www.linkedin.com/in/owenomalley/
    https://scholar.google.com/citations?user=04LfgIoAAAAJ

  • (10/11, 3:00pm): Nan Tang (Qatar Computing Research Institute, QCRI): Data Preparation meets Data Cleaning details)

    TITLE
    Data Preparation meets Data Cleaning

    ABSTRACT
    In this talk, I will first provide a brief overview of my research in data preparation, including data discovery, data cleaning and integration, and some systems (such as Data Civilizer, a project with MIT, and Columbus, a project with Wisconsin) to put different pieces together for data preparation. Afterwards, I will discuss why data visualization is important for data preparation, and what interesting research topics can happen when putting these two together, based on which I will present three works: (1) DeepEye: a visualization recommendation system, which was motivated by how to help users easily understand the discovered datasets in data preparation; (2) Interactive data cleaning for progressive visualization, which is to cheaply solve the problem (by 10-15 user interactions) that data visualization is bad if the data is very dirty; and (3) Cymphony: a generic system for collaborative task management, motivated by the practical scenarios that for many tasks (such as manual data cleaning, manually checking fake news), a team of users need to work collaboratively.

    RELATED PAPERS:
    ICDE 2018: http://da.qcri.org/ntang/pubs/icde2018deepeye.pdf
    EDBT 2018 vision paper: http://da.qcri.org/ntang/pubs/edbt2018.pdf
    SIGMOD 2018 demo: http://da.qcri.org/ntang/pubs/deepeyesigmod2018demo.pdf
    ICDE 2020 paper “Interactive cleaning for progressive visualization through composite questions”
    http://da.qcri.org/ntang/publications.html

    BIOGRAPHY
    Dr. Nan Tang is a senior scientist at Qatar Computing Research Institute (QCRI), HBKU, Doha, Qatar. Prior to joining QCRI, he has worked at Univeristy of Edinburgh (2010-2011) and CWI (2008-2010). He got his PhD. degree from The Chinese University of Hong Kong (2007). His research focuses on data preparation including data cleaning, entity matching, and building systems to support these tasks, and data visualization, especially on visualization recommendation and cleaning. He serves the community in many PCs, such as SIGMOD, PVLDB, ICDE, KDD. You can find more about me from my homepage: http://da.qcri.org/ntang/index.html

  • (12/4, 12:00pm): Wolfgang Gatterbauer (Khoury): Scalable Compatibility Estimation from Extremely Sparse Graphs details)

    TITLE
    Scalable Compatibility Estimation from Extremely Sparse Graphs

    ABSTRACT
    Node classification is an important problem of graph management and Belief Propagation (BP) methods, along with its linearization, have been widely used for graphs with arbitrary compatibilities between classes. These methods rely on a prior specified compatibility matrix between classes, which is commonly given by domain experts or heuristics. Thus an important unresolved questions is: Can we do better in deriving compatibilities from the actual data in a principled and scalable way? We address this problem by indeed estimating the compatibilities on graphs with very sparse labels (e.g., 1 in 10,000 nodes is labeled). We propose methods to perform this estimation in a fraction of time it later takes to label the remaining nodes, which makes it a cheap preprocessing step for any existing label propagation method. Our approach consists of two steps: first, we create multiple consistent and compact graph statistics; and second, perform estimation on these far smaller representations. We show that the labeling accuracy of our proposed estimator is comparable to using the "correct" compatibilities and that our estimator is orders of magnitude faster than standard approaches based on train-test sets.

  • (12/9, 12:00pm): Oliver Kennedy (U of Buffalo): Safe, Reusable Heuristic Data Transformation through Caveats details)

    TITLE
    Safe, Reusable Heuristic Data Transformation through Caveats

    ABSTRACT
    Through extensive surveys of data scientists in the field, we've identified numerous data transformation challenges, including time-series alignment, outlier management, entity resolution, parsing, and more. In each use case, a correct, fully general solution would have ben worthy of a PhD Thesis. By contrast, each of these use cases admitted a simple heuristic tailored to the specific dataset and analysis being performed, which could be scripted or applied manually in hours or even minutes. Although full generality is clearly not worth the price, dataset- and analysis- specific heuristics make it difficult to re-use the transformed dataset or transformation workflow.

    In this talk, I introduce a form of lightweight data annotation called Caveats that make it easier to re-use both heuristically transformed datasets and heuristic transformation workflows. Caveats combine "where" provenance with techniques from incomplete data management to mark specific data values and records as dependent on a data transformation heuristic --- marked data elements are subject to change if the heuristic is altered. I will show how caveats can be seamlessly integrated into heuristic transformation workflows, introduce our lightweight scheme for propagating caveats through SQL queries, and demonstrate our prototype notebook, Vizier, built around Caveats.

    RELATED PAPERS
    https://odin.cse.buffalo.edu/papers/2015/VLDB-lenses-final.pdf
    https://odin.cse.buffalo.edu/papers/2019/SIGMOD-UADBs.pdf
    https://odin.cse.buffalo.edu/papers/2019/submitted/CIDR-CrumbyNotebooks.pdf (to appear)
    https://vizierdb.info/ (try it out!)

    BIOGRAPHY
    Oliver Kennedy is an associate professor at the University at Buffalo. He earned his PhD from Cornell University in 2011 and now leads the Online Data Interactions (ODIn) lab, which operates at the intersection of databases and programming languages. Oliver is the recipient of an NSF CAREER award, UB's Exceptional Scholar Award, and the UB SEAS Early Career Teacher of the Year Award. Several of his papers have been invited to "Best of" compilations from SIGMOD and VLDB. The ODIn lab is currently exploring uncertain data management, just-in-time data structure design, and "small data" management, as well as implementing VizierDB, a reproducibility- and reusability-focused notebook.

Spring 2019 (Wolfgang)

  • (1/9, 12:00pm): Kathleen Fisher (Tufts): Using a Declarative Description Language to Tame Ad Hoc Data: An Overview of the PADS project details)

    TITLE
    Using a Declarative Description Language to Tame Ad Hoc Data: An Overview of the PADS project

    SPEAKER
    Kathleen Fisher (Tufts University)

    ABSTRACT
    The goal of the PADS project (http://www.padsproj.org) is to make it easier for data analysts to extract useful information from ad hoc data files. This talk gives an overview of the project and how it helps bridge the gap between the unmanaged world of ad hoc data and the managed world of typed programming languages and databases. In particular, the paper reviews the design of PADS data description languages, describes the generated parsing tools and discusses the importance of meta-data. It also sketches the formal semantics, discusses useful tools and how can they can be generated automatically from PADS descriptions, and describes an inferencing system that can learn useful PADS descriptions from positive examples of the data format.

    BIO
    Kathleen Fisher is a Professor in and the Chair of the Computer Science Department at Tufts. Previously, she was a program manager at DARPA where she started and managed the HACMS and PPAML programs, a Consulting Faculty Member in the Computer Science Department at Stanford University, and a Principal Member of the Technical Staff at AT&T Labs Research. Kathleen is an ACM Fellow. She has served as Program Chair for PLDI, OOPSLA, ICFP, CUFP, and FOOL, and as General Chair for ICFP 2015. She is a former Associate Editor for TOPLAS and a former editor of the Journal of Functional Programming. Kathleen is a past Chair of the ACM Special Interest Group in Programming Languages (SIGPLAN) and past Co-Chair of CRA's Committee on the Status of Women (CRA-W). Kathleen is a recipient of the SIGPLAN Distinguished Service Award. She is Vice Chair of DARPA's ISAT Study Group and a member of the Board of Trustees of Harvey Mudd College.

    RELATED PAPERS
    Kathleen Fisher, David Walker: The PADS project: an overview. ICDT 2011: 11-17
    https://dl.acm.org/citation.cfm?doid=1938551.1938556

  • (1/11, 12:00pm): Xiaoyu Liu (MIT): Kyrix: Interactive Visual Data Exploration at Scale details)

    TITLE
    Kyrix: Interactive Visual Data Exploration at Scale

    ABSTRACT
    Scalable interactive visual data exploration is crucial in many domains due to increasingly large datasets generated at rapid rates. Details-on-demand provides a useful interaction paradigm for exploring large datasets, where the user starts at an overview, finds regions of interest, zooms in to see detailed views, zooms out and then repeats. This paradigm is the primary user interaction mode of widely used systems such as Google Maps, Aperture Tiles and ForeCache. These earlier systems, however, are highly customized with hardcoded visual representations and optimizations. A more general framework is needed to facilitate the development of visual data exploration systems at scale. We present Kyrix, an end-to-end system for developing scalable details-on-demand data exploration applications. Kyrix provides the developer with a declarative model for easy specification of general visualizations. Behind the scenes, Kyrix utilizes a suite of performance optimization techniques to achieve a response time within 500 ms for various user interactions. We also report results from a performance study which shows that a novel dynamic fetching scheme adopted by Kyrix outperforms tile-based fetching used in traditional systems.

    BIO
    Xiaoyu is a research intern at MIT data system group. Before that, she completed her master at Purdue, ECE department. She is interested in building scalable and reliable data-intensive systems.

    RELATED WORK
    http://cidrdb.org/cidr2019/papers/p70-tao-cidr19.pdf
    http://dsail.csail.mit.edu/index.php/kyrix/

  • (1/23, 12:00pm): Panayiotis Tsaparas (University of Ioannina): Finding Patterns in Temporal and Flow Networks details)

    TITLE
    Finding Patterns in Temporal and Flow Networks

    ABSTRACT
    Networks are natural models for complex systems consisting of multiple interconnected entities. Many of these systems are dynamic, with connections appearing and disappearing over time. In the resulting networks, edges are annotated with time stamps, resulting in a graph history consisting of multiple network snapshots over time. We refer to such networks as temporal networks (graphs).

    In this talk we consider two knowledge extraction problems on temporal networks. First, we consider the problem of finding lasting dense subgraphs. We provide formal definitions of graph density for a temporal graph, and we formulate the BFF problem that seeks a subset of nodes that are densely connected over time. Furthermore, we also consider the problem of finding a set of graph snapshots in which there are dense subsets of nodes. We study the complexity of the problems and propose exact, approximation and heuristic algorithms. Experiments indicate that our approach can find interesting patterns in collaboration and word co-occurrence networks.

    In the second problem we assume that edges are also associated with a value, indicating a flow between the two nodes at a specific time stamp. Such networks appear naturally in practice in the case of money exchange networks (e.g., bitcoin), or traffic networks. We are interested in finding motifs in these networks, that is, small subgraphs that appear more often than random. Our motifs are restricted temporally and with respect to flow: all interactions must happen within a given time-window, and they must involve a minimum amount of flow. We propose efficient algorithms for finding such motifs and experiment with them on real networks.

    RELATED PAPERS
    Konstantinos Semertzidis, Evaggelia Pitoura, Evimaria Terzi, Panayiotis Tsaparas. Finding lasting dense subgraphs. ECML/PKDD Journal Track (DAMI), 2019 (to appear)
    https://link.springer.com/epdf/10.1007/s10618-018-0602-x?author_access_token=tBZlkys4uOpdBFnzu3GcJfe4RwlQNchNByi7wbcMAY4KONtc_c2_3ZKjrPQGAs_BqHQ9sjj8kKtAD3ES-jJm6otpAiLRF-1khuFNPaPKh6_139GtFUfjLEcx_dfP_y-AXowi-QCmJaCkYhkY1kDhJg%3D%3D
    Chrysanthi Kosyfaki, Nikos Mamoulis, Evaggelia Pitoura, Panayiotis Tsaparas. Flow motifs in interaction networks. EDBT 2019 (to appear).
    https://arxiv.org/abs/1810.08408

    BIO
    Panayiotis Tsaparas completed his undergraduate studies at the Department of Computer Science at University of Crete, Greece in 1995. He continued his graduate studies at University of Toronto, where he received his M.Sc., and Ph.D. degree, under the supervision of Allan Borodin. After graduation, he worked as a post-doctoral fellow at University of Rome, “La Sapienza”, and at University of Helsinki, and as a researcher at Microsoft Research. Since 2011 he joined the Department of Computer Science and Engineering at University of Ioannina, where he is now an Associate Professor. His research interests include Social Network Analysis, Algorithmic Data Mining, Web Mining and Information Retrieval.
    Web: http://www.cs.uoi.gr/~tsap/

  • (1/24, 12:00am): Northeast Database Day 2019 details)

  • (1/25, 1:15pm): Fatemeh Nargesian (University of Toronto): Table Union and Navigation details)

    TITLE
    Table Union and Navigation

    ABSTRACT
    Preparing data for advanced analytics is prohibitively time-consuming and expensive for all but the best-trained and best-funded engineers. Nevertheless, the success of trained systems often depends on data and generated features more than powerful statistical algorithms. Among the data management challenges that need to be addressed is data enrichment which is the discovery and integration of meaningful data in data lakes for a data science task. However, there are many challenges to overcome: (1) the sheer size of data, (2) the unique distributions and characteristics of data lakes, and (3) the probabilistic and human-in-the-loop nature of data discovery. In this talk, I discuss two prevalent data discovery scenarios. In the first scenario, the query is a dataset and the data scientist is interested in interactively finding datasets that can be integrated (e.g unioned) with the query. I will introduce a probabilistic framework for finding and aligning unionable tables with a query table and discuss the need for distribution-aware techniques for data discovery. In the second scenario, search does not start with a query, instead, it is data-driven. I will talk about data lake organization problem where the goal is to find an organization (a directed acyclic graph) that allows a user to most efficiently navigate data lakes. I will present a probabilistic navigation model of how users interact with an organization and introduce a scalable structure learning algorithm for optimizing data lake organizations.

    RELATED PAPER
    Nargesian, Zhu, Pu, Miller: Table Union Search on Open Data, PVLDB 2018.
    http://www.vldb.org/pvldb/vol11/p813-nargesian.pdf

    BIO
    Fatemeh Nargesian is a PhD student in the Data Curation Group of the Department of Computer at University of Toronto. Her research focuses on optimizing and automating data preparation for end-to-end data science, encompassing dataset discovery in data lakes and enriching datasets with new features. While at University of Toronto, Fatemeh was a joint Research intern at IBM Research-NY. Prior to University of Toronto, Fatemeh worked at the Clinical Informatics research group at McGill University on clinical data management, and received M.Sc. degrees in computer science and artificial intelligence at University of Toronto and Sharif University of Technology.

  • (1/28, 12:00pm): Stefano Ceri (Politecnico di Milano): WVH 366: Data-Driven Genomic Computing: Making Sense of the Signals from the Genome details)

    TITLE
    Data-Driven Genomic Computing: Making Sense of the Signals from the Genome

    ABSTRACT
    Genomic computing is a new science focused on understanding the functioning of the genome, as a premise to fundamental discoveries in biology and medicine. Next Generation Sequencing (NGS) allows the production of the entire human genome sequence at a cost of about 1000 US $; many algorithms exist for the extraction of genome features, or "signals", including peaks (enriched regions), variants, or gene expression (intensity of transcription activity). The missing gap is a system supporting data integration and exploration, giving a “biological meaning” to all the available information; such a system can be used, e.g., for better understanding cancer or how environment influences cancer development.

    The GeCo Project (Data-Driven Genomic Computing, ERC Advanced Grant, 2016-2021) has the objective or revisiting genomic computing through the lens of basic data management, through models, languages, and instruments, focusing on genomic data integration. Starting from an abstract model, we developed a system that can be used to query processed data produced by several large Genomic Consortia, including Encode and TCGA; the system employs internally the Spark engine, and prototypes can already be accessed from Polimi, from Cineca (Italian supercomputing center) and from the Broad Institute in Cambridge. During the five-years of the ERC project, the system will be enriched with data analysis tools and environments and will be made increasingly efficient. Among the objectives of the project, the creation of an “open source” repository of public data, available to biological and clinical research through queries, web services and search interfaces.

    BIO
    Stefano Ceri is professor of Database Systems at the Dipartimento di Elettronica, Informazione e Bioingegneria (DEIB) of Politecnico di Milano. His research work covers four decades (1978-2018) and has been generally concerned with extending database technologies in order to incorporate new features: distribution, object-orientation, rules, streaming data; with the advent of the Web, his research has been targeted towards the engineering of Web-based applications and to search systems. More recently he turned to genomic computing. He authored over 350 publications (H-index 75) and authored or edited 15 books in English. He is the recipient of two ERC Advanced Grants: "Search Computing (SeCo)" (2008-2013), focused upon the rank-aware integration of search engines in order to support multi-domain queries and “Data-Centered Genomic Computing (GeCo)” (2016-2021), focused upon new abstractions for querying and integrating genomic datasets. He is the recipient of the ACM-SIGMOD "Edward T. Codd Innovation Award" (New York, June 26, 2013), an ACM Fellow and a member of Academia Europaea.

    RELATED PUBLICATION
    http://www.bioinformatics.deib.polimi.it/geco/publications/Bioinformatics_2018.pdf

  • (1/28, 12:30pm): Marco Brambilla (Politecnico di Milano): Extraction of Evolving Knowledge from Social Media details)

    TITLE
    Extraction of Evolving Knowledge from Social Media

    ABSTRACT
    Knowledge in the world continuously evolves. Ontologies that aim at formalizing this knowledge are largely incomplete, especially regarding data belonging to the so-called long tail. On the other side, informal sources such has social media are typically very up to date with respect to facts, events and relations between real-world entities. We propose a method for discovering emerging knowledge by extracting it from social content. Once initialized by domain experts, the method is capable of finding relevant entities by means of a mixed syntactic-semantic method. The method uses seeds, i.e. prototypes of emerging entities provided by experts, for generating candidates; then, it associates candidates to feature vectors built by using terms occurring in their social content and ranks the candidates by using their distance from the centroid of seeds, returning the top candidates. Our method can run iteratively, using the results as new seeds. The talk will describe the different extraction techniques, the advantages obtained by combining them, and the results of the experiments performed with the different methods.

    RELATED PAPERS
    Extracting Emerging Knowledge from Social Media. WWW 2017 https://dl.acm.org/citation.cfm?id=3052697
    Iterative Knowledge Extraction from Social Networks. WWW Comp. 2018 https://dl.acm.org/citation.cfm?id=3191578

    BIO
    Marco Brambilla is associate professor at Politecnico di Milano. His research interests include data science, domain specific modeling languages and design patterns, crowdsourcing, social media monitoring, and big data analysis. He has been visiting researcher at CISCO, San Josè, and University of California, San Diego. He has been visiting professor at Dauphine University, Paris. He is co-founder of the startups Fluxedo, focusing on social media analysis and Social engagement, and WebRatio, devoted to software modeling tools for Web, Mobile and Business Process based software applications. He is author of various international books and research articles in journals and conferences, with over 200 papers. He was awarded various best paper prizes and gave keynotes and speeches at many conferences and organisations. He runs research projects on data science and industrial projects on data-driven innovation and big data. He is the main author of the OMG standard IFML.

  • (1/30, 12:00pm): Evimaria Terzi (BU): Active Matrix Completion Problems details)

    TITLE
    Active Matrix Completion Problems

    ABSTRACT
    In many applications, e.g., recommender systems and traffic monitoring, the data comes in the form of a matrix that is only partially observed. A fundamental data-analysis task for these datasets is matrix completion, where the goal is to accurately infer the entries missing from the matrix. In this talk we will consider matrices that satisfy one of the two — very common — assumptions: low rank and positive definite. In both cases we will consider the active version of the matrix-completion problem that says: given access to an oracle that can obtain some (small) number of entries of the original matrix which entries shall we query so that we achieve the least completion error. We will demonstrate how this question can be transformed into elegant combinatorial problems and we will discuss algorithms for solving them. We will also argue that our combinatorial formulations are very different from existing work in the matrix-completion literature.

    TALK BASED ON TWO PAPERS:

    Charalampos Mavroforakis, Dóra Erdös, Mark Crovella, Evimaria Terzi: Active Positive-Definite Matrix Completion. SDM 2017.
    https://scholar.google.com/scholar?cluster=15851698742463762316

    Natali Ruchansky, Mark Crovella, Evimaria Terzi: Matrix Completion with Queries. KDD 2015.
    https://scholar.google.com/scholar?cluster=18212580788830238017

    BIO
    Evimaria Terzi is an associate professor at the Computer Science Department at Boston University, where she also serves as an Associate Chair. Before joining BU in 2009, she was a research scientist at IBM Almaden Research Center. Evimaria has received her Ph.D. from University of Helsinki, Finland and her MSc from Purdue University. Evimaria is a recipient of the Microsoft Faculty Fellowship (2010) and NSF CAREER award and multiple NSF awards. Her research interests span a wide range of data-mining topics including algorithmic problems arising in recommendation systems, online social networks and social media.

  • (2/1, 12:00pm): Huy Ngyuen (Northeastern): Submodular Maximization with Nearly-optimal Approximation and Adaptivity details)

    TITLE
    Submodular Maximization with Nearly-optimal Approximation and Adaptivity

    ABSTRACT
    In this talk, we present recent progress on understanding the tradeoff between the approximation guarantee and adaptivity for submodular maximization. The adaptivity of an algorithm is the number of sequential rounds of queries it makes to the evaluation oracle of the function, where in every round the algorithm is allowed to make polynomially-many parallel queries. Adaptivity is an important consideration in settings where the objective function is estimated using samples and in applications where adaptivity is the main running time bottleneck. We present nearly-optimal algorithms for submodular maximization subject to a variety of constraints.

    RELATED PAPERS
    https://arxiv.org/abs/1804.05379
    https://arxiv.org/abs/1808.09987
    https://arxiv.org/abs/1812.01591

    BIO
    Huy Lê Nguyen is an Assistant Professor of Computer Science in the College of Computer and Information Science at Northeastern University. Prior to joining Northeastern, he was a Research Assistant Professor at the Toyota Technological Institute in Chicago and before that, a Google Research Fellow at the Simons Institute at University of California, Berkeley. He received his PhD in Computer Science from Princeton University. Professor Nguyen is broadly interested in the design and analysis of algorithms, with an emphasis on algorithmic techniques for massive data sets and machine learning.

  • (2/13, 12:00pm): Sarah Ostadabbas (Northeastern): Physics-based Simulation to Bootstrap Learning in Small Data Domains details)

    TITLE
    Physics-based Simulation to Bootstrap Learning in Small Data Domains

    ABSTRACT
    Deep learning (DL) regularly obliterate records for regression and classification tasks that have previously seen only incremental accuracy improvements. Furthermore, the training process is much more automated than classical techniques, no longer requiring a huge investment in feature selection and dataset pruning. However, this performance comes at a large data cost, frequently requiring upwards of 10^9 data/label pairs. Our digital economy has provided many problems for which such data exists or can be obtained cheaply relative to the benefits. There are many other fields that would get significant benefit from DL, but where data collection or labelling is expensive. This is very common for medical and military applications where data collection and/or labeling is expensive, individualized, and protected by very strong privacy or classification laws. Many applications will benefit from a learning framework with deep structure that works with limited labeled training samples, integrates domain-knowledge into the model, and maximizes the generalization of learning across domains. In this talk, I introduce our proposed 3-step framework that enables training of accurate and robust learning models under data limitation constraints based the use of simulation as its generative models. This framework includes: (i) employment of physics-based computational models referred to as simulation; (ii) design and analysis of unsupervised domain adaption techniques to close the gap between the simulated and real-world data distributions through a low-dimensional subspace transformation; (iii) development of learning techniques in the projected subspace to train an initial weak labeler; (iv) combined use of the weak labeler and a generative-adversarial framework to refine the simulated datasets by employing on a set of unlabeled real-world dataset in order to train a strong labeler; and (v) development and analysis of active learning techniques to select the most informative datasets to refine and adapt the strong labeler into a novel case with small data in the target application.

    RELATED PAPERS
    Inner Space Preserving Generative Pose Machine
    https://arxiv.org/abs/1808.02104
    Background Subtraction via Fast Robust Matrix Completion
    https://arxiv.org/abs/1711.01218

    BIO
    Professor Ostadabbas is an assistant professor in the Electrical and Computer Engineering Department of Northeastern University (NEU), Boston, Massachusetts, USA. Professor Ostadabbas joined NEU in 2016 from Georgia Tech, where she was a post-doctoral researcher following completion of her PhD at the University of Texas at Dallas in 2014. At NEU, Professor Ostadabbas is the director of the Augmented Cognition Laboratory (ACLab) with the goal of enhancing human information-processing capabilities through the design of adaptive interfaces via physical, physiological, and cognitive state estimation. These interfaces are based on rigorous models adaptively parameterized using machine learning and computer vision algorithms. In particular, she has been integrating domain knowledge with machine learning by using physics-based simulation as generative models for bootstrapping deep learning recognizers. Professor Ostadabbas is the co-author of more than 50 peer-reviewed journal and conference articles and her research has been awarded by the National Science Foundation (NSF), Mathworks, Amazon AWS, and NVIDIA. ​She is the co-organizer of the Multimodal Data Fusion (MMDF2018) workshop, an NSF PI mini-workshop on Deep Learning in Small Data, and will be the program chair of the upcoming Machine Learning in Signal Processing (MLSP2019) conference. Prof. Ostadabbas is an associate editor of the IEEE Transactions on Biomedical Circuits and Systems, on the Editorial Board of the IEEE Sensors Letters and Digital Biomarkers Journal, and has been serving in several signal processing and machine learning conferences as a technical chair or session chair.

  • (2/15, 12:00pm): Olga Papaemmanouil (Brandeis University): Deep Learning meets Query Optimization details)

    TITLE
    Deep Learning meets Query Optimization

    ABSTRACT
    Query optimization remains one of the most important and well studied problems in database systems. However, traditional query optimizers are complex, heuristically-driven systems that do not to learn from past experiences: they plan the execution of a query, but are ignorant of the actual performance of the picked plan. Because of the lack of feedback, a query optimizer may select the same bad query plan repeatedly, never learning from its previous good or bad choices.

    In this talk, I will argue that a new type of query optimizer, one that integrates deep learning with query optimization, can drastically improve on the state-of-the-art. Towards this direction, I will discuss ReJOIN, a proof-of-concept join enumerator that relies on deep reinforcement learning. ReJOIN leverages prior experience, and learns how to optimize future queries more effectively (i.e., discovers better query plans) and efficiently (i.e., spending less time on optimization) compared with traditional optimizers. I will discuss potential challenges for future research and describe deep learning approaches that can lead the way to end-to-end learning-based query optimizers.

    RELATED PAPERS
    http://www.cs.brandeis.edu/%7Eolga/publications/ReJOIN_aiDM18.pdf
    https://arxiv.org/pdf/1809.10212

    BIO
    Olga Papaemmanouil is an Associate Professor in the Department of Computer Science at Brandeis University. She received her undergraduate degree in Computer Science and Informatics at the University of Patras, Greece in 1999. In 2001, she received her Sc.M. in Information Systems at the University of Economics and Business, Athens, Greece. She then joined the Computer Science Department at Brown University, where she completed her Ph.D in Computer Science at Brown University in 2008. Her research interests are in databases and distributed data management. She is the recipient of an NSF Career Award (2013) and a Paris Kanellakis Fellowship from Brown University (2002).

  • (3/5, 3:30pm): Laura Di Rocco (Northeastern): The role of geographical knowledge in subcity level geolocation algorithms details)

    ABSTRACT
    Geolocation of microblog messages has been largely investigated in the literature. Many solutions exist, most of whichare data-driven (i.e., they rely on a training phase) and achieve good results at city-level. The development of algorithmsfor geolocation at sub-city level is still an open problem. In this paper, we investigate the role that external geographic knowledge can play in achieving reasonably accuratesub-city level geolocation. We propose a knowledge-base method, called Sherloc, to accurately geolocate messages at sub-city level, by exploiting the presence of toponyms in the message. Sherloc extracts the semantics of toponyms from geographic gazetteers and embeds them into a metric space that captures the semantic distance among them. This way we can identify the semantically closest toponyms to a microblog message and, at the same time, cluster them withrespect to their spatial locations. In contrast to the state of the art methods, Sherloc requires no prior training, it is not limited to geolocating on a fixed spatial grid, and is able to infer the location at sub-city level with higher accuracy.

  • (3/29, 11:00am): Yongjoo Park (University of Michigan): Bringing Statistical Tradeoffs to Data Systems details)

    TITLE
    Bringing Statistical Tradeoffs to Data Systems

    ABSTRACT
    Despite advances in computing power, the cost of large-scale analytics and machine learning remains daunting to small and large enterprises alike. This has created a pressing demand for reducing infrastructure costs and query latencies. To meet these goals, data analysts and applications are in many cases willing to tolerate a slight—but controlled—degradation of accuracy in exchange for substantial gains in cost and performance, which we refer to as statistical tradeoffs. This is particularly true in the early stages of data exploration and is in stark contrast to traditional tradeoffs where the infrastructure costs must increase for higher performance.

    BIO
    My research builds large-scale data systems that can make these statistical tradeoffs in a principled manner. In this talk, I will focus on two specific directions. First, I will present VerdictDB, a system that enables quality-guaranteed, statistical tradeoffs without any changes to backend infrastructure; thus, it offers a universal solution for off-the-shelf query engines. Second, I will introduce Database Learning, a new query execution paradigm that allows existing query engines to constantly learn from their past executions and become “smarter” over time without any user intervention. I will conclude by briefly discussing other promising directions with emerging workloads beyond SQL, including visualization and machine learning.

  • (4/2, 10:00am): WVH 155/168: Volker Markl & members of his research group details)

    Four students from Prof Volker Markl's group in TU Berlin will present their work within 15min each (after general introductions by Renee and Volker).

    Bio: Volker Markl is a Full Professor and Chair of the Database Systems and Information Management (DIMA) Group at the Technische Universität Berlin (TU Berlin). At the German Research Center for Artificial Intelligence (DFKI), he is both a Chief Scientist and Head of the Intelligent Analytics for Massive Data Research Group. In addition, he is Director of the Berlin Big Data Center (BBDC) and Co- Director of the Berlin Machine Learning Center (BzMl). Earlier in his career, he was a Research Staff Member and Project Leader at the IBM Almaden Research Center in San Jose, California, USA and a Research Group Leader at FORWISS, the Bavarian Research Center for Knowledge-based Systems located in Munich, Germany. Dr. Markl has published numerous research papers on indexing, query optimization, lightweight information integration, and scalable data processing. He holds 18 patents, has transferred technology into several commercial products, and advises several companies and startups. He has been both the Speaker and Principal Investigator for the Stratosphere Project, which resulted in a Humboldt Innovation Award as well as Apache Flink, the open-source big data analytics system. He serves as the President-Elect of the VLDB Endowment and was elected as one of Germany's leading Digital Minds (Digitale Köpfe) by the German Informatics (GI) Society. Most recently, Volker and his team earned an ACM SIGMOD Research Highlight Award 2016 for their work on “Implicit Parallelism Through Deep Language Embedding.” Volker Markl and his team earned an ACM SIGMOD Research Highlight Award 2016 for their work on implicit parallelism through deep language embedding.

    LINKS: Prof Markl's group website: http://www.dima.tu-berlin.de


    Speaker: Sebastian Breß

    Title: Efficient Data Processing on Modern Hardware

    Abstract: Processor manufacturers build increasingly specialized processors to mitigate the effects of the power wall in order to deliver improved performance. Currently, database engines have to be manually optimized for each processor which is a costly and error prone process. In this talk, we provide an overview of our research on automatic performance tuning in Hawk, a hardware-tailored code generator. Our key idea is to create processor-specific code variants and to learn a well- performing code variant for each processor. These code variants leverage various parallelization strategies and apply both generic and processor-specific code transformations. We observe that performance of different code variants may diverge up to two orders of magnitude. Thus, we need to generate custom code for each processor for peak performance. To this end, Hawk automatically finds efficient code variants for CPUs, GPUs, and MICs.

    Bio: Sebastian Breß received his PhD (Dr.-Ing.) from University of Magdeburg, Germany in 2015, under the supervision of Gunter Saake (University of Magdeburg) and Jens Teubner (TU Dortmund). He is the the initiator and system architect of the research database system CoGaDB and the Hawk Code Generator. Currently, Sebastian is a Senior Researcher at German Research Center for Artificial Intelligence (DFKI) and a PostDoc at Technische Universität Berlin, working with Prof. Dr. Volker Markl and Prof. Dr. Tilmann Rabl. Sebastian‘s research interests include data management on modern hardware, query compilation, stream processing, and optimizing data management systems for heterogeneous processors. Sebastian has been selected as a Distinguished Program Committee Member at SIGMOD 2018.


    Speaker: Jonas Traub

    Title: On-Demand Data Stream Gathering in the Internet of Things

    Abstract: In the Internet of Things (IoT), billions of sensor nodes form a sensor cloud and offer data streams to analysis systems. However, it is impossible to transfer all available data with maximal frequencies to all applications. In this talk, we show how we gather streaming data efficiently from a huge number of sensor nodes and how we make it available to a huge number of applications. Our key-idea is to gather data streams based on the data demand of streaming queries. The data demand of a query is the minimum number of data points which allows for answering the query with the desired precision. We present a solution which shares sensor nodes among many concurrent streaming queries by multiplexing their data demands. Our technique shares sensor reads and corresponding network traffic among all queries to save costs. Our experiments with real-world sensor data show that our approach saves up to 87% in data transmissions.

    Bio: Jonas is a Research Associate at Technische Universität Berlin and the German Research Center for Artificial Intelligence (DFKI). His research interests include data stream processing, sensor data analysis, and data acquisition from sensor nodes. Jonas authored several publications related to data stream gathering, processing and transmission in the Internet of Things and will complete his PhD in March 2019 under the supervision of Volker Markl. Before he started his PhD, Jonas wrote his master thesis at the Royal Institute of Technology (KTH) and the Swedish Institute of Computer Science (SICS) / RISE in Stockholm under supervision of Seif Haridi and Volker Markl and advised by Paris Carbone and Asterios Katsifodimos. Prior to that, he received his B.Sc. degree at Baden-Württemberg Cooperative State University (DHBW Stuttgart) and worked several years at IBM in Germany and the USA. Jonas is an alumnus of "Software Campus", "Studienstiftung des deutschen Volkes" and "Deutschlandstipendium". All publication and supplementary material are available on http://www.user.tu-berlin.de/powibol/.


    Speaker: Andreas Kunft

    Title: Efficient Matrix Partitioning Through Joins

    Abstract: End-to-end machine learning pipelines often consist of (i) relational operators to join the input data, (ii) user defined functions for feature extraction and vectorization, and (iii) linear algebra operators for model training and cross-validation. Often, these pipelines need to scale out to large datasets. In this case, these pipelines are usually implemented on top of dataflow engines like Hadoop, Spark, or Flink. Dataflow engines implement relational operators on row-partitioned datasets. However, efficient linear algebra operators use block-partitioned matrices. As a result, pipelines combining both kinds of operators require rather expensive changes to the physical representation, in particular re-partitioning steps. In this talk, I investigate the potential of reducing shuffling costs by fusing relational and linear algebra operations into specialized physical operators. I present BlockJoin, a distributed join algorithm which directly produces block-partitioned results. To minimize shuffling costs, BlockJoin applies database techniques known from columnar processing, such as index-joins and late materialization, in the context of parallel dataflow engines.

    Bio: Andreas Kunft is a PhD student/research associate at Technische Universität Berlin in the Database Systems and Information Management Group (DIMA) led by Volker Markl. His research interests include massive parallel processing frameworks, databases, and compilers, with focus on the integration of database and compiler techniques for holistic optimization of analytics pipelines.


    Speaker: Martin Kiefer

    Title: Approximate Data Analysis using Modern Hardware

    Abstract: Data summaries are an effective tool to balance efficiency and accuracy in data analysis tasks: Most relational query optimizers rely heavily on selectivity estimates computed from histograms or samples. Sketches have become an increasingly popular technique to perform high-bandwidth stream analysis tasks such as change or heavy-hitter detection. Using modern hardware such as GPUs and FPGAs reduces the cost of approximate

  • (4/5, 12:00pm): Tingjian Ge (UMass Lowell): Dense regions: a different type of event in graph streams details)

    TITLE
    Dense regions: a different type of event in graph streams

    ABSTRACT
    A graph stream, as a special data stream, consists of a sequence of edge insertions and deletions over a dynamic graph. Analogous to general data streams, complex events have been studied on graph streams which incorporate both time order and structural relationship over basic edge events. In this talk, I will propose a new type of complex event – the occurrence of a dense region that also lasts for a long time. This turns out to be very useful in applications of various domains including telecommunication hotspot detection, road traffic control, spam network filtering, and dynamic community detection.

    To efficiently discover dense regions, we propose an algorithmic framework called the Expectation-Maximization with a Utility function (EMU), a novel stochastic approach that nontrivially extends the conventional EM. We validate our EMU approach by showing that it converges to the optimum—by proving that it is a specification of the general Minorization-Maximization (MM) framework with convergence guarantees. We then devise EMU algorithms for the densest lasting subgraph problem. Using real-world graph streams, we experimentally verify the effectiveness and efficiency of our techniques.

    RELATED PAPER
    http://www.cs.uml.edu/~ge/pdf/dense_subgraph_ICDE_19.pdf

    BIO
    Tingjian Ge is an associate professor in Computer Science at the University of Massachusetts, Lowell. He received a Ph.D. from Brown University in 2009. Prior to that, he got his Bachelor’s and Master’s degrees in Computer Science from Tsinghua University and UC Davis, respectively, and worked at Informix and IBM for six years. His research areas are in data management and analytics, with a recent focus on applying machine learning, AI, and algorithmic techniques in data management and mining. He is a recipient of the NSF CAREER Award in 2012, and a Teaching Excellence Award at UMass Lowell in 2014.

Fall 2018 (Wolfgang)

  • (9/13, 10:40am): WVH 366: Anshumali Shrivastava (Rice University): Hashing Algorithms for Extreme Scale Machine Learning details)

    TITLE
    Hashing Algorithms for Extreme Scale Machine Learning.

    ABSTRACT
    In this talk, I will discuss some of my recent and surprising findings on the use of hashing algorithms for large-scale estimations. Locality Sensitive Hashing (LSH) is a hugely popular algorithm for sub-linear near neighbor search. However, it turns out that fundamentally LSH is a constant time (amortized) adaptive sampler from which efficient near-neighbor search is one of the many possibilities. Our observation adds another feather in the cap for LSH. LSH offers a unique capability to do smart sampling and statistical estimations at the cost of few hash lookups. Our observation bridges data structures (probabilistic hash tables) with efficient unbiased statistical estimations. I will demonstrate how this dynamic and efficient sampling beak the computational barriers in adaptive estimations where, for the first time, it is possible that we pay roughly the cost of uniform sampling but get the benefits of adaptive sampling. We will demonstrate the power of one simple idea for three favorite problems 1) Partition function estimation for large NLP models such as word2vec, 2) Adaptive Gradient Estimations for efficient SGD and 3) Sub-Linear Deep Learning with Huge Parameter Space.

    In the end, if time permits, we will switch to memory cost show a simple hashing algorithm that can shrink memory requirements associated with classification problems exponentially! Using our algorithms, we can train 100,000 classes with 400,000 features, on a single Titan X while only needing 5% or less memory required to store all the weights. Running a simple logistic regression on this data, the model size of 320GB is unavoidable.

    BIO
    Anshumali Shrivastava is an assistant professor in the computer science department at Rice University. His broad research interests include randomized algorithms for large-scale machine learning. He is a recipient of National Science Foundation (NSF) CAREER Award, a Young Investigator Award from Air Force Office of Scientific Research (AFOSR), and machine learning research award from Amazon. His research on hashing inner products has won Best Paper Award at NIPS 2014 while his work on representing graphs got the Best Paper Award at IEEE/ACM ASONAM 2014. Anshumali got his PhD in 2015 from Cornell University.

  • (9/26, 12:00pm): Spyros Blanas (Ohio State University): Scaling database systems to high-performance computers details)

    TITLE
    Scaling database systems to high-performance computers

    ABSTRACT
    We are witnessing the increasing use of warehouse-scale computers to analyze massive datasets quickly. This poses two challenges for database systems. The first challenge is interoperability with established analytics libraries and tools. Massive datasets often consist of images (arrays) in file formats like FITS and HDF5. To analyze such datasets users turn to domain-specific libraries and deep learning frameworks, and thus write code that directly manipulates files. We will first present ArrayBridge, an open-source I/O library that allows SciDB, TensorFlow and HDF5-based programs to co-exist in a pipeline without converting between file formats. With ArrayBridge, users benefit from the optimizations of a database system without sacrificing the ability to directly manipulate data through the existing HDF5 API when they want to.

    The second challenge is scalability, as warehouse-scale computers expose communication bottlenecks in foundational data processing operations. This talk will focus on data shuffling and parallel aggregation. We will first present an RDMA-aware data shuffling algorithm that transmits data up to 4X faster than MPI. This is achieved by switching to a connectionless, datagram-based network transport layer that scales better but requires flow control in software. We will then present a parallel aggregation algorithm for high-cardinality aggregation that carefully schedules data transmissions to avoid unscaleable all-to-all communication. The algorithm leverages similarity to transmit less data over congested network links. We will conclude by highlighting additional challenges that need to be overcome to scale database systems to massive computers.

    BIO
    Spyros Blanas is an assistant professor in the Department of Computer Science and Engineering at The Ohio State University. His research interest is high performance database systems, and his current goal is to build a database system for high-end computing facilities. He has received the IEEE TCDE Rising Star award and a Google Research Faculty award. He completed his Ph.D. at the University of Wisconsin–Madison where part of his Ph.D. dissertation was commercialized in Microsoft SQL Server as the Hekaton in-memory transaction processing engine.

    LINKS
    EUROSYS 2017:
    https://scholar.google.com/scholar?cluster=12838806817740083185

    ICDE 2018:
    https://scholar.google.com/scholar?cluster=3222331661200894636

  • (9/28, 12:00pm): Stratos Idreos (Harvard): The periodic table of data structures details)

    TITLE
    The periodic table of data structures.

    ABSTRACT
    Data structures are critical in any data-driven scenario, and they define the behavior of modern data systems and data-driven algorithms. However, they are notoriously hard to design due to a massive design space and the dependence of performance on workload and hardware which evolve continuously.

    What if we knew how many and which data structures are possible to design? What if we could compute the expected performance of a data structure design on a given workload and hardware without having to implement it and without even having access to the target machine? We will discuss our quest for 1) the first principles of data structures, 2) design continuums that make it possible to automate design, and 3) self-designing systems that can morph between what we now consider fundamentally different structures. We will draw examples from the NoSQL key-value store design space and discuss how to accelerate them and balance space-time tradeoffs.

    BIO
    Stratos Idreos is an assistant professor of Computer Science at Harvard University where he leads DASlab, the Data Systems Laboratory. Stratos was awarded the 2011 ACM SIGMOD Jim Gray Doctoral Dissertation award and the 2011 ERCIM Cor Baayen award. He is also a recipient of an IBM zEnterpise System Recognition Award, a VLDB Challenges and Visions best paper award and an NSF Career award. In 2015 he was awarded the IEEE TCDE Rising Star Award from the IEEE Technical Committee on Data Engineering for his work on adaptive data systems.  

    LINKS
    https://stratos.seas.harvard.edu/files/stratos/files/datacalculator.pdf
    https://stratos.seas.harvard.edu/files/stratos/files/dostoyevski.pdf
    https://stratos.seas.harvard.edu/files/stratos/files/monkeykeyvaluestore.pdf

    Prior recorded talk at UW:
    http://db.cs.washington.edu/nwds/nwds.html

    Further pointer:
    https://www.zdnet.com/article/zen-and-the-art-of-data-structures-from-self-tuning-to-self-designing-data-systems/

  • (10/4, 4:30pm): Angelika Kimmig (Cardiff University): A Collective, Probabilistic Approach to Schema Mapping details)

    TITLE
    A Collective, Probabilistic Approach to Schema Mapping

    ABSTRACT
    We propose a probabilistic approach to the problem of schema mapping. Our approach is declarative, scalable, and extensible. It builds upon recent results in both schema mapping and probabilistic reasoning and contributes novel techniques in both fields. We introduce the problem of mapping selection, that is, choosing the best mapping from a space of potential mappings, given both metadata constraints and a data example. As selection has to reason holistically about the inputs and the dependencies between the chosen mappings, we define a new schema mapping optimization problem which captures interactions between mappings. We then introduce Collective Mapping Discovery (CMD), our solution to this problem using state-of-the-art probabilistic reasoning techniques, which allows for inconsistencies and incompleteness. Using hundreds of realistic integration scenarios, we demonstrate that the accuracy of CMD is more than 33% above that of metadata-only approaches already for small data examples, and that CMD routinely finds perfect mappings even if a quarter of the data is inconsistent.

    BIO
    Angelika Kimmig is a Lecturer at Cardiff University, UK. She obtained her Ph.D. from KU Leuven, Belgium, and was a postdoctoral fellow at KU Leuven and the University of Maryland, College Park. Her research interests include symbolic AI, reasoning under uncertainty, machine learning, logic programming, and especially combinations thereof such as probabilistic programming and statistical relational learning. She is a key contributor to the probabilistic logic programming language ProbLog.

    PAPER from ICDE 2017:
    https://scholar.google.com/scholar?cluster=2786820379326742378

  • (10/26, 12:00pm): Mania Abdi (Northeastern): D3N: A multi-layer cache for improving big-data applications’ performance in data centers with imbalanced networks details)

    TITLE
    D3N: A multi-layer cache for improving big-data applications’ performance in data centers with imbalanced networks

    ABSTRACT
    Caching methods for improving the performance of datalakes for throughput-bound big-data jobs assume unlimited bandwidth across the datacenter. However, most enterprise and academic datacenters grow organically and have heavily oversubscribed networks between different computer clusters. This paper describes D3N, an architecture that caches data on the access side of potential network bottlenecks and uses a multilayer approach. In the case of a two layer cache, reuse distances are dynamically computed along with cache miss costs to determine the partitioning of the cache into L1 (used for local accesses) and L2 (used for remote access, but within same access side) layers. We have implemented a prototype of D3N by modifying Ceph’s RADOS gateway. Micro and macro evaluations of the prototype demonstrate that the implementation is performant enough to saturate the (40,Gbit/s) NICs and (5 GB/s read) SSD of our caching server and deliver substantial storage bandwidth improvements for real workloads. Numerical models show that a multi-level, dynamic cache can have substantial advantages over today’s single-level caches when bandwidth is constrained.

    BIO
    Mania Abdi is a PhD student at the Northeastern University Solid State Storage research group. Prior to that, she was a software engineer. She received her MSc. in Computer Engineering at the Sharif University of Technology and B.Eng. in Computer Engineering at the Amirkabir University of Technology. Mania is a computer systems researcher with a storage focus and has worked on a broad set of topics, including distribted storage, caching, data center debuging, and end-to-end tracing.

  • (11/7, 12:00pm): Laurel Orr (University of Washington): Probabilistic Database Summarization for Interactive Data Exploration details)

    TITLE:
    Probabilistic Database Summarization for Interactive Data Exploration

    ABSTRACT
    A fundamental assumption of traditional DBMSs is that the database contains all information necessary to answer a query; i.e., the database contains the entire universe of data. Many data scientists, however, do not have access to the universe of data and instead rely on samples to answer queries. These data scientists need to use tools outside of the database or alter their queries to correctly reweight and debias their samples to get a more accurate query answer. This talk presents preliminary research into building the first open world database system (OWDB) that inherently assumes relations are samples from some universe, even if the sampling mechanism is unknown. We will discuss two different approaches for building an OWDB: probabilistic modeling of the universe and sample reweighting. We then present EntropyDB, a system that takes the former approach to summarize a database, and we lastly discuss ongoing research into the later approach using Bayesian Networks. We conclude with discussing the open challenges in developing an OWDB.

    BIO
    Laurel Orr is a 6th year PhD student in the Database Group at the Paul G Allen School for Computer Science and Engineering. Her research interests are data summarization, approximate query processing, data exploration, and the general development of tools and techniques to aid data scientists in their data investigation and analysis pipeline. Her current research goal is to develop a prototype open world database system that inherently treats relations a samples drawn from some larger universe of data, even when the sampling mechanism is unknown. She was a 2018 NCWIT Collegiate Award Honorable Mention and was awarded a NSF Graduate Research Fellowship in 2015.

    RELATED PAPER:
    http://db.cs.washington.edu/projects/entropydb/ljorr-vldb2017.pdf

    PROJECT PAGE:
    http://db.cs.washington.edu/projects/entropydb/

  • (11/9, 12:00pm): Raul Castor Fernandez (MIT): Aurum: A Data Discovery System details)

    TITLE
    Aurum: A Data Discovery System

    ABSTRACT
    Organizations store data in hundreds of different data sources, including relational databases, files, and large data lake repositories. These data sources contain valuable information and insights that can be beneficial to multiple aspects of modern data-driven organizations. However, as more data is produced, our ability to use it reduces dramatically, as no single person knows about all the existent data sources. One big challenge is to discover the data sources that are relevant to answer a particular question. Aurum is a data discovery system to answer "discovery queries" on large volumes of data. In this talk, I'll motivate the data discovery problem with use cases from different industries. I will describe Aurum's design and I will talk a bit about a new research project that aims to discover data beyond tables.

    BIO
    In my research, I build high-performance and scalable systems to discover, prepare and process data. I'm a postdoctoral researcher at MIT, working with Sam Madden and Mike Stonebraker. Before, I completed my PhD at Imperial College London with Peter Pietzuch.

    RELATED PAPERS
    http://raulcastrofernandez.com/papers/icde18-aurum.pdf
    http://raulcastrofernandez.com/papers/wip_termite.pdf

  • (11/14, 12:00pm): Lawson Wong (Northeastern): Abstraction in robotics details)

    TITLE
    Abstraction in robotics

    ABSTRACT
    Robotics is a big data problem. To make sense of the physical world, perform tasks well, and generalize across environments, robots need to represent and understand the world at the "correct" level of abstraction. What "correct" should mean remains to be seen.
    In this talk, I will describe two lines of work that attempt to answer this question from very different perspectives. I will first discuss work on grounding natural language instructions to robot behavior, where we demonstrate that having the right representations can enable human-robot communication. This is an important problem for robotics, since we envision users using natural language to instruct robots to perform a wide variety of tasks. In the second half, I will discuss recent preliminary work on the theoretical foundations of state abstraction in reinforcement learning, a common framework used in robot learning problems. In particular, we view state abstraction as data compression, and apply results in information theory (rate-distortion theory) to the reinforcement learning setting.
    Time permitting, I will describe some extensions to the above work, as well as other abstraction-related problems, that I envision my group will pursue at Northeastern.

    BIO
    Lawson L.S. Wong is an assistant professor in the College of Computer and Information Science at Northeastern University. His research focuses on learning, representing, and estimating knowledge about the world that an autonomous robot may find useful. Prior to Northeastern, Lawson was a postdoctoral fellow at Brown University. He completed his PhD at the Massachusetts Institute of Technology. He has received a Siebel Scholarship, AAAI Robotics Student Fellowship, and Croucher Foundation Fellowship for Postdoctoral Research.
    https://www.ccis.northeastern.edu/people/lawson-wong/

    RELATED PAPERS
    Sequence-to-Sequence Language Grounding of Non-Markovian Task Specifications
    Nakul Gopalan, Dilip Arumugam, Lawson L.S. Wong, Stefanie Tellex
    http://www.roboticsproceedings.org/rss14/p67.html
    Robotics: Science and Systems (2018)

    Grounding Natural Language Instructions to Semantic Goal Representations for Abstraction and Generalization
    Dilip Arumugam, Siddharth Karamcheti, Nakul Gopalan, Edward C. Williams, Mina Rhee, Lawson L.S. Wong, Stefanie Tellex
    http://cs.brown.edu/~lwong5/papers/2018-languagegrounding.pdf
    Autonomous Robots (in press; 2018)
    Extended journal version of Robotics: Science and Systems (2017) paper

    State Abstraction as Compression in Apprenticeship Learning
    David Abel, Dilip Arumugam, Kavosh Asadi, Yuu Jinnai, Michael L. Littman, Lawson L.S. Wong
    Preprint available on request (by e-mail)
    To appear in AAAI Conference on Artificial Intelligence (2019)

  • (11/16, 12:00pm): Alexandra Meliou (UMass Amherst): Creating a Higher-Quality Data World details)

    TITLE
    Diagnoses and Explanations: Creating a Higher-Quality Data World

    ABSTRACT
    The correctness and proper function of data-driven systems and applications relies heavily on the correctness of their data. Low quality data can be costly and disruptive, leading to revenue loss, incorrect conclusions, and misguided policy decisions. Improving data quality is far more than purging datasets of errors; it is critical to improve the processes that produce the data, to collect good data sources for generating the data, and to address the root causes of problems.

    Our work is grounded on an important insight: While existing data cleaning techniques can be effective at purging datasets of errors, they disregard the fact that a lot of errors are systemic, inherent to the process that produces the data, and thus will keep occurring unless the problem is corrected at its source. In contrast to traditional data cleaning, we focus on data diagnosis: explaining where and how the errors happen in a data generative process. I will describe our work on Data X-Ray and QFix, two diagnostic frameworks for large-scale extraction systems and relational data systems. I will also provide a brief overview of new results on knowledge augmentation and explanations for dataset differences, building towards a vision for toolsets that assist the exploration of information in a varied, diverse, and highly non-integrated data world.

    BIO
    Alexandra Meliou is an Assistant Professor in the College of Information and Computer Sciences, at the University of Massachusetts, Amherst. Prior to that, she was a Post-Doctoral Research Associate at the University of Washington. Alexandra received her PhD degree from the Electrical Engineering and Computer Sciences Department at the University of California, Berkeley. She has received recognitions for research and teaching, including a CACM Research Highlight, an ACM SIGMOD Research Highlight Award, an ACM SIGSOFT Distinguished Paper Award, an NSF CAREER Award, a Google Faculty Research Award, and a Lilly Fellowship for Teaching Excellence. Her research focuses on data provenance, causality, explanations, data quality, and algorithmic fairness.
    https://people.cs.umass.edu/~ameli/

    RELATED PAPERS
    http://sites.computer.org/debull/A18june/p47.pdf
    http://people.cs.umass.edu/ameli/projects/queryProvenance/papers/WangMW2017.pdf
    https://people.cs.umass.edu/~ameli/projects/packageBuilder/papers/scalable-paql.pdf

  • (12/7, 11:30am): Fei Chiang (McMaster University): Contextual and Spatio-temporal Data Cleaning details)

    TITLE
    Contextual and Spatio-temporal Data Cleaning

    ABSTRACT
    It is becoming increasingly difficult for organizations to reap value from their data due to poor data quality. This is motivated by the observation that real data is rarely error free, containing incomplete, inconsistent, and stale values. This leads to inaccurate, and out-of-date data analysis downstream. Addressing data inconsistency requires not only reconciling differing syntactic references to an entity, but it is often necessary to include domain expertise to correctly interpret the data. For example, understanding that a reference to ‘jaguar’ may be interpreted as an animal or as a vehicle. Secondly, having up-to-date (or current) data is important for timely data analysis. Cleaning stale values goes beyond just relying on timestamps, especially when timestamps may be missing, inaccurate or incomplete.

    In this talk, I will present our work towards achieving consistent and up-to-date data. First, I will discuss contextual data cleaning that uses a new class of data integrity constraints that tightly integrate domain semantics from an ontology. Second, we argue that data currency is a relative notion based on individual spatio-temporal update patterns, and these patterns can be learned and predicted. I will present our framework to achieve these two objectives, and provide a brief overview of recent extensions with applications to knowledge fusion.

    BIO
    Fei Chiang is an Assistant Professor in the Department of Computing and Software at McMaster University. She is a Faculty Fellow at the IBM Centre for Advanced Studies, and served as an inaugural Associate Director of the McMaster MacData Institute. She received her M. Math from the University of Waterloo, and B.Sc and PhD degrees from the University of Toronto, all in Computer Science. Her research interests are in data quality, data cleaning, data privacy and text mining. She holds four patents for her work in self-managing database systems. Her work has been featured in the Southern Ontario Smart Computing Impact Report. She is a recipient of the Dean’s Teaching Honour Roll, and a 2018 Ontario Early Researcher Award.

  • (12/12, 12:00pm): Erkang Zhu (University of Toronto): Get Your Data Together! Algorithms for Managing Data Lakes details)

    TITLE
    Get Your Data Together! Algorithms for Managing Data Lakes

    ABSTRACT
    Data lakes (e.g., enterprise data catalogs and Open Data portals) are data dumps if users cannot find and utilize the data in them. In this talk, I present two problems in massive, dynamic data lakes: 1) searching for joinable tables without a precomputed join graph, and 2) joining tables from different sources through auto-generated syntactic transformation on join values. I will also present two algorithmic solutions that can be used for data lakes that are large both in the number of tables (millions) and table sizes. The presented work has been published in SIGMOD and VLDB.

    BIO
    Erkang (Eric) Zhu is a 5th year computer science PhD candidate at University of Toronto. His supervisor is Prof. Renée J. Miller. His research focuses on data discovery, large-scale similarity search, and randomized algorithms (data sketches).

    RELATED PAPERS
    Erkang Zhu, Fatemeh Nargesian, Ken Q. Pu, Renée J. Miller: LSH Ensemble: Internet-Scale Domain Search. PVLDB 9(12): 1185-1196 (2016)
    Erkang Zhu, Ken Q. Pu, Fatemeh Nargesian, Renée J. Miller: Interactive Navigation of Open Data Linkages. PVLDB 10(12): 1837-1840 (2017)

Spring / Summer 2018 (Wolfgang)

Fall 2017 (Wolfgang)

This semester the data lab seminar is combined with in our special topics class (CS 7290: Special topics: Foundations in scalable data management) and takes place every Tuesday 11:45am-1:25pm and Thursday 2:50-4:30pm in Ryder Hall 126 (map).

Related Seminars @NEU