Accepted Research Papers
Program at a Glance
Industry Track
PhD Symposium
Poster Papers
Demos
TKDE Posters

Research Track Schedule



R01: Crowdsourcing (Tuesday 21st April, 10:00-11:30)
Title Authors
R01.1 39 Real-Time Cross Online Matching in Spatial Crowdsourcing Yurong Cheng (Beijing institute of technology)*; Boyang Li (Northeastern University); Xiangmin Zhou (RMIT University); Ye Yuan (Beijing Institute of Technology); Guoren Wang (Beijing Institute of Technology); Lei Chen (Hong Kong University of Science and Technology)
With the development of mobile devices and communication techniques, spatial crowdsourcing platforms are becoming more and more popular in recent years. One of the typical topics of spatial crowdsourcing is the task assignment, which assigns crowd workers to users' requests in real time and maximizes the total revenue. However, it is common that the available crowd workers over a platform are too far away to serve the requests, so some user requests may be rejected or responded at high money cost after long waiting. Fortunately, the neighbors of a platform usually have available resources for the same services. If different platforms can collaboratively conduct the task allocation, it can greatly affect the quality of services, and have not been investigated yet. In this paper, we propose a Cross Online Matching (COM), which enables a platform to ``borrow" some unoccupied crowd workers from other platforms for completing the user requests. If the requests attempts to be completed by ``borrowed" workers, COM decides how much should be paid to the borrowed crowd workers, such that the borrowed workers are stimulated to accept the request and the request target platform can gain some revenue as well. We propose two algorithms, deterministic cross online matching (DemCOM) and randomized cross online matching (RamCom) for COM. DemCOM guarantees the largest obtained revenue in a greedy manner, while RamCom considers the trade-off between the obtained revenue and the probability of request being accepted by the borrowed workers. Extensive experimental results verify the effectiveness and efficiency of our algorithms.
R01.2 53 "Predictive Task Assignment in Spatial Crowdsourcing: A Data-driven Approach" Yan Zhao (School of Computer Science and Technology, Soochow University); Kai Zheng (University of Electronic Science and Technology of China)*; Yue Cui (University of Electronic Science and Technology of China); Han Su (Big Data Research Center, University of Electronic Science and Technology of China); Feida Zhu (Singapore Management University); Xiaofang Zhou (University of Queensland)
With the rapid development of mobile networks and the widespread usage of mobile devices, spatial crowdsourcing, which refers to assigning location-based tasks to moving workers, has drawn increasing attention. One of the major issues in spatial crowdsourcing is task assignment, which allocates tasks to appropriate workers. However, existing works generally assume the static offline scenarios, where the spatio-temporal information of all the workers and tasks is determined and known a priori. Ignorance of the dynamic spatio-temporal distributions of workers and tasks can often lead to poor assignment results. In this work we study a novel spatial crowdsourcing problem, namely Predictive Task Assignment (PTA), which aims to maximize the number of assigned tasks by taking into account both current and future workers/tasks that enter the system dynamically with location unknown in advance. We propose a two-phase data-driven framework. The prediction phase hybrids different learning models to predict the locations and routes of future workers and designs a graph embedding approach to estimate the distribution of future tasks. In the assignment component, we propose both greedy algorithm for large-scale applications and optimal algorithm with graph partition based decomposition. Extensive experiments on two real datasets demonstrate the effectiveness of our framework.
R01.3 427 "Curiosity-Driven Energy-Efficient Worker Scheduling in Vehicular Crowdsourcing: A Deep Reinforcement Learning Approach" Chi Harold Liu (Beijing Institute of Technology)*; Yinuo Zhao (Beijing Institute of Technology ); Zipeng Dai (Beijing Institute of Technology ); Ye Yuan (Northeastern University); Guoren Wang (Beijing Institute of Technology); Dapeng Wu (); Kin K. Leung (Imperial College )
Spatial crowdsourcing (SC) utilizes the potential of a crowd to accomplish certain location based tasks. Although worker scheduling has been well studied recently, most existing works only focus on the static deployment of workers but ignore their temporal movement continuity. In this paper, we explicitly consider the use of unmanned vehicular workers, e.g., drones and driverless cars, which are more controllable and can be deployed in remote or dangerous areas to carry on long-term and hash tasks as a vehicular crowdsourcing (VC) campaign. We propose a novel deep reinforcement learning (DRL) approach for curiosity driven energy-efficient worker scheduling, called “DRL-CEWS”, to achieve an optimal trade-off between maximizing the collected amount of data and coverage fairness, and minimizing the overall energy consumption of workers. Specifically, we first utilize a chief-employee distributed computational architecture to stabilize and facilitate the training process. Then, we propose a spatial curiosity model with a sparse reward mechanism to help derive the optimal policy in large crowdsensing space with unevenly distributed data. Extensive simulation results show that DRL-CEWS outperforms the state-of-the-art methods and baselines, and we also visualize the benefits curiosity model brings and show the impact of two hyperparameters.
R01.4 541 "Crowdsourced Collective Entity Resolution with Relational Match Propagation" Jiacheng Huang (Nanjing University); Wei Hu (Nanjing University)*; Zhifeng Bao (RMIT University); Yuzhong Qu (Nanjing University)
Knowledge bases (KBs) store rich yet heterogeneous entities and facts. Entity resolution (ER) aims to identify entities in KBs which refer to the same real-world object. Recent studies have shown significant benefits of involving humans in the loop of ER. They often resolve entities with pairwise similarity measures over attribute values and resort to the crowds to label uncertain ones. However, existing methods still suffer from high labor costs and insufficient labeling to some extent. In this paper, we propose a novel approach called crowdsourced collective ER, which leverages the relationships between entities to infer matches jointly rather than independently. Specifically, it iteratively asks human workers to label picked entity pairs and propagates the labeling information to their neighbors in distance. During this process, we address the problems of candidate entity pruning, probabilistic propagation, optimal question selection and error-tolerant truth inference. Our experiments on real-world datasets demonstrate that, compared with state-of-the-art methods, our approach achieves superior accuracy with much less labeling.
R01.5 639 "An End-to-End Deep RL Framework for Task Arrangement in Crowdsourcing Platforms" Caihua Shan (HKU)*; Nikos Mamoulis (University of Ioannina); Reynold Cheng ("The University of Hong Kong, China"); Guoliang Li (Tsinghua University); Xiang Li (University of Hong Kong); Yuqiu Qian (Tencent Inc.)
In this paper, we propose a Deep Reinforcement Learning (RL) framework for task arrangement, which is a critical problem for the success of crowdsourcing platforms. Previous works conduct the personalized recommendation of tasks to workers via supervised learning methods. However, the majority of them only consider the benefit of either workers or requesters independently. In addition, they cannot handle the dynamic environment and may produce sub-optimal results. To address these issues, we utilize Deep Q-Network (DQN), an RL-based method combined with a neural network to estimate the expected long-term return of recommending a task. DQN inherently considers the immediate and future reward simultaneously and can be updated in real-time to deal with evolving data and dynamic changes. Furthermore, we design two DQNs that capture the benefit of both workers and requesters and maximize the profit of the platform. To learn value functions in DQN effectively, we also propose novel state representations, carefully design the computation of Q values, and predict transition probabilities and future states. Experiments on synthetic and real datasets demonstrate the superior performance of our framework.
R02: Data Integration and Machine Learning (Tuesday 21st April, 10:00-11:30)
Title Authors
R02.1 63 "Efficient Bidirectional Order Dependency Discovery" Yifeng Jin (Fudan University); Lin Zhu (Fudan University); Zijing Tan (Fudan University)*
Bidirectional order dependencies state a relationship of order between lists of attributes. They naturally model the order-by clauses in SQL queries, and are proved effective in query optimizations concerning sorting. Despite their importance, order dependencies on a dataset are typically unknown and are too costly, if not impossible, to design or discover manually. Techniques for automatic order dependency discovery are recently studied, but are not well understood by now. It is hard for order dependency discovery to scale well, since it is by nature factorial in the number m of attributes and quadratic in the number n of tuples.In this paper, we adopt a strategy that decouples the impact of m from that of n, and that still finds all minimal valid bidirectional order dependencies. We present carefully designed data structures, a host of algorithms and optimizations, for efficient order dependency discovery. With extensive experimental studies on both real-life and synthetic datasets, we verify our approach significantly outperforms state-of-the-art techniques, by orders of magnitude.
R02.2 96 "Efficient Diversity-Driven Ensemble for Deep Neural Networks" Wentao Zhang (Peking University)*; Jiawei Jiang (Peking University); Yingxia Shao (BUPT); Bin Cui (Peking University)
The ensemble of deep neural networks has been shown, both theoretically and empirically, to improve generalization accuracy on the unseen test set. However, the high training cost hinders its efficiency since we need a sufficient number of base models and each one in the ensemble has to be separately trained.Lots of methods are proposed to tackle this problem, and most of them are based on the feature that a pre-trained network can transfer its knowledge to the next base model and then accelerate the training process.However, these methods suffer a severe problem thatall of them transfer knowledge without selection and thus lead to low diversity. As the effect of ensemble learning is more pronounced if ensemble members are accurate and diverse, we propose a method named Efficient Diversity-Driven Ensemble (EDDE) to address both the diversity and the efficiency of an ensemble. To accelerate the training process, we propose a novel knowledge transfer method which can selectively transfer the previous generic knowledge. To enhance diversity, we first propose a new diversity measure, then use it to define a diversity-driven loss function for optimization. At last, we adopt a Boosting-based framework to combine the above operations, such a method can also further improve diversity. We evaluate EDDE on Computer Vision (CV) and Natural Language Processing (NLP) tasks. Compared with other well-known ensemble methods, EDDE can get highest ensemble accuracy with the lowest training cost, which means it is efficient in the ensemble of neural networks.
R02.3 689 "Adaptive Network Alignment with Unsupervised and Multi-order Convolutional Networks" Thanh Trung Huynh (Griffith University); Van Vinh Tong (Griffith University); Thanh Tam Nguyen (EPFL); Hongzhi Yin (The University of Queensland); Matthias Weidlich (Humboldt-Universität zu Berlin); Quoc Viet Hung Nguyen (Griffith University)*
Network alignment is the problem of pairing nodes between two graphs such that the paired nodes are structurally and semantically similar. A well-known application of network alignment is to identify which accounts in different social networks belong to the same person. Existing alignment techniques, however, lack scalability, cannot incorporate multi-dimensional information without training data, and are limited in the consistency constraints enforced by an alignment. In this paper, we propose a fully unsupervised network alignment framework based on a multi-order embedding model. The model learns the embeddings of each node using a graph convolutional neural representation, which we prove to satisfy consistency constraints. We further design a data augmentation method and a refinement mechanism to make the model adaptive to consistency violations and noise. Extensive experiments on real and synthetic datasets show that our model outperforms state-of-the-art alignment techniques. We also demonstrate the robustness of our model against adversarial conditions, such as structural noises, attribute noises, graph size imbalance, and hyper-parameter sensitivity.
R02.4 87 "A Natural Language Interface for Database: Achieving Transfer-learnability Using Adversarial Method for Question Understanding" Wenlu Wang (Auburn University)*; Yingtao Tian (Stony Brook University); Haixun Wang (WeWork Research); Wei-Shinn Ku (Auburn University)
Relational database management systems (RDBMSs) are powerful because they are able to optimize and answer queries against any relational database. However, when it comes to NLIDB (natural language interface for databases), the entire system is often custom-made for a particular database. How to overcome the complexity and expressiveness of natural languages so that a single NLI can support a variety of databases is an unsolved problem. In this work, we show that it is possible to separate data specific components from latent semantic structures in expressing relational queries in a natural language. With the separation, transferring an NLI from one database to another becomes possible. We develop a neural network classifier to detect data specific components and an adversarial mechanism to locate them in a natural language question. We then introduce a general purpose transfer-learnable NLI that focuses on the latent semantic structure.We devise a deep sequence model that translates the latent semantic structure to an SQL query. Experiments show that our approach outperforms previous NLI methods on the WikiSQL dataset and the model we learned can be applied to other benchmark datasets such as OVERNIGHT without retraining.
R02.5 824 "Array-based Data Management for Genomics" Olha Horlova (Politecnico di Milano)*; Abdulrahman Kaitoua (DFKI); Stefano Ceri (Politecnico di Milano)
With the huge growth of genomic data, exposing multiple heterogeneous features of genomic regions for millions of individuals, we increasingly need to support domain-specific query languages and knowledge extraction operations, capable of aggregating and comparing trillions of regions arbitrarily positioned on the human genome. While row-based models for regions can be effectively used as a basis for cloud-based implementations, in previous work we have shown that the array-based model is effective in supporting the class of region-preserving operations, i.e. operations which do not create any new region but rather compose existing ones. In this paper, we remove the above constraint, and describe an array-based implementation which applies to unrestricted region operations, as required by the Genometric Query Language. Specifically, we define a wide spectrum of operations over datasets which are represented using arrays, and we show that the array-based implementation scales well upon Spark, also thanks to a data representation which is effectively used for supporting machine learning. Our benchmark, which uses an independent, pre-existing collection of queries, shows that in many cases the novel array-based implementation improves the performance of the row-based implementation.
R03: Recommendation Systems (Tuesday 21st April, 10:00-11:30)
Title Authors
R03.1 198 "Group Recommendation with Latent Voting Mechanism" Lei Guo (Shandong Normal University)*; Hongzhi Yin (The University of Queensland); Qinyong Wang (The University of Queensland); Bin Cui (Peking University); Zi Huang (University of Queensland); Lizhen Cui (ShanDong University, China)
Group Recommendation (GR) is the task of suggesting relevant items/events for a group of users in online systems, whose major challenge is to aggregate the preferences of group members to infer the decision of a group. Prior group recommendation methods applied redefined static strategies for preference aggregation. However, these static strategies are insufficient to model the complicated decision making process of a group, especially for occasional groups which are formed ad-hoc. Compared to conventional individual recommendation task, GR is rather dynamic and each group member may contribute to the final group decision. Recent works argue that group members should have non-uniform weights in forming the decision of a group, and try to utilize a standard attention mechanism to aggregate the preferences of group members, but they do not model the interaction behavior among group members, and the decision making process of a group is largely unexplored.In this work, we study GR in a more general scenario, that is Occasional Group Recommendation (OGR) and focus on solving the preference aggregation problem and the data sparsity issue of group-item interactions. Instead of exploring new heuristic and standard attention-based mechanism, we propose a new social self-attention based aggregation strategy by directly modeling the interactions among group members, namely Group Self-Attention (GroupSA). In GroupSA, we treat the group decision making process as a multiple sub-voting process, and develop a stacked social self-attention network to simulate how a group consensus is reached. To overcome the data sparsity issue, we resort to the relatively abundant user-item and user-user interaction data, and enhance the representation of users by two types of aggregation methods. To improve the training process of the group-item recommendation task, we further propose a joint training method to learn the user/item embeddings in the group-item recommendation task and the user-item recommendation task simultaneously. We conduct extensive experiments on two real-world datasets. The experimental results demonstrate the superiority of our GroupSA method compared to several state-of-the-art methods in terms of HR and NDCG.
R03.2 275 Price-aware Recommendation with Graph Convolutional Networks Yu Zheng (Tsinghua University)*; Chen Gao (Tsinghua University); Xiangnan He (University of Science and Technology of China); Yong Li (Tsinghua University); Depeng Jin (Tsinghua University)
With the development of mobile devices and communication techniques, spatial crowdsourcing platforms are becoming more and more popular in recent years. One of the typical topics of spatial crowdsourcing is the task assignment, which assigns crowd workers to users' requests in real time and maximizes the total revenue. However, it is common that the available crowd workers over a platform are too far away to serve the requests, so some user requests may be rejected or responded at high money cost after long waiting. Fortunately, the neighbors of a platform usually have available resources for the same services. If different platforms can collaboratively conduct the task allocation, it can greatly affect the quality of services, and have not been investigated yet. In this paper, we propose a Cross Online Matching (COM), which enables a platform to ``borrow" some unoccupied crowd workers from other platforms for completing the user requests. If the requests attempts to be completed by ``borrowed" workers, COM decides how much should be paid to the borrowed crowd workers, such that the borrowed workers are stimulated to accept the request and the request target platform can gain some revenue as well. We propose two algorithms, deterministic cross online matching (DemCOM) and randomized cross online matching (RamCom) for COM. DemCOM guarantees the largest obtained revenue in a greedy manner, while RamCom considers the trade-off between the obtained revenue and the probability of request being accepted by the borrowed workers. Extensive experimental results verify the effectiveness and efficiency of our algorithms.
R03.3 551 Syndrome-aware Herb Recommendation with Multi-Graph Convolution Network yuanyuan jin (East China Normal University)*; Wei Zhang (ECNU); Xiangnan He (University of Science and Technology of China); Xinyu Wang (East China Normal University); Xiaoling Wang (East China Normal University)
Herb recommendation plays a crucial role in the therapeutic process of Traditional Chinese Medicine (TCM), which aims to recommend a set of herbs to treat the symptoms of a patient. While several machine learning methods have been developed for herb recommendation, they are limited in modeling only the interactions between herbs and symptoms, and ignoring the intermediate process of syndrome induction. When performing TCM diagnostics, an experienced doctor typically induces a syndrome from the patient's symptoms, and then suggests herbs based on the induced syndrome. As such, we believe the induction of syndrome --- an overall description of the symptoms --- is important for herb recommendation and should be properly handled. In this paper, we propose a new method that takes the syndrome induction into account for herb recommendation. Given a set of symptoms to treat, we feed the embeddings of the symptoms into a Multi-Layer Perceptron (MLP) to generate the representation of the syndrome, so as to mimic how a doctor induces the syndrome. Towards symptom embedding learning, we additionally construct a symptom-symptom graph from the input prescriptions,for capturing the interactions (co-occurred patterns) between symptoms; we then build graph convolution networks (GCNs) on both symptom-symptom and symptom-herb graphs to learn symptom embedding. Similarly, we construct a herb-herb graph, and build GCNs on both herb-herb and symptom-herb graphs to learn herb embedding, which is finally interacted with the syndrome representation to predict the scores of herbs. The advantage of such a multi-graph GCN architecture is that more comprehensive representations can be obtained for symptoms and herbs. We conduct extensive experiments on a public TCM dataset, demonstrating significant improvements over state-of-the-art herb recommendation methods. Further studies justify the effectiveness of our design of syndrome representation and multiple graphs.
R03.4 591 PoisonRec: An Adaptive Data Poisoning Framework for Attacking Black-box Recommender Systems Junshuai Song (Peking University)*; Zhao Li (Alibaba Group); Zehong Hu (Alibaba Group); Wucheng Wu (Peking University); Zhenpeng Li (Alibaba Group); Jian Li (Alibaba Group); Jun Gao (Peking University)
Recommender systems can help to predict users' preferences, and show users the items that they are potentially interested in. While the data-driven recommender system has been deployed in many real online service platforms, several studies show that they are vulnerable to data poisoning attacks, and attackers have the ability to mislead the system to perform as their desires. However, existing attack methods are either general heuristic strategies that are with limited performance, or particularly designed for given recommendation algorithms, requiring strong knowledge. Consider the realistic scenario where the recommender system is usually a black-box for attackers, and complex algorithms may be deployed in them. How to learn effective attack strategies on such recommender systems is still an open problem. In this paper, we propose an adaptive data poisoning framework, PoisonRec, which can automatically learn effective attack strategies on different recommender systems with very limited knowledge. PoisonRec leverages the reinforcement learning architecture, in which an attack agent actively injects poison data (fake user behaviors) into the recommender system, and then can improve its attack strategies through reward signals that are available under the strict black-box setting. Specifically, we model the sequential attack trajectory as the Markov Decision Process (MDP) in reinforcement learning. We also design a Biased Complete Binary Tree (BCBT) to reformulate the action space for better attack performance. We adopt 8 widely-used representative recommendation algorithms as our testbeds, and make extensive experiments on 4 different real-world datasets. The results show that PoisonRec has the ability to achieve good attack performance on various recommender systems with limited knowledge.
R03.5 769 Toward Recommendation for Upskilling: Modeling Skill Improvement and Item Difficulty in Action Sequences Kazutoshi Umemoto (The University of Tokyo)*; Tova Milo (Tel Aviv University); Masaru Kitsuregawa (University of Tokyo, Japan)
How can recommender systems help people improve their skills? As a first step toward recommendation for the upskilling of users, this paper addresses the problems of modeling the improvement of user skills and the difficulty of items in action sequences where users select items at different times. We propose a progression model that uses latent variables to learn the monotonically non-decreasing progression of user skills. Once this model is trained with the given sequence data, we leverage it to find a statistical solution to the item difficulty estimation problem, where we assume that users usually select items within their skill capacity. Experiments on five datasets (four from real domains, and one generated synthetically) revealed that (1) our model successfully captured the progression of domain-dependent skills; (2) multi-faceted item features helped to learn better models that aligned well with the ground-truth labels available in the synthetic dataset; (3) the learned skill and difficulty features were practically useful to predict items and ratings in action sequences; and (4) exploiting the dependency structure of our model for parallel computation made the training process more efficient.
R04: Graphs and Social Networks 1 (Tuesday 21st April, 10:00-11:30)
Title Authors
R04.1 28 "Exploring Finer Granularity within the Cores: Efficient (k,p)-core Computation" CHEN ZHANG (UNSW)*; Fan Zhang (University of New South Wales); Wenjie Zhang (University of New South Wales); BOGE LIU (University of New South Wales); Ying Zhang (University of Technology Sydney); Lu Qin (UTS); Xuemin Lin (University of New South Wales)
Recent works strengthen the long-term argument that each user in a community should have at least a certain fraction p of neighbors inside the community, especially for users with relatively large degrees. Besides, a uniform degree constraint k should be applied to ensure a minimum level of user engagement, especially for users with small degrees. Motivated bythe above facts, we propose and study a novel cohesive subgraph model, named (k,p)-core, which is a maximal subgraph where each vertex has at least k neighbours and at least p fraction of its neighbours in the subgraph. We propose an O(m) algorithm to compute a (k,p)-core with given k and p, and an O(dm) algorithm to decompose a graph by (k,p)-core, where m is thenumber of edges in the graph G and d is the degeneracy of G. A space efficient index is designed for time optimal (k,p)-core query processing. Novel techniques are proposed for the maintenance of (k,p)-core index against graph dynamic. Extensive experiments on 8 real-life datasets demonstrate that our (k,p)-core model is effective and the algorithms are efficient.
R04.2 29 "Kaskade: Graph Views for Efficient Graph Analytics" Joana M. F. da Trindade (MIT)*; Konstantinos Karanasos (Microsoft); Carlo Curino (Microsoft); Samuel Madden (MIT); Julian Shun (MIT)
Graphs are a natural way to model real-world entities and relationships between them, ranging from social networks to data lineage graphs and biological datasets. Queries over these large graphs often involve expensive sub-graph traversals and complex analytical computations. These real-world graphs are often substantially more structured than a generic vertex-and-edge model would suggest, but this insight has remained mostly unexplored by existing graph engines for graph query optimization purposes. In this work, we leverage structural properties of graphs and queries to automatically derive materialized graph views that can dramatically speed up query evaluation. We present KASKADE, the first graph query optimization framework to exploit materialized graph views for query optimization purposes. KASKADE employs a novel constraint-based view enumeration technique that mines constraints from query workloads and graph schemas, and injects them during view enumeration to significantly reduce the search space of views to be considered. Moreover, it introduces a graph view size estimator to pick the most beneficial views to materialize given a query set and to select the best query evaluation plan given a set of materialized views. We evaluate its performance over real-world graphs, including the provenance graph that we maintain at Microsoft to enable auditing, service analytics, and advanced system optimizations. Our results show that KASKADE substantially reduces the effective graph size and yields significant performance speedups (up to 50X), in some cases making otherwise intractable queries possible.
R04.3 38 "Efficient Top-k Edge Structural Diversity Search" Qi Zhang (Beijing Institute of Technology); Rong-Hua Li (Beijing Institute of Technology)*; Qixuan Yang (Beijing Institute of Technology); Guoren Wang (Beijing Institute of Technology); Lu Qin (UTS)
The structural diversity of an edge, as measured by the number of connected components of the edge's ego-network, has recently been recognized as a key metric for analyzing social influence and information diffusion in social networks. Given this, an important problem in social network analysis is to identify top-k edges that have the highest structural diversities. In this work, we, for the first time, perform a systematical study for the top-k edge structural diversity search problem on large graphs. Specifically, we first develop a new online search framework with two basic upper-bounding rules to efficiently solve this problem. Then, we propose a new index structure using near-linear space to process the top-k edge structural diversity search in near-optimal time. To create such an index structure, we devise an efficient algorithm based on an interesting connection between our problem and the 4-clique enumeration problem. In addition, we also propose novel index maintenance techniques to handle dynamic graphs. The results of extensive experiments on several large real-life datasets demonstrate the efficiency, scalability, and effectiveness of our algorithms.
R04.4 79 "Adaptive Relation Discovery from Focusing Seeds on Large Networks" Zhuo Wang (Institute of Information Engineering, CAS; School of Cyber Security, UCAS)*; Chaokun Wang (Tsinghua University); Weiping Wang (Institute of Information Engineering, CAS, China); Xiaoyan Gu (Institute of Information Engineering, Chinese Academy of Sciences); Bo Li ( Institute of Information Engineering, Chinese Academy of Sciences); Dan Meng (Institute of Information Engineering, CAS)
Given a network and a set of seeds related to each other, the problem of relation discovery from focusing seeds aims to discover the relations among all the seeds. Due to its wide applications, the task has been well studied in the literature. However, when facing applications where maybe not all the seeds relate to each other, methods for the task will discover many vertices unrelated to the seeds. To support such applications, a new problem called adaptive relation discovery from focusing seeds (A-RDFS) is proposed and studied in this article. Given a network and a set of seeds which may not be related to each other, discover additional vertices to reveal the relations among the seeds which are related to each other.To solve the A-RDFS problem, a relation sensitive subgraph structure called α-relation core is proposed to find vertices related to a subset of the seeds. Thereafter, a metric called the relation quality is proposed to measure the quality of discovered relations. The metric is positively correlated with the α value of each discovered α-relation core. Hence, by maximizing the relation quality, a set of α-relation cores with large α values can be discovered, which reveals the relations among the seeds related to each other. Two methods are developed to optimize the relation quality. Then, using the optimizing methods as subroutines, a method called OPT-A-RDFS is designed for the A-RDFS problem. Extensive experimental results demonstrate the performance of our methods.
R04.5 379 "Repairing Entities using Star Constraints in Multirelational Graphs" Peng Lin (Washington State University)*; Qi Song (Washington State University); Yinghui Wu (Washington State University); Jiaxing Pi (Siemens Corporate Technology)
This paper studies a class of neighborhood constraints to characterize and repair erroneous entity information in multi-relational graph data. (1) We propose a classof constraints called star functional dependencies (StarFDs). Unlike conventional integrity constraints, a StarFD enforces value dependencies conditioned by entities and their relevant neighbors, which are identified by a star pattern that incorporates conjunctive regular path queries. StarFDs achieves a balance between expressiveness and complexity: the validation of StarFDs is tractable, and the satisfiability and implication of StarFDs are NP-complete and coNP-complete, respectively. (2) Given a set of StarFDs Σ and graph G, the entity repair problem is to compute a minimum repair of G by enforcing Σ with the smallest amount of changes. Although this problem is NP-complete and hard to approximate, we show it is feasible to compute repairs in large graphs. Our approach (a) discriminately detects and resolves errors with optimal, approximable and cost-bounded solutions whenever possible; and (b) incurs a time cost determined by Σ and the size of inconsistencies, for all cases. Using real world data, we show that StarFDs-based techniques effectively identify and repair errors. We also show that our repairing algorithms benefit other tasks such as fact checking.
R05: Database Security and Privacy 1 (Tuesday 21st April, 13:30-15:00)
Title Authors
R05.1 853 "Practical Anonymous Subscription with Revocation Based on Broadcast Encryption" Xun Yi (RMIT University); Russell Paulet (RMIT University)*; Elisa Bertino (Purdue University, USA); Fang-Yu Rao (Purdue University)
In this paper we consider the problem where a client wishes to subscribe to some product or service provided by a server, but maintain their anonymity. At the same time, the server must be able to authenticate the client as a genuine user and be able to discontinue (or revoke) the client's access if the subscription fees are not paid. Current solutions for this problem are typically constructed using some combination of blind signature or zero-knowledge proof techniques, which do not directly support client revocation (that is, revoking a user before expiry of their secret value). In this paper, we present a solution for this problem on the basis of the broadcast encryption scheme, suggested by Boneh et al., by which the server can broadcast a secret to a group of legitimate clients. Our solution allows the registered client to log into the server anonymously and also supports client revocation by the server. Our solution can be used in many applications, such as location-based queries. We formally define a model for our anonymous subscription protocol and prove the security of our solution under this model. In addition, we present experimental results from an implementation of our protocol. These experimental results demonstrate that our protocol is practical.
R05.2 606 "SVkNN: Efficient Secure and Verifiable k-Nearest Neighbour Query on the Cloud Platform" Ningning Cui (Northeastern University); Xiaochun Yang (Northeastern University)*; Bin Wang (Northeastern University); Jianxin Li (Deakin University); Guoren Wang (Beijing Institute of Technology)
With the boom in cloud computing, data outsourcing in location-based services is proliferating and has attracted increasing interest from research communities and commercial applications. Nevertheless, since the cloud server is probably both untrusted and malicious, concerns of data security and result integrity have become on the rise sharply. However, there exist little work that can commendably assure the data security and result integrity using a unified way. In this paper, we study the problem of secure and verifiable k nearest neighbor query (SVkNN). To support SVkNN, we first propose a novel unified structure, called verifiable and secure index (VSI). Based on this, we devise a series of secure protocols to facilitate query processing and develop a compact verification strategy. Given an SVkNN query, our proposed solution can not merely answer the query efficiently while can guarantee: 1) preserving data privacy, query privacy, results privacy, and access patterns privacy; 2) authenticating the correctness and completeness of the results without leaking the confidentiality. Finally, the formal security analysis and complexity analysis are theoretically proven and the performance and feasibility of our proposed approach are empirically evaluated and demonstrated.
R05.3 656 "An Anomaly Detection System for the Protection of Relational Database Systems against Data Leakage by Application Programs" Daren K Fadolalkarim (purdue)*; Elisa Bertino (Purdue University, USA); Asmaa Sallam (Purdue)
Application programs are a possible source of attacks to databases. A malicious attacker might exploit vulnerabilities in a privileged database application. The attacker can perform code injection or code-reuse attack in order to retrieve and steal sensitive data. Such attacks result in changes in the program’s behavior. These changes can be detected by using monitoring techniques. One such technique is monitoring the library calls that the application program issues while running. In this paper, we propose AD-PROM, an Anomaly Detection system that aims at protecting relational database systems against malicious/compromised applications PROgraMs aiming at stealing data. AD-PROM tracks library calls executed by application programs on data extracted from a database. The system operates in two phases. The first phase is a profile creation phase during which the behavior of the application is analyzed statically and dynamically in order to build profiles that represent the application’s normal behavior. AD-PROM analyzes the control and data flow of the application program (i.e., static analysis), and builds a hidden Markov model trained by the program traces (i.e., dynamic analysis). During the second phase, the program actions are monitored in order to detect anomalies that may represent attempts to data leakage. We have implemented AD-PROM and carried extensive experimental activities to assess its performance. The experimental results show very high accuracy at detecting the changes in the application programs’ behaviors and very low false positive rates.
R05.4 683 "SFour: A Protocol for Cryptographically Secure Record Linkage at Scale" Basit Khurram (University of Waterloo); Florian Kerschbaum (University of Waterloo)*
The prevalence of various (and increasingly large) datasets presents the challenging problem of discovering common entities dispersed across disparate datasets. Solutions to the private record linkage problem (PRL) aim to enable such explorations of datasets in a secure manner. A two-party PRL protocol allows two parties to determine for which entities they each possess a record (either an exact matching record or a fuzzy matching record) in their respective datasets --- without revealing to one another information about any entities for which they do not both possess records. Although several solutions have been proposed to solve the PRL problem, no current solution offers a fully cryptographic security guarantee while maintaining both high accuracy of output and subquadratic runtime efficiency. To this end, we propose the first known efficient PRL protocol that runs in subquadratic time, provides high accuracy, and guarantees cryptographic security in the semi-honest security model.
R06: Query Processing 1 (Tuesday 21st April, 13:30-15:00)
Title Authors
R06.1 58 "I/O Efficient Approximate Nearest Neighbour Search based on Learned Functions" Mingjie Li (University of Technology Sydney)*; Ying Zhang (University of Technology Sydney); Yifang Sun (University of New South Wales); Wei Wang (University of New South wales); Ivor Tsang (University of Technology Sydney); Xuemin Lin (University of New South Wales)
Approximate nearest neighbour search (ANNS) inhigh dimensional space is a fundamental problem in manyapplications, such as multimedia database, computer visionand information retrieval. Among many solutions, data-sensitivehashing-based methods are effective to this problem, yet few ofthem are designed for external storage scenarios and hence donot optimize for the I/O efficiency during the query processing. Inthis paper, we introduce a novel data-sensitive indexing and queryprocessing framework for ANNS with an emphasis on optimizingthe I/O efficiency, especially, the sequential I/Os. The proposedindex consists of several lists of point IDs, ordered by learnedfunction values on each corresponding data points. The functionis learned from the data and approximately preserves the order inthe high-dimensional space. We consider two instantiations of thefunctions (linear and non-linear), both learned from the data withnovel objective functions. We also develop an I/O efficient ANNSframework based on the index. Comprehensive experiments onseveral benchmarking datasets show that our proposed methodswith learned index structure performs much better than the state-of-the-art I/O-optimized ANNS methods
R06.2 78 "Efficient Query Processing with Optimistically Compressed Hash Tables & Strings in the USSR" Tim Gubner (CWI)*; Viktor Leis (Friedrich Schiller University Jena); Peter Boncz (CWI)
Modern query engines rely heavily on hash tables for query processing. Overall query performance and memory footprint is often determined by how hash tables and the tuples within them are represented. In this work, we propose three complementary techniques to improve this representation:Domain-Guided Prefix Suppression bit-packs keys and values tightly to reduce hash table record width.Optimistic Splitting decomposes values (and operations on them) into (operations on) frequently-accessed and infrequently-accessed value slices. By removing the infrequently-accessed value slices from the hash table record, it improves cache locality.The Unique Strings Self-aligned Region (USSR) accelerates handling frequently-occurring strings, which are very common in real-world data sets, by creating an on-the-fly dictionary of the most frequent strings. This allows executing many string operations with integer logic and reduces memory pressure.We integrated these techniques into Vectorwise. On the TPC-H benchmark, our approach reduces peak memory consumption by 2-4x and improves performance by up to 1.5x. On a real-world BI workload, we measured a 2x improvement in performance and in micro-benchmarks we observed speedups of up to 25x.
R06.3 173 UniKV: Toward High-Performance and Scalable KV Storage in Mixed Workloads via Unified Indexing Zhang Qiang (USTC); Yongkun Li (University of Science and Technology of China)*; Patrick P. C. Lee (The Chinese University of Hong Kong); Yinlong Xu (University of Science and Technology of China ); Qiu Cui (PingCAP); Liu Tang (PingCAP)
Persistent key-value (KV) stores are mainly designed based on Log-Structured Merge-Trees (LSM-trees), which suffer from large read and write amplifications, especially when KV stores grow in size. Various design optimizations are developed to improve the performance of LSM-tree-based KV stores, but they usually possess certain trade-offs and fail to simultaneously improve both read and write performance on large KV stores without sacrificing scan performance. In this paper, we develop UniKV, which unifies the key design ideas of hash indexes and LSM-trees to take advantage of their strengths in a single system. Specifically, UniKV leverages data locality to differentiate the management of KV pairs with different index structures, and develops multiple techniques to tackle the issues introduced by unifying the index structures so as to make them properly integrated. Experiments demonstrate that UniKV outperforms several state-of-the-art KV stores; for example, compared to LevelDB, RocksDB, HyperLevelDB and PebblesDB, UniKV can achieve 6.2x-6.6x, 4.5x-5.3x, 4.6x-5.2x and 2.0x-2.5x overall throughput under read-write mixed workloads of different read- write ratios, respectively.
R06.4 186 Improved Correlated Sampling for Join Size Estimation TaiNing Wang (National University of Singapore)*; Chee-Yong Chan (National University of Singapore)
Recent research on sampling-based join size estimation has focused on a promising new technique known as correlated sampling. While several variants of this technique have been proposed, there is a lack of a systematic study of this family of techniques. In this paper, we first introduce a framework to characterize its design space in terms of five parameters. Based on this framework, we propose a new correlated sampling based technique to address the limitations of existing techniques. Our new technique is based on using a discrete learning method for estimating the join size from samples. We experimentally compare the performance of multiple variants of our new technique and identify a hybrid variant that provides the best estimation quality. This hybrid variant not only outperforms the state-of-the-art correlated sampling technique, but it is also more robust to small samples and skewed data.
R06.5 236 "MESSI: In-Memory Data Series Indexing" Botao Peng (Paris Descartes University )*; Panagiota Fatourou ( University of Crete); Themis Palpanas (Paris Descartes University)
Data series similarity search is a core operation for several data series analysis applications across many different domains. However, the state-of-the-art techniques fail to deliver the time performance required for interactive exploration, or analysis of large data series collections. In this work, we propose MESSI, the first data series index designed for in-memory operation on modern hardware. Our index takes advantage of the modern hardware parallelization opportunities (i.e., SIMD instructions, multi-core and multi-socket architectures), in order to accelerate both index construction and similarity search processing times. Moreover, it benefits from a careful design in the setup and coordination of the parallel workers and data structures, so that it maximizes its performance for in-memory operations. Our experiments with synthetic and real datasets demonstrate that overall MESSI is up to 4x faster at index construction, and up to 11x faster at query answering than the state-of-the-art parallel approach. MESSI is the first to answer exact similarity search queries on 100GB datasets in ~50msec (30-70msec across diverse datasets), which enables real-time, interactive data exploration on very large data series collections.
R07: Mining Time Series and Spatial Data (Tuesday 21st April, 13:30-15:00)
Title Authors
R07.1 189 Spatial Transition Learning on Road Networks with Deep Probabilistic Model Xiucheng Li (Nanyang Technological University)*; Gao Cong (Nanyang Technological Univesity); Yun Cheng (ETH Zurich)
In this paper, we study the problem of predicting the most likely traveling route on the road network between two given locations by considering the real-time traffic. We present a deep probabilistic model–DeepST–which unifies three key explanatory factors, the past traveled route, the impact of destination and real-time traffic for the route decision. DeepST explains the generation of next route by conditioning on the representations of the three explanatory factors. To enable effectively sharing the statistical strength, we propose to learn representations of K-destination proxies with an adjoint generative model. To incorporate the impact of real-time traffic, we introduce a high dimensional latent variable as its representation whose posterior distribution can then be inferred from observations. An efficient inference method is developed within the Variational Auto-Encoders framework to scale DeepST to large-scale datasets. We conduct experiments on two real-world large-scale trajectory datasets to demonstrate the superiority of DeepST over the existing methods on two tasks: the most likely route prediction and route recovery from sparse trajectories. In particular, on one public large-scale trajectory dataset, DeepST surpasses the best competing method by almost 50% on the most likely route prediction task and up to 15% on the route recovery task in terms of accuracy.
R07.2 441 Active Model Selection for Positive Unlabeled Time Series Classification Shen Liang (Fudan University)*; Yanchun Zhang (Victoria University); Jiangang Ma (Federation University Australia)
Positive unlabeled time series classification (PUTSC) refers to classifying time series with only a small set PL of positive labeled examples and a large set U of unlabeled ones. While numerous PUTSC methods have been proposed, how to select the best model (and best parameters) from multiple candidate PUTSC models remains an open question. Motivated by its great influence on PUTSC performance, we look into the problem of PUTSC model selection. To the best of our knowledge, this is the first systematic study in this topic. In particular, we focus on model selection for PUTSC methods that follow the self-training based one-nearest-neighbor (ST-1NN) paradigm, as it has been adopted by the vast majority of existing PUTSC models. In the face of shortage of initially labeled data, we embrace the active learning (AL) methodology to query more labels from an oracle. With limited AL labeling budget, we present the novel concepts of self-training tree (ST-tree) and influence to fully exploit the mechanism of ST-1NN. Based on these concepts, we propose a simple yet effective evaluation strategy to estimate the performances of candidate ST-1NN models using limited queried examples. We also present a family of three influence-based sampling strategies for active model selection. We test the proposed methods on over a hundred datasets from the latest version of the UCR Time Series Classification Archive. We also present a case study in arrhythmia detection where we show that the proposed methods can achieve near optimal results by querying very few labels from the AL oracle.
R07.3 621 Neighbor Profile: Bagging Nearest Neighbors for Unsupervised Time Series Mining Yuanduo He (Peking University)*; Xu Chu (Peking University); Yasha Wang (Peking University)
Unsupervised time series mining has been attracting great interest from both academic and industrial communities. As two most basic data mining tasks, the discoveries of frequent and rare subsequences have been extensively studied and implemented as motif and discord mining in the literature. Recently, a novel data mining framework, \textit{Matrix Profile}, has been proposed to unify motif and discord under a same computing model, which draws an increasing attention in the research community due to its promising mining results. However, matrix profile fails to identify rare subsequences when it occurs more than once in the time series, which is widely known as the \textit{twin freak problem}. This problem is just the ``tip of the iceberg'' due to the essence of matrix profile. In this work, we for the first time provide a clear theoretical analysis of matrix profile (including motif and discord) as the 1-nearest neighbor based nonparametric density estimation of subsequence. Hence, it inherently brings three issues: \textit{low-quality density estimation}, \textit{gravity defiant behavior}, and \textit{lack of reusable model}, which deteriorate the performance of matrix profile in terms of efficiency and mining quality. To overcome these issues, we propose a novel approach, \textit{Neighbor Profile}, to robustly model the subsequence density by bagging nearest neighbors. Specifically, we bootstrap multiple subsamples and average the density estimations from subsamples by leveraging nearest neighbor ball, which not only enhances the estimation robustness but also realizes a reusable model for efficient learning. We check the sanity of neighbor profile on synthetic data and further evaluate it on real-world datasets. The experimental results demonstrate that neighbor profile can correctly model the subsequences of different densities and shows superior performance over matrix profile on the real-world arrhythmia dataset. Furthermore, it is shown that neighbor profile is also efficient for mining massive datasets.
R07.4 435 "Massively-Parallel Change Detection for Satellite Time Series Data with Missing Values" Fabian Gieseke (University of Copenhagen)*; Sabina Rosca (Wageningen University); Troels Henriksen (University of Denmark); Dr.J. Verbesselt (Wageningen University, The Netherlands), Cosmin Oancea (University of Copenhagen);
Large amounts of satellite data are now becoming available, which, in combination with appropriate change detection methods, offer the opportunity to derive accurate information on timing and location of disturbances such as deforestation events across the earth surface. Typical scenarios require the analysis of billions of image patches/pixels. While various change detection techniques have been proposed in the literature, the associated implementations usually do not scale well, which renders the corresponding analyses computationally very expensive or even impossible. In this work, we propose a novel massively-parallel implementation for a state-of-the-art change detection method and demonstrate its potential in the context of monitoring deforestation. The novel implementation can handle large scenarios in a few hours or days using cheap commodity hardware, compared to weeks or even years using the existing publicly available code, and enables researchers, for the first time, to conduct global-scale analyses covering large parts of our Earth using little computational resources. From a technical perspective, we provide a high-level parallel algorithm specification along with several performance-critical optimizations dedicated to efficiently map the specified parallelism to modern parallel devices. While a particular change detection method is addressed in this work, the algorithmic building blocks provided are also of immediate relevance to a wide variety of related approaches in remote sensing and other fields.
R07.5 627 Skyline Cohesive Group Queries in Large Road-social Networks Qiyan Li (Wuhan University); Yuanyuan Zhu (Wuhan University)*; Jeffrey Xu Yu (Chinese University of Hong Kong)
Given a network with social and spatial information, cohesive group queries aim at finding a user group of certain size, which are strongly connected and closely co-located. Most existing studies limit to finding groups either with the strongest social ties under certain spatial constraint or minimum spatial distance under certain social constraints. A few works have tried to combine social and spatial cohesiveness, but their balancing parameters are hard to tune. In this paper, we argue that the social cohesiveness and spatial cohesiveness are two inherently different measures and cohesive group queries can thus be naturally modeled as a skyline query. For the first time, given a road-social network $\mathcal{G} =(G_r, G_s)$ where $G_r$ is the road network and $G_r$ is the social network, we study to finding a set of skyline cohesive group queries, in which each group cannot be dominated by any other groups in terms of social cohesiveness and spatial cohesiveness. We measure the social cohesiveness by $(k, c)$-core (a $k$ core of size $c$), and measure the spatial cohesiveness by travel cost of group members. However, such skyline problem is NP-hard as we need to explore a large number of combinations of $c$ nodes to check whether it is a qualified $(k,c)$-core. Thus, in this paper, we first provide exact solution by a efficient pruning strategy to filtering out a large number of combinations, and then propose highly efficiently greedy solution based on a newly designed $cd$-tree to keep the distance and social information simultaneously. Experimental results show that our exact method can find the exact result on large road-social network with billion edge in reasonable time and our greedy method can significantly reduce the computation cost by 1-3 orders of magnitude while increase the travel cost by no more than than 5\% compared out exact method in multiple real road-social networks.
R08: Graph and Social Networks 2 (Tuesday 21st April, 13:30-15:00)
Title Authors
R08.1 145 "Anchored Vertex Exploration for Community Engagement in Social Networks" Taotao Cai (Deakin University); Jianxin Li (Deakin University)*; Nur Al Hasan Haldar (The University of Western Australia); Ajmal Mian (University of Western Australia); John Yearwood (Deakin University); Timos Sellis (Swinburne University of Technology)
User engagement has recently received significant attention in understanding the decay and expansion of communities in social networks. However, there is no existing works to consider users’ specific interests to analyse user engagement in community formation. Therefore, in this paper, we will fill the gap by investigating the problem of user engagement from the perspective of attributed communities. Given a set of keywords W, a structure cohesive parameter k, and a budget parameter l, our objective is to find l number of users who can induce the maximal expanded community. Meanwhile, every community member must contain the given keywords in W and the community should meet the specified structure cohesiveness constraint k. We introduce this problem as best-Anchored Vertex set Exploration (AVE).To solve the AVE problem, we first develop a Filter-Verify framework by maintaining the intermediate results using multiway tree and probing the best-anchored users in a best search way. To accelerate the efficiency, we further design a keyword-aware anchored and follower index, and also develop an index-based efficient algorithm. The proposed algorithm can greatly reduce the expensive cost of computing anchored users and their followers. In addition, we also present two bound properties that can guarantee the correctness of our solution. Finally, we demonstrate the efficiency of our proposed algorithms and index. We also measure the effectiveness of attributed community-based user engagement model by conducting extensive experiments on five real-world datasets.
R08.2 297 "Optimizing Knowledge Graphs through Voting-based User Feedback" Ruida Yang (ECNU); Xin Lin (ECNU); Jianliang Xu (Hong Kong Baptist University)*; Yan Yang (ECNU); Liang He (ECNU)
Knowledge graphs have been used in a wide range of applications to support search, recommendation, and question answering (Q&A). For example, in Q&A systems, given a new question, we may use a knowledge graph to automatically identify the most suitable answers based on similarity evaluation. However, such systems may suffer from two major limitations. First, the knowledge graph constructed based on source data may contain errors. Second, the knowledge graph may become out-of-date and cannot quickly adapt to new knowledge. To address these issues, in this paper, we propose an interactive framework that refines and optimizes knowledge graphs through user votes. We develop an efficient similarity evaluation notion, called extended inverse P-distance, based on which the graph optimization problem can be formulated as a signomial geometric programming problem. We then propose a basic single-vote solution and a more advanced multi-vote solution for graph optimization. We also propose a split-and-merge optimization strategy to scale up the multi-vote solution. Extensive experi- ments based on real-life and synthetic graphs demonstrate the effectiveness and efficiency of our proposed framework.
R08.3 501 "AutoSF: Searching Scoring Functions for Knowledge Graph Embedding" Yongqi Zhang (HKUST)*; Quanming Yao (4Paradigm); Wenyuan Dai (4Paradigm Inc.); Lei Chen (Hong Kong University of Science and Technology)
Scoring functions (SFs), which measure the plausibility of triplets in knowledge graph (KG), have become the crux of KG embedding. Lots of SFs, which target at capturing different kinds of relations in KGs, have been designed by humans in recent years. However, as relations can exhibit complex patterns that are hard to infer before training, none of them can consistently perform better than others on existing benchmark data sets. In this paper, inspired by the recent success of automated machine learning (AutoML), instead of proposing another new SF, we propose to automatically design SFs (AutoSF) for distinct KGs by the AutoML techniques. However, it is non-trivial to explore domain-specific information here to make AutoSF efficient and effective. We firstly identify a unified representation over popularly used SFs, which helps to set up a search space for AutoSF. Then, we propose a greedy algorithm to search in such a space efficiently. The algorithm is further sped up by a filter and a predictor, which can avoid repeatedly training SFs with same expressive ability and help removing bad candidates during the search before model training. Finally, we perform extensive experiments on benchmark data sets. Results on link prediction and triplets classification show that the searched SFs by AutoSF, are KG dependent, new to the literature, and outperform the state-of-the-art SFs designed by humans.
R08.4 589 "Semantic Guided and Response Times Bounded Top-k Similarity Search over Knowledge Graphs" Yuxiang Wang (Hangzhou Dianzi University)*; Arijit Khan (Nanyang Technological University); Tianxing Wu (Nanyang Technological University); Jiahui Jin (Southeast University); Haijiang Yan (Hangzhou Dianzi University)
Recently, graph query is widely adopted for querying knowledge graphs. Given a query graph $G_Q$, graph query finds subgraphs in a knowledge graph $G$ that exactly or approximately match $G_Q$. We face two challenges on graph query over a knowledge graph: (1) the structural gap between $G_Q$ and the predefined schema in $G$ causes mismatch with query graph, (2) users cannot view the answers until the graph query terminates, leading to a longer system response time (SRT). In this paper, we propose a semantic guided and response-time-bounded graph query to return top-k answers effectively and efficiently. We first leverage a knowledge graph embedding model to build the semantic graph $SG_Q$ for each $G_Q$. Then we define the path semantic similarity ($pss$) over $SG_Q$ to evaluate the answer's quality. We propose an A* semantic search on $SG_Q$ to find top-k answers with the greatest $pss$ via a heuristic $pss$ estimation. Furthermore, we make an approximate optimization on A* semantic search to allow users to trade off the effectiveness for SRT within a user-specific time bound. Extensive experiments over real datasets confirm the effectiveness and efficiency of our approach.
R08.5 694 "PPKWS: An Efficient Framework for Keyword Search on Public-Private Networks" Jiaxin Jiang (Hong Kong Baptist University)*; Xin Huang (Hong Kong Baptist University); Byron Choi (Hong Kong Baptist University); Jianliang Xu (Hong Kong Baptist University); Sourav S Bhowmick (Nanyang Technological University); Lyu Xu (Hong Kong Baptist University)
Abstract—Due to the unstructuredness and the lack of schemas of graphs, such as knowledge graphs, social networks and RDF graphs, keyword search has been proposed for querying such graphs/networks. In many applications (e.g., social networks), users may prefer to hide her/his parts or all of the data graph (e.g., private friendships) from the public. This leads to a recent graph model, namely the public-private network model, in which each user has his/her own network. While there have been studies on public-private network analysis, the keyword search on public- private networks has not yet been studied. For example, the query answers on the private networks and the combination of the private and public networks can be different. In this paper, we propose a new keyword search framework, called public- private keyword search (PPKWS). PPKWS consists of three major steps: partial evaluation, answer refinement, and answer completion. We propose optimizations for these steps. Since there have been plenty of keyword search semantics, we select three representative ones and show that they can be implemented on the model with minor modifications. In addition to the ease of implementation, we have verified through experiments that, on average, the algorithms implemented on the top of PPKWS run 113 times faster than the original algorithms directly running on the public network attached to the private network.
R09: Database Security and Privacy 2 (Wed 22nd April, 10:00-11:30)
Title Authors
R09.1 140 "Privacy-preserving Real-time Anomaly Detection Using Edge Computing" Shagufta Mehnaz (Purdue University)*; Elisa Bertino (Purdue University, USA)
Anomaly detection on data collected by devices, such as sensors and IoT objects, is inevitable for many critical systems, e.g., an anomaly in the data of a patient's health monitoring device may indicate a medical emergency situation. Because of the resource-constrained nature of these devices, data collected by such devices are usually off-loaded to the cloud/edge for storage and/or further analysis. However, to ensure data privacy it is critical that the data be transferred to and managed by the cloud/edge in an encrypted form which necessitates efficient processing of such encrypted data for real-time anomaly detection. Motivated by the simultaneous demands for data privacy and real-time data processing, in this paper, we investigate the problem of a privacy-preserving real-time anomaly detection service on sensitive, time series, streaming data. We propose (i) a lightweight and aggregation optimized encryption scheme to encrypt the data before off-loading the data to the edge and (ii) a privacy-preserving framework that enables efficient anomaly detection on encrypted data. We demonstrate our solution for a widely used anomaly detection algorithm, windowed Gaussian anomaly detector and evaluate the performance of the solution in terms of the obtained model privacy, accuracy, latency, and communication cost.
R09.2 712 "To Warn or Not to Warn: Online Signaling in Audit Games" Chao Yan (Vanderbilt UNIVERSITY)*; Haifeng Xu (University of Virginia); Yevgeniy Vorobeychik (Washington University in St. Louis); Bo Li (UIUC); Daniel Fabbri (Vanderbilt UNIVERSITY); Bradley Malin (Vanderbilt University)
Routine operational use of sensitive data is often governed by law and regulation. For instance, in the medical domain, there are various statues at the state and federal level that dictate who is permitted to work with patients’ records and under what conditions. To screen for potential privacy breaches, logging systems are usually deployed to trigger alerts whenever a suspicious access is detected. However, such mechanisms are often inefficient because 1) the vast majority of triggered alerts are false positives, 2) small budgets make it unlikely that a real attack will be detected, and 3) attackers can behave strategically, such that traditional auditing mechanisms cannot easily catch them. To improve efficiency, information systems may invoke signaling, so that whenever a suspicious access request occurs, the system can, in real time, warn the user that the access may be audited. Then, at the close of a finite period, a selected subset of suspicious accesses are audited. This gives rise to an online problem in which one needs to determine 1) whether a warning should be triggered and 2) the likelihood that the data request event will be audited. In this paper, we formalize this auditing problem as a Signaling Audit Game (SAG), in which we model the interactions between an auditor and an attacker in the context of signaling and the usability cost is represented as a factor of the auditor’s payoff. We study the properties of its Stackelberg equilibria and develop a scalable approach to compute its solution. We show that a strategic presentation of warnings adds value in that SAGs realize significantly higher utility for the auditor than systems without signaling. We perform a series of experiments with 10 million real access events, containing over 26K alerts, from a large academic medical center to illustrate the value of the proposed auditing model and the consistency of its advantages over existing baseline methods.
R09.3 802 "One-sided Differential Privacy" Ios Kotsogiannis (Duke University)*; Stelios Doudalis (U.C. Irvine); Sam Haney (Duke); Ashwin Machanavajjhala (Duke); Sharad Mehrotra (U.C. Irvine)
We study the problem of privacy-preserving data sharing, wherein only a subset of the records in a database is sensitive, possibly based on predefined privacy policies. Existing solutions, viz, differential privacy (DP), are over-pessimistic as they treat all records as sensitive.Alternatively, techniques like access control and personalized differential privacy that reveal all non-sensitive records truthfully indirectly leak whether a record is sensitive and consequently the record's value. In this work we introduce one-sided differential privacy (OSDP) that offers provable privacy guarantees to the sensitive records. In addition, OSDP satisfies the sensitivity masking property which ensures that any algorithm satisfying OSDP does not allow an attacker to significantly decrease his/her uncertainty about whether a record is sensitive or not. We design OSDP algorithms that can truthfully release a sample of non-sensitive records. Such algorithms can be used to support applications that must output true data with little loss in utility, especially when using complex types of data like images or location trajectories. Additionally, we present OSDP algorithms for releasing count queries, which leverage the presence of non-sensitive records and are able to offer up to a x6 improvement in accuracy over state-of-the-art DP-solutions.
R09.4 807 "Providing Input-Discriminative Protection for Local Differential Privacy" Xiaolan Gu (University of Arizona)*; Ming Li (University of Arizona); Li Xiong (Emory University); Yang Cao (Kyoto University)
Local Differential Privacy (LDP) provides provable privacy protection for data collection without the assumption of the trusted data server. In the real-world scenario, different data have different privacy requirements due to the distinct sensitivity levels. However, LDP provides the same protection for all data. In this paper, we tackle the challenge of providing input-discriminative protection to reflect the distinct privacy requirements of different inputs. We first present the Input-Discriminative LDP (ID-LDP) privacy notion and focus on a specific version termed MinID-LDP, which is shown to be a fine-grained version of LDP. Then, we develop the IDUE mechanism based on Unary Encoding for single-item input and the extended mechanism IDUE-PS (with Padding-and-Sampling protocol) for item-set input. The results on both synthetic and real-world datasets validate the correctness of our theoretical analysis and show that the proposed mechanisms satisfying MinID-LDP have better utility than the state-of-the-art mechanisms satisfying LDP due to the input-discriminative protection.
R09.5 299 "Differentially Private Online Task Assignment in Spatial Crowdsourcing: A Tree-based Approach" Qian Tao (Beihang University); Yongxin Tong (Beihang University)*; Zimu Zhou (ETH Zurich); Yexuan Shi (Beihang University); Lei Chen (Hong Kong University of Science and Technology); Ke Xu (Beihang University)
With spatial crowdsourcing applications such as Uber and Waze deeply penetrated into everyday life, there is a growing concern to protect user privacy in spatial crowdsourcing. Particularly, locations of workers and tasks should be properly processed via certain privacy mechanism before reporting to the untrusted spatial crowdsourcing server for task assignment. Privacy mechanisms typically permute the location information, which tends to make task assignment ineffective. Prior studies only provide guarantees on privacy protection without assuring the effectiveness of task assignment. In this paper, we investigate privacy protection for online task assignment with the objective of minimizing the total distance, an important task assignment formulation in spatial crowdsourcing. We design a novel privacy mechanism based on Hierarchically Well-Separated Trees (HSTs). We prove that the mechanism is $\epsilon$-Geo-Indistinguishable and show that there is a task assignment algorithm with a competitive ratio of $O(\frac{1}{\epsilon^4}\log N \log^2 k)$, where $\epsilon$ is the privacy budget, $N$ is the number of predefined points on the HST, and $k$ is the matching size. Extensive experiments on synthetic and real datasets show that online task assignment under our privacy mechanism is notably more effective in terms of total distance than under prior differentially private mechanisms.
R10: Query Processing 2 (Wed 22nd April, 10:00-11:30)
Title Authors
R10.1 257 "Chainlink: Indexing Big Time Series Data For Long Subsequence Matching" Noura S Alghamdi (WPI)*; liang zhang (WPI); Huayi Zhang (WPI); Elke Rundensteiner (WPI); Mohamed Y. Eltabakh (Worcester Polytechnic Institute)
Scalable subsequence matching is critical for supporting analytics on big time series from mining, prediction to hypothesis testing. However, state-of-the-art subsequence matching techniques do not scale well to TB-scale data sets. Not only does index construction become prohibitively expensive, but also the query response time deteriorates quickly as the length of the query subsequence exceeds several 100s of data points. Although Locality Sensitive Hashing (LSH) has emerged as promising solution for indexing long time series, it relies on expensive hash functions that take multiple passes over the data and thus is impractical for big time series. In this work, we propose a lightweight distributed indexing framework, called Chainlink, that supports approximate kNN queries over TB-scale time series data. As formal foundation of Chainlink, we design a novel hashing technique, called Single Pass Signature (SPS), that successfully tackles the above problem. In particular, we prove theoretically and demonstrate experimentally that the similarity proximity of the indexed subsequences is preserved by our proposed single-pass SPS scheme. Leveraging this SPS innovation, Chainlink then adopts a three-step approach for scalable index building: (1) in-place data re-organization within each partition to enable efficient record-level random access to all subsequences, (2) parallel building of hash-based local indices on top of the re-organized data using our SPS scheme for efficient search within each partition, and (3) efficient aggregation of the local indices to construct a centralized yet highly compact global index for effective pruning of irrelevant partitions during query processing. Chainlink achieves the above three steps in one single map-reduce process. Our experimental evaluation shows that Chainlink indices are compact at less than 2% of dataset size while state-of-the-art index sizes tend to be almost the same size as the dataset. Better still, Chainlink is up to 2 orders of magnitude faster in its index construction time compared to state-of-the-art techniques, while improving both the final query response time by up to 10 fold and the result accuracy by 15%.
R10.2 262 "Random Sampling for Group-By Queries
-Hung Shih"
Trong Nguyen (Iowa State University); Ming-Hung Shih (Iowa State University)*; Sai Parvathaneni (Iowa State University); Bojian Xu (Eastern Washington University); Divesh Srivastava (AT&T Labs Research); Srikanta Tirthapura (Iowa State University)
Random sampling has been widely used in approximate query processing on large databases, due to its potential to significantly reduce resource usage and response times, at the cost of a small approximation error. We consider random sampling for answering the ubiquitous class of group-by queries, which first group data according to one or more attributes, and then aggregate within each group after filtering through a predicate. The challenge with group-by queries is that a sampling method cannot focus on optimizing the quality of a single answer (e.g. the mean of selected data), but must simultaneously optimize the quality of a set of answers (one per group). We present CVOPT, a query- and data-driven sampling framework for a set of queries that return multiple answers, e.g. group-by queries. To evaluate the quality of a sample, CVOPT defines a metric based on the norm (e.g. $\ell_2$ or $\ell_\infty$) of the coefficients of variation (CVs) of different answers, and constructs a stratified sample that provably optimizes the metric. CVOPT can handle group-by queries on data where groups have vastly different statistical characteristics, such as frequencies, means, or variances. CVOPT jointly optimizes for multiple aggregations and multiple group-by clauses, and provides a way to prioritize specific groups or aggregates. It can be tuned to cases when partial information about a query workload is known, such as a data warehouse where queries are scheduled to run periodically. Our experimental results show that CVOPT outperforms current state-of-the-art on sample quality and estimation accuracy for group-by queries. On a set of queries on two real-world data sets, CVOPT yields relative errors that are $5 \times$ smaller than competing approaches, under the same budget.
R10.3 500 "PA-Tree: Polled-Mode Asynchronous B+-Tree for NVMe" Li Wang (Yitu Technology)*; Zining Zhang (Yitu Technology); Bingsheng He (National University of Singapore); Zhenjie Zhang (Singapore R&D, Yitu Technology Ltd.,)
R10.4 539 "Distributed Streaming Set Similarity Join" Jianye Yang (Hunan University)*; Wenjie Zhang (University of New South Wales); Xiang Wang (UNSW); Ying Zhang (University of Technology Sydney); Xuemin Lin (University of New South Wales)
With the prevalence of Internet access and user generated content, a large number of documents/records, such as news and web pages, have been continuously generated in an unprecedented manner. In this paper, we study the problem of efficient stream set similarity join over distributed systems, which has broad applications in data cleaning and data integration tasks, such as on-line near-duplicate detection. In contrast to prefix-based distribution strategy which is widely adopted in offline distributed processing, we propose a simple yet efficient length-based distribution framework which dispatches incoming records by their length. A load-aware length partition method is developed to find a balanced partition by effectively estimating local join cost to achieve good load balance. Our length-based scheme is surprisingly superior to its competitors since it has no replication, small communication cost, and high throughput. We further observe that the join results from the current incoming record can be utilized to guide the index construction, which in turn can facilitate the join processing of future records. Inspired by this observation, we propose a novel bundle-based join algorithm by grouping similar records on-the-fly to reduce filtering cost. A by-product of this algorithm is an efficient verification technique, which verifies a batch of records by utilizing their token differences to share verification costs, rather than verifying them individually. Extensive experiments conducted on Storm, a popular distributed stream processing system, suggest that our methods can achieve up to one order of magnitude throughput improvement over baselines.
R10.5 557 "Cool: a COhort OnLine analytical processing system" Zhongle Xie (National University of Singapore); Cong Yue (National University of Singapore); YING Hongbin (MZH Tech); Meihui Zhang (Beijing Institute of Technology); Gang Chen (Zhejiang University); Beng Chin Ooi (NUS)*
With a huge volume and variety of data accumulated over the years, OnLine Analytical Processing (OLAP) systems are facing challenges in query efficiency.Furthermore, the design of OLAP systems cannot serve modern applications well due to their inefficiency in processing complex queries such as cohort queries with low query latency.In this paper, we present Cool, a cohort online analytical processing system.As an integrated system with the support of two newly proposed operators on top of a sophisticated storage layer, it processes both cohort queries and conventional OLAP queries with superb performance. Its distributed design requires minimal load balancing and fault tolerance support and is scalable.Our evaluation results show that Cool outperforms two state-of-the-art systems, MonetDB and Druid, by a wide margin in single-node setting,and a state-of-the-art distributed system, SparkSQL, by one order of magnitude in terms of query efficiency in distributed environments.
R11: Data Mining and Knowledge Discovery 1 (Wed 22nd April, 10:00-11:30)
Title Authors
R11.1 35 "TransN: Heterogeneous Network Representation Learning by Translating Node Embeddings" Zijian Li (Hong Kong University of Science and Technology)*; Wenhao Zheng (Nanjing University); Xueling Lin (HKUST); Ziyuan Zhao (Tencent); Zhe Wang (Hong Kong University of Science and Technology); Yue WANG (Hong Kong University of Science and Technology); Xun Jian (HKUST); Lei Chen (Hong Kong University of Science and Technology); Qiang Yan (Tencent); Tiezheng Mao (Tencent)
Learning network embeddings has attracted growing attention in recent years. However, most of the existing methods focus on homogeneous networks, which cannot capture the important type information in heterogeneous networks. To address this problem, in this paper, we propose TransN, a novel multi-view network embedding framework for heterogeneous networks. Compared with the existing methods, TransN is an unsupervised framework which does not require node labels or user-specified meta-paths as inputs. In addition, TransN is capable of handling more general types of heterogeneous networks than the previous works. Specifically, in our framework TransN, we propose a novel algorithm to capture the proximity information inside each single view. Moreover, to transfer the learned information across views, we propose an algorithm to translate the node embeddings between different views based on the dual-learning mechanism, which can both capture the complex relations between node embeddings in different views, and preserve the proximity information inside each view during the translation. We conduct extensive experiments on real-world heterogeneous networks, whose results demonstrate that the node embeddings generated by TransN outperform those of competitors in various network mining tasks.
R11.2 327 "An Adaptive Master-Slave Regularized Model for Unexpected Revenue Prediction Enhanced with Alternative Data" jin xu (Institute for Interdisciplinary Information Sciences (IIIS), Tsinghua University)*; Jingbo Zhou (Baidu Inc.); Yongpo Jia (HBIS Group); Jian LI (Tsinghua University); Hui Xiong (Rutgers University)
In this paper, we study the challenging problem to predict better unexpected revenue of a company via machine learning with alternative data. To the best of our knowledge, this is the first work studying this problem in the literature. However, it is nontrivial to quantitatively model the relations between the unexpected revenue and the information provided by alternative data with a machine learning approach. Thus we proposed an adaptive master-slave regularized model, AMS for short, to effectively leverage alternative data for unexpected revenue prediction. Our AMS model first trains a master model upon the company graph, which captures the relations among companies, using a graph neural network (GNN).Then for a target company, the master model generates an adaptive slave-model, which is specially optimized for this target company. Finally, we use this slave-model to predict the unexpected revenue of the target company. Besides its excellent prediction performance, another critical advantage of our AMS model lies in its superior interpretability, which is crucial for portfolio managers to understand the predicted results. With extensive experiments using two real-world alternative datasets, we have demonstrated the effectiveness of our model against a set of competitors.
R11.3 360 "Exact and Consistent Interpretation of Piecewise Linear Models Hidden behind APIs: A Closed Form Solution" Zicun Cong (Simon Fraser University)*; Lingyang Chu (Huawei Technologies Canada); Lanjun Wang (N/A); Xia Hu (Simon Fraser University); Jian Pei (Simon Fraser University)
More and more AI services are provided through APIs on cloud where predictive models are hidden behind APIs. To build trust with users and reduce potential application risk, it is important to interpret how such predictive models hidden behind APIs make their decisions.The biggest challenge of interpreting such predictions is that no access to model parameters or training data is available. Existing works interpret the predictions of a model hidden behind an API by heuristically probing the response of the API with perturbed input instances.However, these methods do not provide any guarantee on the exactness and consistency of their interpretations.In this paper, we propose an elegant closed form solution named \texttt{OpenAPI} to compute exact and consistent interpretations for the family of Piecewise Linear Models (PLM), which includes many popular classification models.The major idea is to first construct a set of overdetermined linear equation systems with a small set of perturbed instances and the predictions made by the model on those instances. Then, we solve the equation systems to identify the decision features that are responsible for the prediction on an input instance. Our extensive experiments clearly demonstrate the exactness and consistency of our method.
R11.4 456 "Statistical Estimation of Diffusion Network Topologies" Keqi Han (Wuhuan University); Yuan Tian (Wuhan University); Yunjia Zhang (University of Wisconsin-Madison)*; Ling Han (Wuhan University); Hao Huang (Wuhan University); Yunjun Gao (" Zhejiang University, China")
Reconstructing the topology of a diffusion network based on observed diffusion results is an open challenge in data mining. Existing approaches mostly assume that the observed diffusion results are available and consist of not only the final infection statuses of nodes, but also the exact timestamps that pinpoint when infections occur. Nonetheless, the exact infection timestamps are often unavailable in practice, due to a high cost and uncertainties in the monitoring of node infections. In this work, we investigate the problem of how to infer the topology of a diffusion network from only the final infection statuses of nodes. To this end, we propose a new scoring criterion for diffusion network reconstruction, which is able to estimate the likelihood of potential topologies of the objective diffusion network based on infection status results with a relatively low statistical error. As the proposed scoring criterion is decomposable, our problem is transformed into finding for each node in the network a set of most probable parent nodes that maximizes the value of a local score. Furthermore, to eliminate redundant computations during the search of most probable parent nodes, we identify insignificant candidate parent nodes by checking whether their infections have negative or extremely low positive correlations with the infections of a corresponding child node, and exclude them from the search space. Extensive experiments on both synthetic and real-world networks are conducted, and the results verify the effectiveness and efficiency of our approach.
R11.5 457 Multiple Dense Subtensor Estimation with High Density Guarantee Quang-Huy Duong (NTNU)*; Heri Ramampiaro (Norwegian University of Science and Technology (NTNU)); Kjetil Nørvåg (Norwegian University of Science and Technology)
Dense subtensor detection is a well-studied area, with a wide range of application areas, and numerous efficient approaches and algorithms have been proposed. Existing algorithms are generally efficient for dense subtensor detection and could perform well in many applications. However, the main drawback of most of these algorithms is that they can estimate only one subtensor at a time, with a low guarantee on the subtensor's density. While some methods can, on the other hand, estimate multiple subtensors, they can give a guarantee on the density with respect to the input tensor for the first estimated subsensor only. We address these drawbacks by providing both theoretical and practical solution for estimating multiple dense subtensors in tensor data. In particular, we guarantee and prove a higher bound of the lower-bound density of the estimated subtensors. We also propose a novel approach to show that there are multiple dense subtensors with a guarantee on its density that is greater than the lower bound used in the state-of-the-art algorithms. We evaluate our approach with extensive experiments on several real-world datasets, which demonstrates its efficiency and feasibility.
R12: Graphs and Social Networks 3 (Wed 22nd April, 10:00-11:30)
Title Authors
R12.1 217 "Efficient Approximation Algorithms for Adaptive Target Profit Maximization" Keke Huang (Nanyang Technological University)*; Jing Tang (National University of Singapore); Xiaokui Xiao (National University of Singapore); Aixin Sun (Nanyang Technological University); Andrew Lim (National University of Singapore)
Given a social network G, the profit maximization (PM) problem asks for a set of seed nodes to maximize the profit, i.e., revenue of influence spread less the cost of seed selection. The target profit maximization (TPM) problem, which generalizes the PM problem, aims to select a subset of seed nodes from a target user set T to maximize the profit. Existing algorithms for PM mostly consider the nonadaptive setting, where all seed nodes are selected in one batch without any knowledge on how they may influence other users. In this paper, we study TPM in adaptive setting, where the seed users are selected through multiple batches, such that the selection of a batch exploits the knowledge of actual influence in the previous batches. To acquire an overall understanding, we study the adaptive TPM problem under both the oracle model and the noise model, and propose ADG and ADDATP algorithms to address them with strong theoretical guarantees, respectively. In addition, to better handle the sampling errors under the noise model, we propose the idea of hybrid error based on which we design a novel algorithm HATP that boosts the efficiency of ADDATP significantly. We conduct extensive experiments on real social networks to evaluate the performance, and the experimental results strongly confirm the superiorities and effectiveness of our solutions.
R12.2 417 "Efficient Bitruss Decomposition for Large-scale Bipartite Graphs" Kai Wang (University of New South Wales)*; Xuemin Lin (University of New South Wales); Lu Qin (UTS); Wenjie Zhang (University of New South Wales); Ying Zhang (University of Technology Sydney)
Cohesive subgraph mining in bipartite graphs becomes a popular research topic recently. An important structure k-bitruss is the maximal cohesive subgraph where each edge is contained in at least k butterflies (i.e., (2, 2)-bicliques). In this paper, we study the bitruss decomposition problem which aims to find all the k-bitrusses for k >= 0. The existing bottom-up techniques need to iteratively peel the edges with the lowest butterfly support. In this peeling process, these techniques are time-consuming to enumerate all the supporting butterflies for each edge. To relax this issue, we first propose a novel online index -- the BE-Index which compresses butterflies into k-blooms (i.e., (2, k)-bicliques). Based on the BE-Index, the new bitruss decomposition algorithm BiT-BU is proposed, along with two batch-based optimizations, to accomplish the butterfly enumeration of the peeling process in an efficient way. Furthermore, the BiT-PC algorithm is devised which is more efficient against handling the edges with high butterfly supports. We theoretically show that our new algorithms significantly reduce the time complexities of the existing algorithms. Also, we conduct extensive experiments on real datasets and the result demonstrates that our new techniques can speed up the state-of-the-art techniques by up to two orders of magnitude.
R12.3 450 "Kaleido: An Efficient Out-of-core Graph MiningSystem on A Single Machine" Cheng Zhao (Institute of Computing Technology, Chinese Academy of Sciences)*; Zhibin Zhang (Institute of Computing Technology, Chinese Academy of Sciences); Peng Xu (Institute of Computing Technology, Chinese Academy of Sciences); Tianqi Zheng (Institute of Computing Technology, Chinese Academy of Sciences); Jiafeng Guo (Institute of Computing Technology, Chinese Academy of Sciences)
Graph mining is one of the most important categories of graph algorithms. However, exploring the subgraphs of an input graph produces a huge amount of intermediate data. The “think like a vertex” programming paradigm, pioneered by Pregel, cannot readily formulate mining problems, which is designed to produce graph computation problems like PageRank. Existing mining systems like Arabesque and RStream need large amounts of computing and memory resources.In this paper, we present Kaleido, an efficient single machine, out-of-core graph mining system which treats disks as an extension of memory. Kaleido treats intermediate data in graph mining tasks as a tensor and adopts a succinct data structure for the intermediate data. Kaleido adopts a lightweight isomorphism checking strategy which uses an eigenvalue-based algorithm to identify graphs with less than 8 vertices and solves tree isomorphism for the other graphs. Kaleido implements half-memory-half-disk storage for storing large intermediate data, which treats the disk as an extension of the memory. Comparing with two state-of-the-art mining systems, Arabesque and RStream, Kaleido outperforms them by a GeoMean 13.2x and 64.8x respectively.
R12.4 463 "Finding the Best k in Core Decomposition: A Time and Space Optimal Solution" Deming Chu (East China Normal University); Fan Zhang (Guangzhou University)*; Xuemin Lin (University of New South Wales); Wenjie Zhang (University of New South Wales); Ying Zhang (University of Technology Sydney); Yinglong Xia (Facebook); Chenyi Zhang (Huawei Technologies)
The mode of k-core and its hierarchical decomposition have been applied in many areas, such as sociology, the world wide web, and biology. Algorithms on related studies often need an input value of parameter k, while the existing solution is tomanually choose a value for k. In this paper, given a graph and a scoring metric, we aim to efficiently find the best value of k such that the quality (score) of k-core is the highest. The problem is challenging because there are various community scoring metrics serving different objectives, and the computation is costly on large datasets. With the well-designed vertex ordering techniques, we propose time and space optimal algorithms to compute the best k, which are applicable to most community metrics, especially the representative ones. The proposed algorithms can be extended to many scenarios, e.g., finding the best connected component of k-core, and computing the quality of k-core for every possible value of k. Extensive experiments are conducted on 10 real-world networks with size up to billion-scale, which validates both the effectiveness of the results and the efficiency of our algorithms.
R12.5 485 "Updates-Aware Graph Pattern based Node Matching" Guohao Sun (Macquarie University)*; Guanfeng Liu (Macquarie University); Yan Wang (Macquarie University, Austrilia); Xiaofang Zhou (University of Queensland)
Graph Pattern based Node Matching (GPNM) is to find all the matches of the nodes in a data graph GD based on a given pattern graph GP. GPNM has become increasingly important in many applications, e.g., group finding and expert recommendation. In real scenarios, both GP and GD are updated frequently. However, the existing GPNM methods need to either perform a new GPNM procedure from scratch to deliver the node matching results based on the updated GP and GD or incrementally perform the GPNM procedure for each of the updates, leading to low efficiency. Therefore, there is a pressing need for a new method to efficiently deliver the node matching results on the updated graphs. In this paper, we first analyze and detect the elimination relationships among the updates. Then, we construct an Elimination Hierarchy Tree (EH-Tree) to index these elimination relationships. In order to speed up the GPNM process, we propose a graph partition method and then propose a new GPNM method, called UA-GPNM, considering the single-graph elimination relationships among the updates in a single graph of GP or GD, and also the cross-graph elimination relationships between the updates in GP and the updates in GD. UA-GPNM first delivers the GPNM result of an initial query, and then delivers the GPNM result of a subsequent query, based on the initial GPNM result and the multiple updates that occur between two queries. The experimental results on five real-world social graphs demonstrate that our proposed UA-GPNM is much more efficient than the state-of-the-art GPNM methods.
R13: Data Cleaning, Curation and Analytics (Wed 22nd April, 12:00-13:30)
Title Authors
R13.1 242 "Dataset Discovery in Data Lakes" Alex Bogatu (University of Manchester)*; Dr.Alvaro Fernandes (The University Of Manchester); Norman Paton (University of Manchester); Nikolaos Konstantinou (University of Manchester)
Data analytics stands to benefit from the increasing availability of datasets that are held, or can be obtained, without their conceptual relationships being explicitly known. When collected, these datasets form a data lake from which, by processes like data wrangling, specific target data\-sets can be constructed that enable value-adding analytics. Given the potential vastness of such data lakes, the issue arises of how to pull out of the lake those datasets that might contribute to wrangling out a given target. We refer to this as the problem of dataset discovery in data lakes and this paper contributes an effective and efficient solution to it. Our approach uses features of the values in a dataset to construct hash-based indexes that map those features into a uniform distance space. This makes it possible to define similarity distances between features and to take those distances as measurements of relatedness w.r.t. a target table. Given the latter (and exemplar tuples), our approach returns the most related tables in the lake. We provide a detailed description of the approach and report on empirical results for two forms of relatedness (unionability and joinability) comparing them with prior work, where pertinent, and showing significant improvements in all of precision, recall, target coverage, indexing and discovery times.
R13.2 268 "Swapping Repair for Misplaced Attribute Values" Yu Sun (Tsinghua University); Shaoxu Song (Tsinghua University)*; Chen Wang (" Tsinghua University, China"); Jianmin Wang ("Tsinghua University, China")
Misplaced data in a tuple are prevalent, e.g., a value “Passport” is misplaced in the passenger-name attribute, which should belong to the travel-document attribute instead. While repairing in-attribute errors have been widely studied, i.e., to repair the error by other value in the attribute domain, misplaced attribute values are surprisingly untouched, where the true value is simply misplaced in some other attribute of the same tuple. For instance, the true passenger-name is indeed misplaced in the travel-document attribute of the record. In this sense, we need a novel swapping repair model (to swap the misplaced passenger-name and travel-document values “Passport” and “John Adam” in the same tuple). Determining a proper swapping repair, however, is non-trivial. The minimum change criterion, evaluating the distance between the swapping repaired values, is obviously meaningless, since they are from different attribute domains. Intuitively, one may examine whether the swapped value (“John Adam”) is similar to other values in the corresponding attribute domain (passenger- name). In a holistic view of all (swapped) attributes, we propose to evaluate the likelihood of a swapping repaired tuple by studying its distances (similarity) to neighbors. The rationale of distance likelihood refers to the Poisson process of nearest neighbor appearance. The optimum repair problem is thus to find a swapping repair with the maximum likelihood on distances. Our technical contributions include (1) showing NP- hardness of the optimum swapping repair problem as well as approximating the problem, (2) devising bounds of distances for pruning and incremental computation in exact algorithms, and (3) approximation algorithm with performance guarantees. Experiments over datasets with real-world misplaced attribute values demonstrate the superiority of our proposal in repairing misplacement. Remarkably, by complementing with the existing approaches for repairing in-attribute errors, we show that the joint repairs significantly improve the potential applications such as classification and record matching when applicable.
R13.3 274 "Interactive Cleaning for Progressive Visualization through Composite Questions" Yuyu Luo (Tsinghua University); Chengliang Chai (Tsinghua University); xuedi qin (thu); Guoliang Li (Tsinghua University)*; Nan Tang (Qatar Computing Research Institute, HBKU)
Data visualization is especially important when it meets big data analytics. However, generating good visualizations is not easy. One practical reason that commonly causes bad visualizations is because real-life data is often dirty. Unfortunately, totally cleaning a dataset to get its ground truth, before using it for a data analytics task, rarely happens in practice.In this paper, we study the problem of interactive cleaning for progressive visualization (ICPV): Given a bad visualization V, it is to obtain a “cleaned” visualization V' whose distance is far from V, under a given (small) budget w.r.t. human cost. In ICPV, a system interacts with a user in many iterations. During each iteration, it asks the user a data cleaning question such as “how to clean detected errors x?”, and takes value updates from the user to clean V. Traditional wisdom typically picks a single question (e.g., “Are SIGMOD conference and SIGMOD the same?”) with the maximum expected benefit in each iteration. We propose to use a composite question – i.e., a group of single questions to be treated as one question – in each iteration (for example, Are SIGMOD conference in t1 and SIGMOD in t2 the same value, and t1 and t2 duplicates?). A composite question is presented to the user as a small connected graph through a novel GUI that the user can directly operate on. We propose algorithms to select the “best” composite question in each iteration. Experiments on real-world datasets verify that composite questions are more effective than asking single questions in isolation w.r.t. the human cost, and we can obtain a much cleaned visualization in a few interactions.
R13.4 733 "User-driven Error Detection for Time Series with Events" Kim Hung Le (University of Information Technology - VNUHCM)*; Paolo Papotti (Eurecom)
Anomalies are pervasive in time series data, such as sensor readings. Existing methods for anomaly detection cannot distinguish between anomalies that represent data errors, such as incorrect sensor readings, and notable events, such as the watering action in soil monitoring. In addition, the quality performance of such detection methods highly depends on the configuration parameters, which are dataset specific. In this work, we exploit active learning to detect both errors and events in a single solution that aims at minimizing user interaction. For this joint detection, we introduce an algorithm that accurately detects and labels anomalies with a non-parametric concept of neighborhood and probabilistic classification. Given a desired quality, the confidence of the classification is then used as termination condition for the active learning algorithm.Experiments on real and synthetic datasets demonstrate that our approach achieves F-score above 80% in detecting errors by labeling 2 to 5 points in one data series. We also show the superiority of our solution compared to the state-of-the-art approaches for anomaly detection. Finally, we demonstrate the positive impact of our error detection methods in downstream data repairing algorithms.
R13.5 332 "An Agile Sample Maintenance Approach for Agile Analytics" Hanbing Zhang (Fudan University); Yazhong Zhang (Fudan University); Zhenying He (Fudan University)*; Yinan Jing (Fudan University); Kai Zhang (Fudan University); X. Sean Wang (Fudan University)
In today's market, agile analytics can help organizations to gain and sustain competitive advantage by making timely decisions. Approximate query processing (AQP) is one of the useful approaches in agile analytics, which facilitates fast queries on big data by leveraging precomputed samples. Nevertheless, when new data is imported into the data warehouse, the initial sample will increasingly become stale and yield inaccurate query results. To avoid query accuracy degeneration, existing AQP systems usually construct a new sample from scratch by re-sampling. However, it will adversely inhibit the system efficiency due to the expensive cost of re-sampling. In this paper, to reduce the cost of sample updates, we propose an adaptive sample update (ASU) approach to avoid unnecessary expensive re-sampling operations as much as possible while keeping the query accuracy. Moreover, by sacrificing a little query accuracy, we propose an enhanced approach (T-ASU) to further speed up sample updates. In addition, we integrate these two approaches with a state-of-the-art AQP engine and conduct extensive experiments on it. Experimental results on both real-world and synthetic datasets show that the two approaches are more agile than the full re-sampling strategy when the underlying data is continuously changing, while achieving almost equivalent query accuracy as re-sampling.
R14: Query Processing 3 (Wed 22nd April, 12:00-13:30)
Title Authors
R14.1 567 "Continuously Tracking Core Items in Data Streams with Probabilistic Decays" Junzhou Zhao (King Abdullah University of Science and Technology)*; Pinghui Wang (Xi'an Jiaotong University); Jing Tao (Xi'an Jiaotong University); Shuo Zhang (Xi'an Jiaotong University); John C. S. Lui (The Chinese University of Hong Kong)
The sheer scale of big data causes the information overload issue and there is an urgent need for tools that can draw valuable insights from massive data. This paper investigates the core items tracking (CIT) problem where the goal is to continuously track representative items, called core items, in a data stream so to best represent/summarize the stream. In order to simultaneously satisfy the recency and continuity requirements, we consider CIT over probabilistic-decaying streams where items in the stream are forgotten gradually in a probabilistic manner. We first introduce an algorithm, called PNDCIT, to find core items in a special kind of probabilistic non-decaying streams}. Furthermore, using PNDCIT as a building block, we design two novel algorithms, namely PDCIT and PDCIT+, to maintain core items over probabilistic-decaying streams with constant approximation ratios. Finally, extensive experiments on real data demonstrate that PDCIT+ achieves a speedup of up to one order of magnitude over a batch algorithm while providing solutions with comparable quality.
R14.2 658 The Art of Efficient In-memory Query Processing on NUMA Systems: a Systematic Approach Puya Memarzia (University of New Brunswick)*; Suprio Ray (University of New Brunswick); Virendra C. Bhavsar (University of New Brunswick)
Data analytics systems commonly utilize in-memory query processing techniques to achieve better throughput and lower latency. Modern computers increasingly rely on Non-Uniform Memory Access (NUMA) architectures to achieve scalability. A key drawback of NUMA architectures is that many existing software solutions are not aware of the underlying NUMA topology and thus do not take full advantage of the hardware. Modern operating systems are designed to provide basic support for NUMA systems. However, default system configurations are typically sub-optimal for large data analytics applications. Additionally, rewriting the application from the ground up is not always feasible. In this work, we evaluate a variety of strategies that aim to accelerate memory-intensive data analytics workloads on NUMA systems. Our findings indicate that the operating system default configurations can be detrimental to query performance. We analyze the impact of different memory allocators, memory placement strategies, thread placement, and kernel-level load balancing and memory management mechanisms. With extensive experimental evaluation, we demonstrate that the methodical application of these techniques can be used to obtain significant speedups in four commonplace in-memory query processing tasks, on three different hardware architectures. Furthermore, we show that these strategies can improve the performance of four popular database systems running a TPC-H workload. Lastly, we summarize our findings in a decision flowchart for practitioners.
R14.3 666 "Speeding Up GED Verification for Graph Similarity Search" Lijun Chang (The University of Sydney)*; Xing Feng (University of Technology Sydney); Xuemin Lin (University of New South Wales); Lu Qin (UTS); Wenjie Zhang (University of New South Wales); Dian Ouyang (The University of Sydney)
Graph similarity search retrieves from a database all graphs whose edit distance (GED) to a query graph is within a threshold. As GED computation is NP-hard, the existing works adopt the filtering-and-verification paradigm to reduce the number of GED verifications, and they mainly focus on designing filtering techniques while using the now out-dated algorithm A∗GED for verification. In this paper, we aim to speed up GED verification, which is orthogonal to the index structures used in the filtering phase. We propose a best-first search algorithm AStar+-LSa which improves A∗ GED by (1) reducing memory consumption, (2) tightening lower bound estimation, and (3) improving the time complexity for lower bound computation. We formally show that AStar+-LSa has a lower space and time complexity than A∗GED. We further modify AStar+-LSa into a depth-first search algorithm to contrast these two search paradigms, and we extend our algorithms for exact GED computation. We conduct extensive empirical studies on real graph datasets, and show that our algorithm AStar+-LSa outperforms the state-of-the-art algorithms by several orders of magnitude for both GED verification and GED computation.
R14.4 758 "Scaling Out Schema-free Stream Joins" Damjan Gjurovski (Technische Universität Kaiserslautern)*; Sebastian Michel (TU Kaiserslautern)
In this work, we consider computing natural joins over massive streams of JSON documents that do not adhere to a specific schema. We first propose an efficient and scalable partitioning algorithm that uses the main principles of association analysis to identify patterns of co-occurrence of the attribute-values pairs within the documents. Data is then accordingly forwarded to compute nodes and locally joined using a novel FP-tree-based join algorithm. By compactly storing the documents and efficiently traversing the FP-tree structure, the proposed join algorithm can operate on large input sizes and provide the needed results in real-time. We discuss data-dependent scalability limitations that are inherent to natural joins over schema-free data and show how to practically circumvent them by artificially expanding the space of possible attribute-value pairs. The proposed algorithms are realized in the Apache Storm stream processing framework. Through extensive experiments with real and artificially created data, we evaluate the proposed algorithms and show that they outperform competing approaches.
R14.5 41 "Contribution Maximization in Probabilistic Datalog" Tova Milo (Tel Aviv University); Yuval Moskovitch (Tel Aviv Univesity); Brit Youngmann (Tel Aviv Univesity)*
The use of probabilistic datalog programs has been recently advocated for applications that involve recursive computation and uncertainty.While using such programs allows for a flexible knowledge derivation, it makesthe analysis of query results a challenging task. Particularly, given a set O of output tuples and a number k, one would like to understand which k-size subset of the input tuples have contributed the most to the derivation of O. This is usefulfor multiple tasks, such as identifying the critical sources of errors and understanding surprising results. Previous works have mainly focused on the quantification of tuples contribution to a query result in non-recursive SQL queries, very often disregarding probabilistic inference. To quantify the contribution in probabilistic datalog programs, one must account for the recursive relations between input and output data, and the uncertainty should be reflected in the computation. To this end, we formalize the Contribution Maximization (CM) problem. We then reduce CM to the well-studied Influence Maximization (IM) problem in social networks, showing that we can harness concepts and techniques developed for IM to our setting. However, we show that such naive adoption results in poor performance. To overcome this, we propose an optimized algorithm which injects a refined variant of the classic Magic Sets technique, integrated with a sampling method, into IM algorithms, achieving a significant saving of space and execution time. Our experiments demonstrate the effectiveness of our algorithm, even where the naive approach is infeasible.
R15: Data Mining and Knowledge Discovery 2 (Wed 22nd April, 12:00-13:30)
Title Authors
R15.1 248 Multiscale Frequent Co-movement pattern Mining Shahab Helmi (University of Colorado Denver)*; Farnoush Banaei-Kashani (University of Colorado Denver)
Thanks to recent prevalence of location sensors, collecting massive spatiotemporal datasets containing moving object trajectories has become possible, providing an exceptional opportunity to derive interesting insights about the behavior of the moving objects such as people, animals, and vehicles. In particular, mining patterns from \quotes{co-movements} of objects (such as movements by players of a sports team, joints of a person while walking, and cars in a transportation network) can lead to the discovery of interesting patterns (e.g., offense tactics of a sports team, gait signature of a person, and driving behaviors causing heavy traffic). Various trajectory mining and frequent pattern mining techniques have been proposed to discover patterns in trajectory datasets and more generally, event sequences. However, existing approaches are inapplicable to co-movement pattern mining from multi-trajectory datasets. In this paper, we propose a series of novel and efficient algorithms for co-movement pattern mining at both single and multiple scales. The performance of the proposed solutions is evaluated by conducting extensive experiments using two real datasets, a soccer game dataset and a human gait dataset.
R15.2 264 Self-paced Ensemble for Highly Imbalanced Massive Data Classification Zhining Liu (Jilin University); Wei Cao (MSRA); Zhifeng Gao (Microsoft Research); Jiang Bian (Microsoft Research); Hechang Chen (Jilin University)*; Yi Chang (Jilin University); Tieyan Liu (Microsoft Research)
Many real-world applications reveal difficulties inlearning classifiers from imbalanced data. The rising big dataera has been witnessing more classification tasks with largescalebut extremely imbalance and low-quality datasets. Mostof existing learning methods suffer from poor performance orlow computation efficiency under such a scenario. To tackle thisproblem, we conduct deep investigations into the nature of classimbalance, which reveals that not only the disproportion betweenclasses, but also other difficulties embedded in the nature of data,especially, noises and class overlapping, prevent us from learningeffective classifiers. Taking those factors into consideration, wepropose a novel framework for imbalance classification that aimsto generate a strong ensemble by self-paced harmonizing datahardness via under-sampling. Extensive experiments have shownthat this new framework, while being very computationallyefficient, can lead to robust performance even under highlyoverlapping classes and extremely skewed distribution. Note that,our methods can be easily adapted to most of existing learningmethods (e.g., C4.5, SVM, GBDT and Neural Network) to boosttheir performance on imbalanced data.
R15.3 651 SAN: Scale-Space Attention Networks Yash Garg (Arizona State University)*; K. Selçuk Candan (Arizona State University); Maria Luisa Sapino ("U. Torino, Italy")
Deep neural networks, especially convolutional neural networks (CNNs), have been effective in various data-driven applications. Yet, DNNs suffer from several major challenges; in particular, in many applications where the input data is relatively sparse, DNNs face the problems of overfitting to the input data and poor generalizability. This brings up several critical questions: ``Are all inputs equally important?'' ``Can we selectively focus on parts of the input data in a way that reduces overfitting to irrelevant observations?'' Recently, attention networks showed some success in helping the overall process focus onto parts of the data that carry higher importance in the current context. Yet, we note that the current attention network design approaches are not sufficiently informed about the key data characteristics in identifying salient regions in the data. we propose an innovative robust feature learning framework, scale-invariant attention networks (SAN), that identifies salient regions in the input data for the CNN to focus on. Unlike the existing attention networks, SAN concentrates attention on parts of the data where there is major change across space and scale. We argue, and experimentally show, that the salient regions identified by SAN lead to better network performance compared to state-of-the-art (attentioned and non-attentioned) approaches, including architectures such as LeNet-5, VGG-16, and RESNet-18, with common benchmark datasets, MNIST, FMNIST, CIFAR10/20/100, GTSRB and ImageNet.
R15.4 859 A Novel Approach to Learning Consensus and Complementary Information for Multi-View Data Clustering Khanh Luong (Queensland University of Technology)*; Richi Nayak (Queensland University of Technology, Brisbane)
With the advances in digital technology, multi-view data has become commonly available. Effective methods are required to be developed that can deal with the multi faceted nature of this data. We design a factorization-based loss functionbased method to simultaneously learn two components encoding the consensus and complementary information present in multi-view data respectively, by using both the Coupled Matrix Factorization (CMF) and Non-negative Matrix Factorization (NMF). We propose a novel optimal manifold for multi-view data which is the most consensed manifold embedded in the high-dimensional multi-view data. This newly optimal manifold is learned and incorporated in the proposed factorizing-based framework. A new complementary enhancing term is added in the loss function to include all possible consensus and complementary informationinherent in the multi-view data. An extensive experiment with diverse datasets, benchmarking the state-of-the-art multi-view clustering methods, has demonstrated the effectiveness of the proposed method in obtaining accurate clustering solution. Keywords: Multi-view Clustering, Non-negative Matrix Factorization, Manifold Learning, Consensus Manifold, $k$ nearest neighbour graph
R15.5 179 "Summarizing Hierarchical Multidimensional Data" Alexandra Kim (University of British Columbia); Laks V.S. Lakshmanan (The University of British Columbia); Divesh Srivastava (AT&T Labs Research)*
Data scientists typically analyze and extract insights from large multidimensional data sets such as US census data, enterprise sales data, and so on. But before sophisticated machine learning and statistical methods are employed, it is useful to build and explore concise summaries to gain an understanding of the data set. While a variety of summaries have been proposed over the years, the goal of creating a concise summary of multidimensional data that can provide worst-case accuracy guarantees has remained elusive.In this paper, we propose Tree Summaries, which attain this challenging goal over hierarchical multidimensional data sets. Intuitively, a Tree Summary is a weighted "embedded tree” in the lattice that is the cross-product of the dimension hierarchies; individual data values can be efficiently estimated by looking up the weight of its unique closest ancestor in the Tree Summary. We study the problems of generating lossless as well as (given a desired worst-case accuracy guarantee α) lossy Tree Summaries. We develop a polynomial-time algorithm that constructs the optimal (i.e., most concise) Tree Summary for each of these problems; this is a surprising result given the NP-hardness of constructing a variety of other optimal summaries over multidimensional data. We complement our analytical results with an empirical evaluation of our algorithm, and demonstrate with a detailed set of experiments on real and synthetic data that our algorithm outperforms prior methods in terms of conciseness of summaries or accuracy of estimation.
R16: Graphs and Social Networks 4 (Wed 22nd April, 12:00-13:30)
Title Authors
R16.1 513 "Efficient Team Formation in Social Networks based on Constrained Pattern Graph" Yue Kou (Northeastern University)*; Derong Shen (Northeastern University); Quinn Snell (Brigham Young University); Dong Li (Northeastern University); Tiezheng Nie (Northeastern University); Ge Yu (Northeastern University); Shuai Ma (Beihang University)
Finding a team that is both competent in performing the task and compatible in working together has been extensively studied. However, most methods for team formation tend to rely on a set of skills. In order to solve this issue, in this paper, we present an efficient team formation method based on Constrained Pattern Graph (called CPG). Unlike traditional methods, our method takes into account both structure constraint and communication constraint on team members, which can better meet the requirements of users. First, a decomposition method is proposed to decompose the CPG into one core part with a core node and several noncore parts. Second, we construct a Communication Cost Index (called CCI) to speed up the matching between the CPG and social networks. Third, an efficient CCI-based node matching algorithm is proposed to determine the matching order of nodes with the aim to minimize the total number of intermediate results. Also some incremental maintenance strategies for the changes of social networks are proposed. We conduct experimental studies based on two real-world social networks. Experiments demonstrate the effectiveness and the efficiency of our proposed method compared with traditional methods.
R16.2 550 "Effective and Efficient Truss Computation Over Large Heterogeneous Information Networks" Yixing Yang (University of New South Wales)*; Yixiang Fang (University of New South Wales); Xuemin Lin (University of New South Wales); Wenjie Zhang (University of New South Wales)
Recently, the topic of truss computation has gained plenty of attention, where the k-truss of a graph is the maximum subgraph in which each edge participates in at least (k − 2) triangles. While this topic has been extensively studied and applied in various applications, existing solutions mainly focus on homogeneous networks, where vertices are of the same type, and thus cannot be applied to heterogeneous information networks (HINs) which consist of multi-typed and interconnected objects, such as the bibliographic networks and knowledge graphs. In this paper, we study the problem of truss computation over HINs, which aims to find groups of vertices that are of the same type and densely connected.To model the relationship between two vertices of the same type, we adopt the well-known concept of meta-path, which is a sequence of vertex types and edge types between two given vertex types. We then introduce two kinds of HIN triangles for three vertices, regarding a specific meta-path P. The first one requires that each pair of vertices is connected by an instance of P, while the second one also has such a connectivity constraint but further needs that the three instances of P form a circle. Based on these two kinds of triangles, we propose two HIN truss models respectively. We further develop efficient truss computation algorithms. We have performed extensive experiments on five real large HINs, and the results show that the proposed solutions are highly effective and efficient.
R16.3 559 "Index-Free Approach with Theoretical Guarantee for Efficient Random Walk with Restart Query" Dandan Lin (Hong Kong University of Science and Technology)*; Raymond Chi-Wing Wong (Hong Kong University of Science and Technology); Min XIE (HKUST); Victor Junqiu Wei (Huawei technologies)
Due to the prevalence of graph data, graph analysis is very important nowadays.One popular analysis on graph data is Random Walk with Restart (RWR)since it provides a good metric for measuring the proximity of two nodes in a graph.Although RWR is important, it is challenging to design an algorithm for RWR.To the best of our knowledge, there are no existing RWR algorithms which, at the same time, (1) are index-free,(2) return answers with a theoretical guarantee and (3) are efficient. Motivated by this, in this paper, we propose an index-free algorithm called \emph{\underline{Res}idue-\underline{Acc}umulated approach} (ResAcc)which returns answers with a theoretical guarantee efficiently. Our experimental evaluations on large-scale real graphs show that \emph{RecAcc} is up to 4 times faster than the best-known previous algorithm, guaranteeing the same accuracy.Under typical settings, the best-known algorithm ran around 1000 seconds on a large dataset containing 41.7 million nodes, which is too time-consuming, while \emph{RecAcc} finished in 225 seconds with the same accuracy.Moreover, \emph{RecAcc} is up to 6 orders of magnitude more accurate than the best-known algorithm in practice with the same execution time, which is considered as a substantial improvement.
R16.4 578 "Optimization of a GPU-based Sparse Matrix Multiplication for Large Sparse Networks" JeongMyung Lee (Hanyang Univeristy); Seokwon Kang (Hanyang Univeristy); YongSeung Yu (Hanyang University); Yong-Yeon Jo (Hanyang University); Sang-Wook Kim (Hanyang University, Korea); Yongjun Park (Hanyang University)*
Sparse matrix multiplication (spGEMM) is widely used for core computations toanalyze sparse networks of social networks and biology data, and extract theirimportant information based on an adjacency matrix representation. As sparsematrix multiplication contains high data parallelism, many efficientimplementations are introduced using data-parallel programming platforms suchas CUDA and OpenCL on graphic processing unit (GPU) architectures.Most spGEMM techniques, such as cuSPARSE and CUSP, consist of row-product based intermediate data expansion and parallel data merge processes.While they generally show fair performance gain on GPUs, they often do not utilize theresource fully due to load imbalance between threads on the expansion process,and high memory contention on the merge process.Whereas several outer-product based spGEMM techniques are proposed to solve the load balancing problem on expansion, they still do not utilizethe GPU resources fully because a severe computation load variation exists between multiple thread blocks (TBs).To maximize GPU resource utilization by solving these problems, this paperproposes a new optimization pass called Thread Block Reorganizer tobalance the total computations of each computing unit on target GPUs based onthe outer-product based expansion process, and lower memory pressure the on the merge process.For expansion, the Thread Block Reorganizer first identifiesthe actual computation amount of each TB, and then performs two TBtransformation processes based on a deep understanding of the GPU executionmodel: 1) TB-Splitting to transform a heavy computation TB into multiple smallTBs and 2) TB-Merging to merge multiple small computation TBs into a larger TB.For merge, the Thread Block Reorganizer improves the total performance byperforming TB-Limiting to limit the number of allocated TBs on an SM.Experimental results show that the combination of multiple techniques highlyimproves the total performance of kernel execution by 1.28x over the row-product spGEMM baseline on average on NVIDIA Titan Xp GPUs for real-world datasets.
R16.5 586 "VAC: Vertex-Centric Attributed Community Search" Qing Liu (Hong Kong Baptist University); Yifan Zhu (Zhejiang University); Minjun Zhao (Zhejiang University); Xin Huang (Hong Kong Baptist University); Jianliang Xu (Hong Kong Baptist University); Yunjun Gao (" Zhejiang University, China")*
Attributed community search aims to find the community with high structure cohesiveness and attribute cohesiveness from attributed graphs. However, existing studies suffer from two major limitations: (i) it is not easy to set the conditions on query attributes; and (ii) the queries support only a single type of attributes. To this end, in this paper, we explore a novel attributed community search called vertex-centric attributed community (VAC) search. Given an attributed graph and a query vertex set, VAC search returns the community which is densely connected (guaranteed by the k-truss model) and has the best attribute score. We show that the problem is NP-hard. To answer VAC search efficiently, we develop both exact and approximate algorithms. We present two exact algorithms. One finds the community in a depth-first manner and the other is in a best-first manner. Moreover, we devise a suite of heuristics to prune away the unqualified search space, using the structure and attribute properties. To further improve search efficiency, we also propose a 2-approximation algorithm. Extensive experimental evaluation on various real-world attributed graphs demonstrate the effectiveness of our proposed model and the efficiency of the presented algorithms.
R17: Temporal and Spatial Data 1 (Wed 22nd April, 14:00-15:30)
Title Authors
R17.1 69 "Online Anomalous Trajectory Detection with Deep Generative Sequence Modeling" Yiding Liu (Nanyang Technological University)*; Kaiqi Zhao (University of Auckland); Gao Cong (Nanyang Technological Univesity); Zhifeng Bao (RMIT University)
Detecting anomalous trajectory has become an important and fundamental concern in many real-world applications. However, most of the existing studies 1) cannot handle the complexity and variety of trajectory data and 2) do not support efficient anomaly detection in an online manner. To this end, we propose a novel model, namely Gaussian Mixture Variational Sequence AutoEncoder (GM-VSAE), to tackle these challenges. Our GM-VSAE model is able to (1) capture complex sequential information enclosed in trajectories, (2) discover different types of normal routes from trajectories and represent them in a continuous latent space, and (3) support efficient online detection via trajectory generation. Our experiments on two real-world datasets demonstrate that GM-VSAE is more effective than the state-of-the-art baselines and is efficient for online anomalous trajectory detection.
R17.2 92 "Mobility-Aware Dynamic Taxi Ridesharing" Zhidan Liu (Shenzhen University)*; Zengyang Gong (Shenzhen University); Jiangzhou Li (Shenzhen University); Kaishun Wu (Shenzhen University)
Taxi ridesharing becomes promising and attractive because of the wide availability of taxis in a urban city and the tremendous benefits of ridesharing, e.g., alleviating traffic congestion and reducing energy consumption. Existing taxi ridesharing schemes, however, are not efficient and practical, due to they simply match ride requests and taxis based on partial trip information and omit the offline passengers, who hail a taxi at roadside with no explicit requests to the system. In this paper, we consider the mobility-aware taxi ridesharing problem, and present mT-Share to address these limitations. mT-Share fully exploits the mobility information of ride requests and taxis to achieve efficient indexing of taxis/requests and better passenger-taxi matching, while still satisfying the constraints on passengers’ deadlines and taxis’ capacities. Specifically, mT-Share indexes taxis and ride requests with both geographical information and travel direction, and supports the shortest path based routing and probabilistic routing to serve both online and offline ride requests. Extensive experiments with a large real-world taxi dataset demonstrate the efficiency and effectiveness of mT-Share, which can response each ride request in milliseconds and with an moderate detour cost. Compared to the state-of-the-art methods, mT-Share serves 42% and 62% more ride requests in peak and non-peak hours, respectively.
R17.3 155 "Online Trichromatic Pickup and Delivery Scheduling in Spatial Crowdsourcing" Bolong Zheng (Huazhong University of Science and Technology)*; Chenze Huang (Sun Yat-sen University); Christian S Jensen (Aalborg University); Lu Chen (Aalborg University, Denmark); Quoc Viet Hung Nguyen (Griffith University); Guohui Li (School of Computer Science and Technology Huazhong University of Science and Technology); Kai Zheng (University of Electronic Science and Technology of China)
In Pickup-and-Delivery problems (PDP), mobile workers are employed to pick up and deliver items with the goal of reducing travel and fuel consumption. Unlike most existing efforts that focus on finding a schedule that enables the delivery of as many items as possible at the lowest cost, we consider trichromatic (worker-itemtask) utility that encompasses worker reliability, item quality, and task profitability. Moreover, we allow customers to specify keywords for desired items when they submit tasks, which may result in multiple pickup options, thus further increasing the difficulty of the problem. Specifically, we formulate the problem of Online Trichromatic Pickup and Delivery Scheduling (OTPD) that aims to find optimal delivery schedules with maximum overall utility. In order to quickly respond to submitted tasks, we propose a greedy solution that finds the schedule with the maximum incremental utility ratio. Next, we introduce a skyline kinetic tree-based solution that materializes intermediate results to improve the result quality. Finally, we propose a density-based grouping solution that partitions tasks and efficiently assigns them to the workers with high overall utility. Extensive experiments with real and synthetic data offer evidence that the proposed solutions excel over baselines with respect to both effectiveness and efficiency.
R17.4 160 "Dependency-Aware Task Assignment in Spatial Crowdsourcing" Wangze Ni (Hong Kong University of Science and Technology); Peng CHENG (East China Normal University)*; Lei Chen (Hong Kong University of Science and Technology); Xuemin Lin (University of New South Wales)
Ubiquitous smart devices and high-quality wireless networks enable people to participate in spatial crowdsourcing tasks easily, which require workers to physically move to specific locations to conduct their assigned tasks. Spatial crowdsourcing has attracted much attention from both academia and industry. In this paper, we consider a spatial crowdsourcing scenario, where the tasks may have some dependencies among them. Specifically, one task can only be conducted when its dependent tasks have already been finished. In fact, task dependencies are quite common in many real life applications, such as house repairing and party preparation. We formally define the dependency-aware spatial crowdsourcing (DA-SC), which focuses on finding an optimal worker-and-task assignment under the constraints of dependencies, skills of workers, moving distances and deadlines with a goal of maximizing the successfully finished tasks. We prove that the DA-SC problem is NP-hard and thus intractable. Therefore, we propose two approximation algorithms, including greedy approach and game-theoretic approach, which can guarantee the approximate bounds of the results in each batch process. Through extensive experiments on both real and synthetic data sets, we demonstrate the efficiency and effectiveness of our DA-SC approaches.
R17.5 616 "Parallel semantic trajectory similarity join" Lisi Chen (HKBU)*; Shuo Shang (KAUST); Christian S Jensen (Aalborg University); Bin Yao ("Shanghai Jiaotong University, China"); Panos Kalnis (King Abdullah University of Science and Technology)
The matching of similar pairs of trajectories, called trajectory similarity join, isfundamental functionality in spatial data management. We consider the problem of semantic trajectory similarity join (STS-Join). Each semantic trajectory is a sequence of Points-of-interest (POIs) with both location and text information. Thus, given two sets of semantic trajectories and a threshold $\theta$, the STS-Join returns all pairs of semantic trajectories from the two sets with spatio-textual similarity no less than $\theta$. This join targets applications such as term-based trajectory near-duplicate detection, geo-text data cleaning, personalized ridesharing recommendation, keyword-aware route planning, and travel itinerary recommendation.With these applications in mind, we provide a purposeful definition of spatio-textual similarity. To enable efficient STS-Join processing on large sets of semantic trajectories, we develop trajectory pair filtering techniques and take into account the parallel processing capabilities of modern processors. Specifically, we present a two-phase parallel search algorithm. We first group semantic trajectories based on their text information. The algorithm's per-group searches are independent of each other and thus can be performed in parallel. For each group, the trajectories are further partitioned based on the spatial domain. We generate spatial and textual summaries for each trajectory batch, based on which we develop batchfiltering and trajectory-batch filtering techniques to prune unqualified trajectory pairs in a batch mode. Additionally, we propose an efficient divide-and-conquer algorithm to derive bounds of spatial similarity and textual similarity between two semantic trajectories, which enable us prune dissimilar trajectory pairs without the need of computing the exact value of spatio-textual similarity. An empirical study with large semantic trajectory data offers insight in the performance of the algorithm and demonstrates that is capable of outperforming a well-designed baseline algorithm by a factor of 8--12.
R18: Search and Information Extraction (Wed 22nd April, 14:00-15:30)
Title Authors
R18.1 52 "Being Happy with the Least: Achieving α-happiness with Minimum Number of Tuples" Min XIE (HKUST)*; Raymond Chi-Wing Wong (Hong Kong University of Science and Technology); Peng Peng (inspir.ai); Vassilis J. Tsotras (UC Riverside)
When faced with a database containing millions of products, a user may be only interested in a (typically much) smaller representative subset. Various approaches were proposed to create a good representative subset that fits the user's needs which are expressed in the form of a utility function (e.g., the top-k and diversification query). Recently, a regret minimization query was proposed: it does not require users to provide their utility functions and returns a small set of tuples such that any user's favorite tuple in this subset is guaranteed to be not much worse than his/her favorite tuple in the whole database. In a sense, this query finds a small set of tuples that makes the user happy (i.e., not regretful) even if s/he gets the best tuple in the selected set but not the best tuple among all tuples in the database.In this paper, we study the min-size variation of the regret minimization query; that is, we want to determine the least tuples needed to keep users happy at a given level. We term this problem as the α-happiness query where we quantify the user's happiness level by a criterion, called the happiness ratio, and guarantee that each user is at least α happy with the set returned (i.e., the happiness ratio is at least α) where α is a real number from 0 to 1. As this is an NP-hard problem, we derive an approximate solution with theoretical guarantee by considering the problem from a geometric perspective. Since in practical scenarios, users are interested in higher happiness levels (i.e., α is closer to 1), we performed extensive experiments for these scenarios, using both real and synthetic datasets. Our evaluations show that our algorithm outperforms the best-known previous approaches in two ways: (i) it answers the α-happiness query by returning fewer tuples to users and, (ii) it answers much faster (up to orders of magnitude times improvement for large α).
R18.2 138 "Improving Neural Relation Extraction with Implicit Mutual Relations" Jun Kuang (East China Normal University)*; Yixin Cao (NUS); Jianbing Zheng (East China Normal University); Xiangnan He (University of Science and Technology of China); Ming Gao (East China Normal University); Aoying Zhou (East China Normal University )
Relation extraction (RE) aims at extracting the relation between two entities from the text corpora. It is a crucial task for Knowledge Graph (KG) construction. Most existing methods predict the relation between an entity pair by learning the relation from the training sentences, which contain the targeted entity pair. In contrast to existing distant supervision approaches that suffer from insufficient training corpora to extract relations, our proposal of mining implicit mutual relation from the massive unlabeled corpora transfers the semantic information of entity pairs into the RE model, which is more expressive and semantically plausible. After constructing an entity proximity graph based on the implicit mutual relations, we preserve the semantic relations of entity pairs via embedding each vertex of the graph into a low-dimensional space. As a result, we can easily and flexibly integrate the implicit mutual relations and other entity information, such as entity types, into the existing RE methods. Our experimental results on a New York Times and another Google Distant Supervision datasets suggest that our proposed neural RE framework provides a promising improvement for the RE task, and significantly outperforms the state-of-the-art methods. Moreover, the component for mining implicit mutual relations is so flexible that can help to improve the performance of both CNN-based and RNN-based RE models significant.
R18.3 256 "SONG: Approximate Nearest Neighbor Search on GPU" Weijie Zhao (Baidu Research)*; Shulong Tan (Baidu Research); Ping Li (Baidu Research)
Approximate nearest neighbor searching is a fundamental problem in fields such as machine learning, data mining and information retrieval.Recent studies show that graph-based searching methods outperform other types of ANN methods in various common metrics.Although different graph-based methods build variant graph indices, most of the graph-based methods share the same graph searching algorithm.The searching algorithm is executed iteratively and each iteration relies on the progress of previous iterations. The execution dependency prohibits trivial GPU adaptations for graph-based methods.In this paper, we show a novel framework that decouples the searching on graph algorithm into 3 stages: candidates locating, bulk distance computation and data structures updating. Unlike other GPU Breadth-First search methods, our proposed framework is optimized specifically for the graph-based ANN searching algorithm by considering the high-dimensional nature of ANN problems.We evaluate experimentally the proposed system and compare it against HNSW--the state-of-the-art ANN method on CPU algorithm--and Faiss--the top GPU-accelerated ANN platform for dense features--on various real datasets. The results confirm the effectiveness of SONG---SONG has around \textit{30-200x} speedup compared with single-thread HNSW, while it outperforms Faiss with a factor of \textit{3-5}.
R18.4 461 "R2LSH: A Nearest Neighbor Search Scheme Based on Two-dimensional Projected Spaces" Kejing Lu (Hokkaido University)*; Mineichi Kudo (Hokkaido University)
Locality sensitive hashing(LSH) is a widely practiced c-approximate nearest neighbour(c-ANN) search algorithm due to its appealing theoretical guarantee and empirical performance. However, existing LSH-based solutions do not achieve a good balance between the cost and quality because of some limitations on their index structures. In this paper, we propose a novel and easy-to-implement disk-based method named R2LSH to answer approximate NN queries in high-dimensional spaces. In the indexing phase, R2LSH maps data objects into multiple two-dimensional projected spaces. In each space, a group of B+-trees are built to characterize the corresponding data distribution. In the query phase, by setting a query-centric ball in each projected space and using dynamic counting technique, R2LSH efficiently determines candidates and returns query results with required quality. Rigorous theoretical analysis shows that the proposed algorithm supports c-ANN search for arbitrarily small c>=1 with probability guarantee. Extensive experiments on real datasets confirm the superiority of R2LSH over the state-of-the-arts. Index Terms-Locality Sensitive Hashing, Approximate Nearest Neighbour Search, High Dimensional Spaces.
R18.5 625 "Online Indices for Predictive Top-k Entity and Aggregate Queries on Knowledge Graphs" Yan Li (UMass Lowell); Tingjian Ge (University of Massachusetts, Lowell)*; Cindy Chen (UMass Lowell)
Knowledge graphs have seen increasingly broad applications. However, they are known to be incomplete. We define the notion of a virtual knowledge graph which extends a knowledge graph with predicted edges and their probabilities. We focus on two important types of queries: top-k entity queries and aggregate queries. To improve query processing efficiency, we propose an incremental index on top of low dimensional entity vectors transformed from network embedding vectors. We also devise query processing algorithms with the index. Moreover, we provide theoretical guarantees of accuracy, and conduct a systematic experimental evaluation. The experiments show that our approach is very efficient and effective. In particular, with the same or better accuracy guarantees, it is one to two orders of magnitude faster in query processing than the closest previous work which can only handle one relationship type.
R19: Query and Stream Processing (Wed 22nd April, 14:00-15:30)
Title Authors
R19.1 70 Enabling Efficient Random Access to Hierarchically-Compressed Data Feng Zhang (Renmin University of China)*; Jidong Zhai (Tsinghua University); Xipeng Shen (North Carolina State University); Onur Mutlu (ETH Zurich); Xiaoyong Du (Renmin University of China)
Recent studies have shown the promise of direct data processing on hierarchically compressed texts. By removing the needs for decompressing data, the technique brings large savings in both time and space. However, the benefits have been limited to only data traversal operations; for random accesses, the speed is several times slower than the state of the art. This paper presents a set of techniques that successfully eliminate the limitation, and for the first time, establishes the feasibility for hierarchical compression to effectively underpin both kinds of data accesses. The work yields a new library, which achieves 3.2× speedups over the state of the art on random data accesses on compressed data, while preserving the capability in supporting traversal operations efficiently and the large (3.9×) space savings.
R19.2 719 "Adaptive Overlap Set Similarity Top-k Joins" Zhong Yang (Huazhong University of Science and Technology )*; Bolong Zheng (Huazhong University of Science and Technology); Guohui Li (School of Computer Science and Technology Huazhong University of Science and Technology); Xi Zhao (Huazhong University of Science and Technology ); Xiaofang Zhou (University of Queensland); Christian S Jensen (Aalborg University)
Set similarity join is a basic operation in data cleaning, near duplicate object detection and data integration which are fundamental functions in a modern data lake system. Traditional similarity joins require users to specify a similarity threshold. However, in a typic data lake that usually has massive number of datasets, users may not have the knowledge to give a good threshold for each dataset. In this paper, we study the set similarity top-$k$ joins problem with overlap constraint. It computes the most similar $k$ pairs of sets, ranked by their overlap similarity values. The existing methods suffer from the low efficiency and neglect the significant effect of step size in each round on the performance.So we propose a step size based framework to improve efficiency. We first present a fixed-size step algorithm to utilize the benifit of large step size, and then present an adaptive-size step algorithm to adjust step size automatically to reduce redundant calculation. Experimental results show that our methods completely outperform the state-of-the-art set similarity top-$k$ join method \textsf{topk-join} on large-scale real datasets.
R19.3 562 "Load Shedding for Complex Event Processing: Input-based and State-based Techniques" Bo Zhao (Humboldt University of Berlin)*; Quoc Viet Hung Nguyen (Griffith University); Matthias Weidlich (Humboldt-Universität zu Berlin)
Systems for complex event processing (CEP) evaluate queries over streams of events to detect user-specified patterns with low latency. Heterogeneous event sources may yield unpredictable input rates and query selectivities, though, changing by orders of magnitude during short peak times. Then, exhaustive processing is no longer reasonable, or even infeasible, and systems shall resort to best-effort query evaluation, striving for maximal accuracy while staying within a latency bound. In traditional data stream processing, this is achieved by load shedding that discards some stream elements without processing them based on their estimated utility for the query result.We argue that such input-based load shedding is not always suitable for CEP queries. It assumes that the utility of each individual element of a stream can be assessed in isolation. For CEP queries, however, this utility may be highly dynamic: Depending on the presence of partial matches, the impact of discarding a single event can vary drastically. In this work, we therefore complement input-based load shedding with a state-based technique that discards partial matches. We introduce a hybrid model that combines both input-based and state-based shedding to achieve high processing accuracy under constrained resources. Our experiments indicate that such hybrid shedding improves processing accuracy by up to 14 times for synthetic data and 11.4 times for real-world data, compared to baseline approaches.
R19.4 798 "SPEAr: Expediting Stream Processing with Accuracy Guarantees" Nikos R. Katsipoulakis (Amazon Web Services)*; Alexandros Labrinidis (University of Pittsburgh); Panos K. Chrysanthis (University of Pittsburgh)
Stream Processing Engines (SPEs) are used for real- time and continuous processing, that involves stateful opera- tions. This type of processing poses numerous challenges due to its associated complexity, unpredictable input data, and its need for timely results. As a result, users tend to over-provision resources, and online scaling is required in order to overcome overloaded situations. Current attempts for expediting stateful processing are impractical, due to their inability to uphold the quality of results, maintain performance, and reduce memory requirements.In this paper, we present the SPEAr system, which can automatically expedite processing of stateful operations by trading accuracy for performance. SPEAr detects when it can expedite processing by employing online sampling and accuracy estimation at no additional cost. We built SPEAr on top of Storm and our experiments indicate that it can reduce processing times by more than an order of magnitude, use more than an order of magnitude less memory, and offer accuracy guarantees in real-world benchmarks.
R20: Temporal Graphs (Wed 22nd April, 14:00-15:30)
Title Authors
R20.1 10 "Temporal Network Representation Learning via Historical Neighborhoods Aggregation" Shixun Huang (RMIT); Zhifeng Bao (RMIT University)*; Guoliang Li (Tsinghua University); Yanghao Zhou (NUAA); J. Shane Culpepper (RMIT University)
Abstract—Network embedding is an effective method to learn low-dimensional representations of nodes, which can be applied to various real-life applications such as visualization, node classification, and link prediction. Although significant progress has been made on this problem in recent years, several important challenges remain, such as how to properly capture temporal information in evolving networks. In practice, most networks are continually evolving. Some networks only add new edges or nodes such as authorship networks, while others support removal of nodes or edges such as internet data routing. If patterns exist in the changes of the network structure, we can better understand the relationships between nodes and the evolution of the network, which can be further leveraged to learn representations with more meaningful information. In this paper, we propose an embedding via historical neighborhoods aggregation (EHNA) method. More specifically, we first propose a temporal random walk which can identify relevant nodes in historical neighborhoods which have impact on edge formations. Then we present a deep learning model which uses a custom attention mechanism to generate node embeddings to directly capture temporal information in the underlying feature representations. We conduct extensive experiments on real-world datasets, and the results demonstrate the effectiveness of our new approach in the network reconstruction and the link prediction tasks.
R20.2 111 "An Interval-centric Model for Distributed Computing over Temporal Graphs" Swapnil S Gandhi (Indian Institute of Science Bangalore)*; Yogesh Simmhan (Indian Institute of Science Bangalore)
Algorithms for temporal property graphs may be time-dependent (TD), navigating the structure and time concurrently, or time-independent (TI), operating separately on different snapshots. Currently, there is no unified and scalable programming abstraction to design TI and TD algorithms over large temporal graphs. We propose an interval-centric computing model (ICM) for distributed and iterative processing of temporal graphs, where a vertex’s time-interval is a unit of data-parallel computation. It introduces a unique time-warp operator for temporal partitioning and grouping of messages that hides the complexity of designing temporal algorithms, while avoiding redundancy in user logic calls and messages sent. GRAPHITE is our implementation of ICM over Apache Giraph, and we use it to design 10 TI and TD algorithms from literature. We rigorously evaluate its performance for diverse real-world temporal graphs - as large as 131M vertices and 5.5B edges, and as long as 219 snapshots. Our comparison with 4 baseline platforms on a 10-node commodity cluster shows that ICM shares compute and messaging across intervals to out-perform them by up to 25x, and matches them even in worst-case scenarios. GRAPHITE also exhibits weak-scaling with near-perfect efficiency.
R20.3 679 "CrashSim: An efficient algorithm for computing SimRank over static and temporal graphs" Mo Li (Northeastern University); Farhana Choudhury (The University of Melbourne); Renata Borovica-Gajic (University of Melbourne); Zhiqiong Wang (Northeastern University); Junchang Xin ( Northeastern University)*; Jianxin Li (Deakin University)
SimRank is a significant metric to measure the similarity of nodes in graph data analysis. The problem of SimRank computation has been studied extensively. But there is no existing work that can provide one unified algorithm to support the SimRank computation on both static and temporal graphs.Therefore, in this work, we first propose CrashSim, an index-free algorithm for single-source SimRank computation in static graphs. CrashSim can provide provable approximation guarantees for the computational result in an efficient way. In addition, as the real-life graphs are often represented as temporal graphs, it is also required for CrashSim to support the efficient computation of SimRank in temporal graphs.Hence, we formally define two typical SimRank queries in temporal graphs, and then solve them by developing an efficient algorithm based on CrashSim, i.e., CrashSim-T. From the extensive experimental evaluation using five real and synthetic datasets, it can be seen that the CrashSim algorithm and CrashSim-T can substantially improve the efficiency of the state-of-the-art SimRank algorithms by about 30%, while achieving the precision of the result set with about 97%.
R20.4 736 "Efficiently Answering Span-Reachability Queries in Large Temporal Graphs" Dong Wen (University of Technology Sydney)*; Yilun Huang (University of Technology Sydney); Ying Zhang (University of Technology Sydney); Lu Qin (UTS); Wenjie Zhang (University of New South Wales); Xuemin Lin (University of New South Wales)
Reachability query is a fundamental problem in graph analysis. In applications such as social networks and collaboration networks, edges are always associated with timestamps. Most existing works on reachability queries in temporal graphs assume that two vertices are related if they are connected by a path with non-decreasing timestamps (time-respecting) of edges. This assumption fails to capture the relationship between entities involved in the same group or activity with no time-respecting path connecting them. In this paper, we define a new reachability model, called span-reachability, designed to relax the time order dependency and identify the relationship between entities in a given time period. We adopt the idea of two-hop cover and propose an index-based method to answer span-reachability queries. Several optimizations are also given to improve the efficiency of index construction and query processing. We conduct extensive experiments on 17 real-world datasets to show the efficiency of our proposed solution.
R21: Temporal and Spatial Data 2 (Thu 23rd April, 11:00-12:30)
Title Authors
R21.1 55 "Turbo-boosting Geospatial Visualization Dashboards via a Materialized Sampling Cube Approac" Jia Yu (Arizona State University)*; Mohamed Sarwat (Arizona State University)
In this paper, we present a middleware that runs on top of a SQL data system with the purpose of increasing the interactivity of geospatial visualization dashboards. The proposed system adopts a sampling cube approach that stores pre-materialized spatial samples and allows users to define their own accuracy loss function such that the produced samples can be used for various user-defined visualization tasks. The system ensures that the difference between the sample fed into the visualization dashboard and the raw query answer never exceeds the user-specified loss threshold. To reduce the number of cells in the sampling cube and hence mitigate the initialization time and memory utilization, the system employs two main strategies: (1) a partially materialized cube to only materialize local samples of those queries for which the global sample (the sample drawn from the entire dataset) exceeds the required accuracy loss threshold. (2) a sample selection technique that finds similarities between different local samples and only persists a few representative samples. Based on the extensive experimental evaluation, Tabula can bring down the total data-to-visualization time (including both data-system and visualization times) of a heat map generated over 700 million taxi rides to 600 milliseconds with 250 meters user-defined accuracy loss. Besides, Tabula costs up to two orders of magnitude less memory footprint (e.g., only 800 MB for the running example) and one order of magnitude less initialization time than the fully materialized sampling cube approach.
R21.2 260 "Sya: Enabling Spatial Awareness inside Probabilistic Knowledge Base Construction" Ibrahim Sabek (University of Minnesota, Twin Cities); Mohamed Mokbel (Qatar Computing Research Institute)*
This paper presents Sya; the first spatial probabilistic knowledge base construction system, based on Markov Logic Networks (MLN). Sya injects the awareness of spatial relationships inside the MLN grounding and inference phases, which are the pillars of the knowledge base construction process, and hence results in a better knowledge base output. In particular, Sya generates a probabilistic model that captures both logical and spatial correlations among knowledge base relations. Sya provides a simple spatial high-level language, a spatial variation of factor graph, a spatial rules-query translator, and a spatially-equipped statistical inference technique to infer the factual scores of relations. In addition, Sya provides an optimization that ensures scalable grounding and inference for large-scale knowledge bases. Experimental evidence, based on building two real knowledge bases with spatial nature, shows that Sya can achieve 120% higher quality over a state-of-the-art system, namely DeepDive, while achieving at least 30% reduction in the execution times.
R21.3 292 "Fast Query Decomposition for Batch Shortest Path Processing in Road Networks" Lei Li (University of Queensland)*; Mengxuan Zhang (The University of Queensland); Wen Hua (The University of Queensland); Xiaofang Zhou (University of Queensland)
Shortest path query is a fundamental operation in various location-based services (LBS) and most of them process queries on the server side. As the business expands, scalability becomes a severe issue. Instead of simply deploying more servers to cope with the quickly increasing query number, batch shortest path algorithms have been proposed recently to answer a set of queries together using shareable computation. Besides, they can also work in a highly dynamic environment as no index is needed. However, the existing batch algorithms either assume the batch queries are finely decomposed or just process them without differentiation, resulting in poor query efficiency. In this paper, we aim to improve the performance of batch shortest path algorithms by revisiting the problem of query clustering. Specifically, we first propose three query set decomposition methods to cluster queries: \textit{Zigzag} that considers the \textit{1-N} shared computation; \textit{Co-Clustering} that considers the source and target's spatial locality; and \textit{Search-Space-Aware} that further incorporates search space estimation. After that, we propose two batch algorithms that take advantage of the previously decomposed query sets for efficient query answering: \textit{R2R} that finds a set of approximate shortest paths from one region to another with bounded error; and \textit{Local Cache} that improves the existing \textit{Global Cache} with higher cache hit ratio. Experiments on a large real-world query sets verify the effectiveness and efficiency of our decomposition methods compared with the state-of-the-art batch algorithms.
R21.4 314 "Efficient Attribute-Constrained Co-Located Community Search" Jiehuan Luo (SIAT,CAS)*; Xin Cao (University of New South Wales); Xike Xie (University of Science and Technology of China); Qiang Qu (SIAT, China); Zhiqiang Xu (Baidu); Christian S Jensen (Aalborg University)
Networked data, notably social networks data, oftencomes with a rich set of annotations, or attributes, such as doc-uments (e.g., tweets) and locations (e.g., check-ins). Communitysearch in such attributed networks has been studied intensivelydue to its many applications in friends recommendation, eventorganization, advertising, etc. We study the problem of search-ing attributed-constrained co-located community (ACOC), whichreturns a community that satisfies three properties: i) structuralcohesiveness: members in the community are densely connected;ii) spatial co-location: members are close to each other; and iii)attribute constraint: a set of attributes are covered by vertexattributes. The ACOC problem is shown to be NP-hard. Wedevelop four efficient approximation algorithms with guaranteederror bounds and one exact solution to obtain optimal resultson relatively small graphs. Extensive experiments conducted onboth real and synthetic datasets offer insight into the efficiencyand effectiveness of the proposed methods, showing that theyoutperform three adapted state-of-the-art algorithms by an orderof magnitude. We also find that the approximation algorithms aremuch faster than the exact solution and yet offer high accuracy.
R21.5 355 "Indoor Top-k Keyword-aware Routing Query" Zijin Feng (Hong Kong Baptist University); Tiantian Liu (Aalborg University)*; Huan Li (Aalborg University); Hua Lu (Aalborg University); Lidan Shou (Zhejiang University); Jianliang Xu (Hong Kong Baptist University)
People have many activities indoors and there is an increasing demand of keyword-aware route planning for indoor venues. In this paper, we study the indoor top-k keyword-aware routing query (IKRQ). Given two indoor points s and t, an IKRQ returns k s-to-t routes that do not exceed a given distance constraint but have optimal ranking scores integrating keyword relevance and spatial distance. It is challenging to efficiently compute the ranking scores and find the best yet diverse routes in a large indoor space with complex topology. We propose prime routes to diversify top-k routes, devise mapping structures to organize indoor keywords and computing route keyword relevances, and derive pruning rules to reduce search space in routing. With these techniques, we design two search algorithms with different routing expansions. Experiments on synthetic and real data demonstrate the efficiency of our proposals.
R22: Modern Hardware 1 (Thu 23rd April, 11:00-12:30)
Title Authors
R22.1 13 "LATTE: A Native Table Engine on NVMe Storage" Jiajia Chu (East China Normal University); Yunshan Tu (East China Normal University); Yao Zhang (East China Normal University); Chuliang Weng (East China Normal University)*
Most database systems rely on complex multi-layer and compatibility-oriented storage stacks, which results in suboptimal database management system (DBMS) performance and significant write amplification. A heavy storage stack can be tolerated in the slow disk era because its storage overhead is completely overlapped by hardware delay. However, with advances in storage technologies, emerging NVMe devices have reached the same level of latency as software, which in turn has caused the storage stack to become a new bottleneck. On the other hand, NVMe devices not only improve I/O efficiency but also introduce distinctive hardware features that require software modifications to take advantage of them. To fully exploit the hardware potential of NVMe devices, we propose a lightweight native storage stack called Lightstack to minimize the software overhead. The core of Lightstack is an efficient table storage engine, LATTE, which abstracts the essential data service of the database’s 2D table. LATTE is designed from the ground up to use NVMe devices efficiently. It directly accesses NVMe devices to reduce single I/O latency and utilizes a parallel scheduling strategy to leverage multiple deep I/O queues and CPU cores. Besides, undo logging on heterogeneous storage is proposed to mitigate the write amplification further. We also implement a working prototype and evaluate it with standard benchmarks on the Intel Optane DC P4800X NVMe SSD and the DC P3608 Series NVMe SSD. Experimental results show that LATTE has up to 3.6-6.5X the throughput of MySQL’s InnoDB and MyRocks engines, with latency as low as 28% in the same hardware environment.
R22.2 34 Doubleheader Logging: Eliminating Journal Write Overhead for Mobile DBMS Sehyeon Oh (UNIST); Wook-Hee Kim (Sungkyunkwan University)*; Jihye Seo (Naver); Hyeonho Song (UNIST); Sam H. Noh (UNIST); Beomseok Nam (Sungkyunkwan University)
Various transactional systems use out-of-place updates such as logging orcopy-on-write mechanisms to update data in a failure-atomic manner. Suchout-of-place update methods double the I/O traffic due to back-up copies indatabase layer and quadruple the I/O traffic due to the file system journaling.In mobile systems, transaction sizes of mobile apps are known to be tiny and transactions run at low concurrency. For such mobile transactions, legacy out-of-place update methods such as WAL are sub-optimal.In this work, we propose a crash consistent in-place update logging method - doubleheader logging (DHL) for SQLite. DHL prevents previous consistent records from being lost by performing a copy-on-write inside the databasepage and co-locating the metadata-only journal information within the page.This is done, in turn, with minimal sacrifice to page utilization. DHL issimilar to when journaling is disabled, in the sense that it incurs almost noadditional overhead in terms of both I/O and computation. Our experimentalresults show that DHL outperforms other logging methods such as out-of-placeupdate write-ahead logging (WAL) and in-place update multi-version B-tree(MVBT).
R22.3 230 "GSI: GPU-friendly Subgraph Isomorphism" Li Zeng (Peking University)*; Lei Zou (Peking University); Tamer Özsu (Waterloo University); Lin Hu (Peking University); Fan Zhang (Peking University)
Subgraph isomorphism is a well-known \emph{NP-hard} problem that is widely used in many applications, such as social network analysis and query over the knowledge graph. Due to the inherent hardness, its performance is often a bottleneck in various real-world applications. Therefore, we address this by designing an efficient subgraph isomorphism algorithm leveraging features of GPU architecture, such as massive parallelism and memory hierarchy.Existing GPU-based solutions adopt a two-step output scheme, performing the same join process twice in order to write intermediate results concurrently. They also lack GPU architecture-aware optimizations that allow scaling to large graphs. In this paper, we propose a \emph{G}PU-friendly \emph{s}ubgraph \emph{i}somorphism algorithm, \emph{GSI}. Different from existing edge join-based GPU solutions, we propose a \emph{Prealloc-Combine} strategy based on the vertex-oriented framework, which avoids joining-twice in existing solutions. Also, a GPU-friendly data structure (called \emph{PCSR}) is proposed to represent an edge-labeled graph. Extensive experiments on both synthetic and real graphs show that GSI outperforms the state-of-the-art algorithms by up to several orders of magnitude and has good scalability with graph size scaling to hundreds of millions of edges.
R22.4 309 "FPGA-based Compaction Engine for Accelerating LSM-tree Key-Value Stores" Xuan SUN (City University of Hong Kong)*; Jinghuan YU (City University of Hong Kong); Zimeng ZHOU (City University of Hong Kong); Chun Jason XUE (City University of Hong Kong)
With the rapid growing of big data, LSM-tree based key-value stores draw much attention due to its high efficiency in writing performance. However, the compaction plays a critical role in merging old data, which could significantly reduce the overall throughput for the whole system especially for writing-intensive workloads. Benefiting from its high parallelism, hardware acceleration for database has become a popular trend in recent years. In this paper, we design and implement an FPGA-based compaction engine to accelerate data compaction in LSM-tree based key-value stores. To maximize the benefits of the pipeline mechanism on FPGA, the key-value separation and index-data block separation strategies are designed in the proposed solution. In order to improve the compaction performance, the bandwidth width of FPGA-chip is fully utilized. In addition, the proposed acceleration engine is integrated with a classic LSM-tree based store without modifications on the original storage format. The experimental results demonstrate that the proposed FPGA-based compaction engine can improve the throughput of random writing performance by up to 6 times.
R22.5 489 "Getting Swole: Generating Access-aware Code with Predicate Pullups" Andrew Crotty (Brown University)*; Alex Galakatos (Brown University); Tim Kraska (MIT)
For in-memory DBMSs, code generation has become a common technique to accelerate OLAP query processing. Existing code generation strategies seek to produce more streamlined code that uses a wide variety of tricks (e.g., short-circuit evaluation, SIMD processing) to reduce the amount of work performed for each data item. However, we argue that, in many cases, optimizing for data access patterns is actually more important than minimizing data processing effort.We therefore present a new access-aware code generation strategy, called Swole, that reasons about and optimizes for the data access patterns of queries. Contradictory to the traditional wisdom, Swole heavily leverages "predicate pullups" to produce more efficient access patterns, which outweighs the penalties of performing wasted work. Our experiments show that our access-aware code generation strategy is able to outperform existing approaches over a diverse range of queries.
R23: ML & Databases 1 (Thu 23rd April, 11:00-12:30)
Title Authors
R23.1 255 "Video Monitoring Queries" Nick Koudas (University of Toronto)*; Raymond Li (University of Toronto); Ioannis Xarchakos (University of Toronto)
Recent advances in video processing utilizing deep learning primitives achieved breakthroughs in fundamental problems in video analysis such as frame classification and object detection enabling an array of new applications.In this paper we study the problem of interactive declarative query processing on video streams. In particular we introduce a set of approximate filters to speed up queries that involve objects of specific type (e.g., cars, trucks, etc.) on video frames with associated spatial relationships among them (e.g., car left of truck).The resulting filters are able to assess quickly if the query predicates are true to proceed with further analysis of the frame or otherwise not consider the frame further avoiding costly object detection operations.We propose two classes of filters IC and OD, that adapt principles from deep image classification and object detection. The filters utilize extensible deep neural architectures and are easy to deploy and utilize. In addition, we propose statistical query processing techniques to process aggregate queries involvingobjects with spatial constraints on video streams and demonstrate experimentally the resulting increase in accuracy on aggregate estimation. Finally, we introduce a framework based on extreme value theory to detect unexpected objects on video streams and experimentally demonstrate its utility.Combined these techniques constitute a robust set of video monitoring query processing techniques. We demonstrate that the application of the techniques proposed in conjunction with declarative queries on video streams candramatically increase the frame processing rate and speed up query processing by several orders of magnitude.We present the results of a thorough experimental study utilizing benchmark video data sets at scale demonstrating the performance benefits and the practical relevance of our proposals.
R23.2 291 "Reinforcement Learning with Tree-LSTM for Join Order Selection" Xiang Yu (Tsinghua DBgroup); Guoliang Li (Tsinghua University)*; Nan Tang (Qatar Computing Research Institute, HBKU); Chengliang Chai (Tsinghua University)
the problem of finding the optimal join order for an SQL query – is a primary focus of database query optimizers. The problem is hard due to its large solution space. Exhaustively traversing the solution space is prohibitively expensive, which is often combined with heuristic pruning. Despite the decades-long effort, traditional optimizers still suffer from low scalability or low accuracy when handling complicated SQL queries. Recent attempts using deep reinforcement learning (DRL), by encoding join trees with fixed-length hand- tuned feature vectors, have shed some light on JOS. However, using fixed-length feature vectors cannot capture the structural information of a join tree, which may produce poor join plans. Moreover, it may also cause retraining the neural network when handling schema changes (e.g., adding tables/columns) or multi-alias table names that are common in SQL queries.In this paper, we present RTOS, a novel learned optimizer that uses Reinforcement learning with Tree-structured long short- term memory (LSTM) for join Order Selection. RTOS improves existing DRL-based approaches in two main aspects: (1) it adopts graph neural networks to capture the structures of join trees; and (2) it well supports the modification of database schema and multi-alias table names. Extensive experiments on Join Order Benchmark (JOB) and TPC-H show that RTOS outperforms traditional optimizers and existing DRL-based learned optimizers. In particular, the plan RTOS generated for JOB is averagely 101% on (estimated) cost and 67% on latency (i.e., execution time) compared with the state-of-the-art dynamic programs with polynomial time complexity of plan generation.
R23.3 342 "Approximate Query Processing for Data Exploration using Deep Generative Models" Saravanan Thirumuruganathan (QCRI)*; Shohedul Hasan (UTA); Nick Koudas (University of Toronto); Gautam Das (U. of Texas Arlington)
Data is generated at an unprecedented rate surpassing our ability to analyze them. The database community has pioneered many novel techniques for Approximate Query Processing (AQP) that could give approximate results in a fraction of time needed for computing exact results. In this work, we explore the usage of deep learning (DL) for answering aggregate queries specifically for interactive applications such as data exploration and visualization. We use deep generative models, an unsupervised learning based approach, to learn the data distribution faithfully such that aggregate queries could be answered approximately by generating samples from the learned model. The model is often compact – few hundred KBs – so that arbitrary AQP queries could be answered on the client side without contacting the database server. Our other contributions include identifying model bias and minimizing it through a rejection sampling based approach and an algorithm to build model ensembles for AQP for improved accuracy. Our extensive experiments show that our proposed approach can provide answers with high accuracy and low latency.
R23.4 597 "SuRF: Identification of Interesting Data Regions with Surrogate Models" Fotis Savva (University of Glasgow)*; Christos Anagnostopoulos (University of Glasgow); Peter Triantafillou (University of Warwick)
Several data mining tasks focus on repeatedly inspecting multidimensional data regions summarized by a statistic. The value of this statistic (e.g., region-population sizes, order moments) is used to classify the region's interesting-ness. These regions can be naively extracted from the entire dataspace,which is extremely time-consuming and compute-resource demanding. This paper studies the reverse problem: analysts provide a cut-off value for a statistic of interest and in turn our proposed framework efficiently identifies multidimensional regions whose statistic exceeds (or is below) the given cut-off value (according to user's needs). However, as data dimensions and size increase such task inevitably becomes laborious and costly. To alleviate this cost, our solution coined SuRF (SUrrogate Region Finder) leverages historical region evaluations to train surrogate models that learn to approximate the distribution of the statistic of interest. It then makes use of evolutionary multi-modal optimization to effectively and efficiently identify regions of interest regardless of data size and dimensionality. The accuracy, efficiency, and scalability of our approach are demonstrated with experiments using synthetic and real-world datasets and compared with other methods.
R23.5 720 "Two-Level Data Compression using Machine Learning in Time Series Database" xinyang yu (Alibaba); Yanqing Peng (Alibaba)*; Feifei Li (Alibaba); Sheng Wang (Alibaba Group); xiaowei shen (Alibaba); Huijun Mai (Alibaba); yue xie (Alibaba)
R24: Distributed and Parallel Data Management (Thu 23rd April, 11:00-12:30)
Title Authors
R24.1 158 "SeeMoRe: A Fault-Tolerant Protocol for Hybrid Cloud Environments" Mohammad Javad Amiri (University of California, Santa Barbara)*; Sujaya Maiyya (UCSB); Divy Agrawal (University of California, Santa Barbara); Amr El Abbadi (UC Santa Barbara)
Large scale data management systems utilize State Machine Replication to provide fault tolerance and to enhance performance. Fault-tolerant protocols are extensively used in the distributed database infrastructure of large enterprises such as Google, Amazon, and Facebook, as well as permissioned blockchain systems like IBM’s Hyperledger Fabric. However, and in spite of years of intensive research, existing fault-tolerant protocols do not adequately address all the characteristics of distributed system applications. In particular, hybrid cloud environments consisting of private and public clouds are widely used by enterprises. However, fault-tolerant protocols have not been adapted for such environments. In this paper, we introduce SeeMoRe, a hybrid State Machine Replication protocol to handle both crash and malicious failures in a public/private cloud environment. SeeMoRe considers a private cloud consisting of nonmalicious nodes (either correct or crash) and a public cloud with both Byzantine faulty and correct nodes. SeeMoRe has three different modes which can be used depending on the private cloud load and the communication latency between the public and the private cloud. We also introducea dynamic mode switching technique to transition from one mode to another. Furthermore, we evaluate SeeMoRe using a series of benchmarks. The experiments reveal that SeeMoRe’s performance is close to the state of the art crash fault-tolerant protocols while tolerating malicious failures.
R24.2 178 "On Sharding Open Blockchains with Smart Contracts" Yuechen Tao (HKUST)*; Bo Li (Hong Kong University of Science and Technology); Jingjie Jiang (Future Network Theory Lab); Hok Chu Ng (HKUST); Cong Wang (City University of Hong Kong); Baochun Li (University of Toronto)
Current blockchain systems suffer from a number of inherent drawbacks in its scalability, latency, and processing throughput. By enabling parallel confirmations of transactions, sharding has been proposed to mitigate these drawbacks, which usually requires frequent communication among miners through a separate consensus protocol.In this paper, we propose, analyze, and implement a new distributed and dynamic sharding system to substantially improve the throughput of blockchain systems based on smart contracts, while requiring minimum cross-shard communication. Our key observation is that transactions sent by users who only participate in a single smart contract can be validated and confirmed independently without causing double spending. Therefore, a natural formation of a shard is to surround one smart contract to start with. The complication lies in the different sizes of shards being formed, in which a small shard with few transactions tends to generate a large number of empty blocks resulting in a waste of mining power, while a large shard adversely affects parallel confirmations. To overcome this problem, we propose an inter-shard merging algorithm with incentives to encourage small shards to merge with one another and form a larger shard, an intra-shard transaction selection mechanism to encourage miners to select different subsets of transactions for validation, and a parameter unification method to further improve these two algorithms to reduce the communication cost and improve system reliability.We analyze our proposed algorithms using a game theoretic approach, and prove that they converge to a Nash Equilibrium. We also present a security analysis on our sharding design, and prove that it resists adversaries who occupy at most 33% of the computation power. We have implemented our designs on go-Ethereum 1.8.0 and evaluated their performance using both real-world blockchain transactions and large-scale simulations. Our results show that throughput has been improved by 7.2 times, and the number of empty blocks has been reduced by 90%.
R24.3 418 "G-thinker: A Distributed Framework for Mining Subgraphs in a Big Graph" Da Yan (University of Alabama at Birmingham)*; Guimu Guo (The University of Alabama at Birmingham); Md Mashiur Rahman Chowdhury (The University of Alabama at Birmingham); Tamer Özsu (University of Waterloo); Wei-Shinn Ku (Auburn University); John C. S. Lui (The Chinese University of Hong Kong)
Mining from a big graph those subgraphs that satisfy certain conditions is useful in many applications such as community detection and subgraph matching. These problems have a high time complexity, but existing systems to scale them are all IO-bound in execution. We propose the first truly CPU-bound distributed framework called G-thinker that adopts a user-friendly subgraph-centric vertex-pulling API for writing distributed subgraph mining algorithms. To utilize all CPU cores of a cluster, G-thinker features (1) a highly-concurrent vertex cache for parallel task access and (2) a lightweight task scheduling approach that ensures high task throughput. These designs well overlap communication with computation to minimize the CPU idle time. Extensive experiments demonstrate that G-thinker achieves orders of magnitude speedup compared even with the fastest existing subgraph-centric system, and it scales well to much larger and denser real-world network data.
R24.4 516 "DynaMast: Adaptive Dynamic Mastering for Replicated Systems" Michael T Abebe (University of Waterloo)*; Brad Glasbergen (University of Waterloo); Khuzaima Daudjee (University of Waterloo)
Single-master replicated database systems strive to be scalable by offloading reads to replica nodes. However, single-master systems suffer from the performance bottleneck of all updates executing at a single site. Multi-master replicated systems distribute updates among sites but incur costly coordination for multi-site transactions. We present DynaMast, a lazily replicated, multi-master database system that guarantees one-site transaction execution while effectively distributing both reads and updates among multiple sites. DynaMast benefits from these advantages by dynamically transferring the mastership of data, or remastering, among sites using a lightweight metadata-based protocol. DynaMast leverages remastering to intelligently place master copies to balance load and minimize future remastering. Using benchmark workloads, we demonstrate that DynaMast delivers superior performance over existing replicated database system architectures.
R24.5 695 "Fela: Incorporating Flexible Parallelism and Elastic Tuning to Accelerate Large-Scale DML" Jinkun Geng (Tsinghua University)*; Dan Li (Tsinghua University); Shuai Wang (Tsinghua University)
Distributed machine learning (DML) has become the common practice in industry, because of the explosive volume of training data and the growing complexity of training model. Traditional DML follows data parallelism but causes significant communication cost, due to the huge amount of parameter transmission. The recently emerging model-parallel solutions can reduce the communication workload, but leads to load imbalance and serious straggler problems. More importantly, the existing solutions, either data-parallel or model-parallel, ignore the nature of flexible parallelism for most DML tasks, thus failing to fully exploit the GPU computation power. Targeting at these existing drawbacks, we propose Fela, which incorporates both flexible parallelism and elastic tuning mechanism to accelerate DML. In order to fully leverage GPU power and reduce communication cost, Fela adopts hybrid parallelism and uses flexible parallel degrees to train different parts of the model. Meanwhile, Fela designs token-based scheduling policy to elastically tune the workload among different workers, thus mitigating the straggler effect and achieve better load balance. Our comparative experiments show that Fela can significantly improve the training throughput and outperforms the three main baselines (i.e. data-parallel, model-parallel, and hybrid-parallel) by up to 3.23×, 12.22×, and 1.85× respectively.
R25: Temporal and Spatial Data 3 (Thu 23rd April, 13:00-14:30)
Title Authors
R25.1 144 "Sequence-Aware Factorization Machines for Temporal Predictive Analytics" Tong Chen (The University of Queensland); Hongzhi Yin (The University of Queensland)*; Quoc Viet Hung Nguyen (Griffith University); Wen-Chih Peng (National Chiao Tung University); Xue Li (University of Queensland); Xiaofang Zhou (University of Queensland)
In various web applications like targeted advertising and recommender systems, the available categorical features (e.g., product type) are often of great importance but sparse. As a widely adopted solution, models based on Factorization Machines (FMs) are capable of modelling high-order interactions among features for effective sparse predictive analytics. As the volume of web-scale data grows exponentially over time, sparse predictive analytics inevitably involves dynamic and sequential features. However, existing FM-based models assume no temporal orders in the data, and are unable to capture the sequential dependencies or patterns within the dynamic features, impeding the performance and adaptivity of these methods. Hence, in this paper, we propose a novel Sequence-Aware Factorization Machine (SeqFM) for temporal predictive analytics, which models feature interactions by fully investigating the effect of sequential dependencies. As static features (e.g., user gender) and dynamic features (e.g., user interacted items) express different semantics, we innovatively devise a multi-view self-attention scheme that separately models the effect of static features, dynamic features and the mutual interactions between static and dynamic features in three different views. In SeqFM, we further map the learned representations of feature interactions to the desired output with a shared residual network. To showcase the versatility and generalizability of SeqFM, we test SeqFM in three popular application scenarios for FM-based models, namely ranking, classification and regression tasks. Extensive experimental results on six large-scale datasets demonstrate the superior effectiveness and efficiency of SeqFM.
R25.2 548 "Stochastic Origin-Destination Matrix Forecasting Using Dual-Stage Graph Convolutional, Recurrent Neural Networks" Jilin Hu (Inception Institute of Artificial Intellegence); Bin Yang (Aalborg University)*; Chenjuan Guo (Aalborg University); Christian S Jensen (Aalborg University); Hui Xiong (the State University of New Jersey)
Origin-destination (OD) matrices are used widely in transportation and logistics to record the travel cost (e.g., travel speed or greenhouse gas emission) between pairs of OD regions during different intervals within a day. We model a travel cost as a distribution because when traveling between a pair of OD regions, different vehicles may travel at different speeds even during the same interval, e.g., due to different driving styles or different waiting times at intersections. This yields stochastic OD matrices. We consider an increasingly pertinent setting where a set of vehicle trips is used for instantiating OD matrices. Since the trips may not cover all OD pairs for each interval, the resulting OD matrices are likely to be sparse. We then address the problem of forecasting complete, near future OD matrices from sparse, historical OD matrices. To solve this problem, we propose a generic learning framework that (i) employs matrix factorization and graph convolutional neural networks to contend with the data sparseness while capturing spatial correlations and that (ii) captures spatio-temporal dynamics via recurrent neural networks extended with graph convolutions. Empirical studies using two taxi trajectory data sets offer detailed insight into the properties of the framework and indicate that it is effective.
R25.3 598 "Query Results over Ongoing Databases that Remain Valid as Time Passes By" Yvonne Mülle (University of Zurich)*; Michael H Böhlen (University of Zurich)
Ongoing time point now is used to state that a tuple is valid from the start point onward. For database systems ongoing time points have far-reaching implications since they change continuously as time passes by. State-of-the-art approaches deal with ongoing time points by instantiating them to the reference time. The instantiation yields query results that are only valid at the chosen time and get invalidated by time passing by.We propose a solution that keeps ongoing time points uninstantiated during query processing. We do so by evaluating predicates and functions at all possible reference times. This renders query results independent of a specific reference time and yields results that remain valid as time passes by. As query results, we propose ongoing relations that include a reference times attribute. The value of the reference times attribute is restricted by predicates and functions on ongoing attributes. We describe and evaluate an efficient implementation of ongoing data types and operations in PostgreSQL.
R25.4 739 "Indoor Mobility Semantics Annotation Using Coupled Conditional Markov Networks" Huan Li (Aalborg University)*; Hua Lu (Aalborg University); Muhammad Aamir Cheema (Monash University); Lidan Shou (Zhejiang University); Gang Chen (Zhejiang University)
Indoor mobility semantics analytics can greatly benefit many pertinent applications. Existing semantic annotation methods mainly focus on outdoor space and require extra knowledge such as POI category or human activity regularity. However, these conditions are difficult to meet in indoor venues with relatively small extents but complex topology. This work studies the annotation of indoor mobility semantics that describe an object’s mobility event (what) at a semantic indoor region (where) during a time period (when). A coupled conditional Markov network (C2MN) is proposed with a set of feature functions carefully designed according to the characteristics of indoor topology and mobility behaviors. C2MN is able to capture the probabilistic dependencies among positioning records, semantic regions, and mobility events jointly. Nevertheless, the correlation of semantic regions and mobility events complicates the learning of relevant parameters for C2MN. Therefore, we devise an alternate learning algorithm to estimate the optimal parameters of the two variables. The results of extensive experiments demonstrate that our C2MN-based semantic annotation is efficient and effective on both real and synthetic indoor mobility data.
R26: Modern Hardware 2 (Thu 23rd April, 13:00-14:30)
Title Authors
R26.1 493 "Towards Concurrent Stateful Stream Processing on Multicore Processors" Shuhao Zhang (National University of Singapore)*; Yingjun Wu (Amazon Web Services); Feng Zhang (Renmin University of China); Bingsheng He (National University of Singapore)
Recent data stream processing systems (DSPSs) can achieve excellent performance when processing large volumes of streaming data under tight latency constraints. However, one potential limitation is their inadequate support for concurrent state access, where shared mutable application states may be concurrently accessed and state consistency needs to be guaranteed by the system. Several recent works propose to manage concurrent state access during stream processing via employing transactional schematics, where state access operations are modelled as transactions. However, they are still mostly based on locks involving serious contention overhead. Their coarse-grained processing paradigms further intensify contention and poorly utilizing modern multicore architectures in many cases. This paper introduces TStream, a novel DSPS supporting efficient concurrent state access on multicore processors. It follows previous work of employing transactional schematics to manage concurrent state access, but achieves higher scalability via two novel designs, including 1) dual-mode scheduling, which exposes many parallelism opportunities, and 2) dynamic restructuring execution, which aggressively exploits parallelism opportunities without centralized lock contentions. To confirm the effectiveness of our proposal, we evaluate TStream with a benchmark of four applications on a modern multicore machine. The experimental results show that 1) TStream achieves up to 4.8 times higher throughput with similar processing latency compared to the state-of-the-art; 2) differ from prior solutions, TStream is highly tolerant to varying application workloads such as key skewness and multi-partition transactions.
R26.2 515 "FESIA: A Fast and Efficient Set Intersection Approach on Modern CPUs" Jiyuan Zhang (CMU)*; Yi Lu (MIT); Daniele Giuseppe Spampinato (Carnegie Mellon University); Franz Franchetti (Carnegie Mellon University)
Set intersection is an important operation and widely used in both database and graph analytics applications. However, existing state-of-the-art set intersection methods only consider the size of input sets and fail to optimize for the case in which the intersection size is small. In real-world scenarios, the size of most intersections is usually orders of magnitude smaller than the size of the input sets, e.g., keyword search in databases and common neighbor search in graph analytics. In this paper, we present FESIA, a new set intersection approach on modern CPUs. The time complexity of our approach is $O(n/\sqrt{w} + r)$, in which $w$ is the SIMD width, and $n$ and $r$ are the size of input sets and intersection size, respectively. The key idea behind FESIA is that it first uses bitmaps to filter out unmatched elements from the input sets, and then selects suitable specialized kernels at runtime to compute the final intersection on each pair of bitmap segments. In addition, all data structures in FESIA are designed to take advantage of SIMD instructions provided by vector ISAs with various SIMD widths, including SSE, AVX, and the latest AVX-512. Our experiments on both real-world and synthetic datasets show that our intersection method achieves more than a order of magnitude better performance than conventional scalar implementations, and up to 4x better performance than state-of-the-art SIMD implementations.
R26.3 715 "Low-Latency Communication for Fast DBMS Using RDMA and Shared Memory" Philipp Fent (TUM)*; Alexander van Renen (TUM); Andreas Kipf (Technical University of Munich); Viktor Leis (Friedrich Schiller University Jena); Thomas Neumann (TUM); Alfons Kemper (TUM)
While hardware and software improvements greatly accelerated modern database systems' internal operations, the decades-old stream-based Socket API for external communication is still unchanged. We show experimentally, that for modern high-performance systems networking has become a performance bottleneck. Therefore, we argue that the communication stack needs to be redesigned to fully exploit modern hardware - as has already happened to most other database system components.We propose L5, a high-performance communication layer for database systems. L5 rethinks the flow of data in and out of the database system and is based on direct memory access techniques for intra-datacenter (RDMA) and intra-machine communication (Shared Memory). With L5, we provide a building block to accelerate ODBC-like interfaces with a unified and message-based communication framework. Our results show that using interconnects like RDMA (InfiniBand), RoCE (Ethernet), and Shared Memory (IPC), L5 can largely eliminate the network bottleneck for database systems.
R27: ML & Databases 2 (Thu 23rd April, 13:00-14:30)
Title Authors
R27.1 766 "ML-based Cross-Platform Query Optimization" Zoi Kaoudi (TU Berlin)*; Jorge-Arnulfo Quiane-Ruiz (TU Berlin); Bertty Contreras-Rojas (QCRI); Rodrigo Pardo-Meza (QCRI); Anis Troudi (QCRI); Sanjay Chawla (QCRI)
Cost-based optimization is widely known to suffer from a major weakness: administrators spend a significant amount of time to tune the associated cost models. This problem only gets exacerbated in cross-platform settings as there is little control on the underlying data processing platforms and there are many more parameters that need to be tuned. In the era of machine learning (ML), the first step to remedy this problem is to replace the cost model of the optimizer with an ML model. However, such a solution brings in two major challenges. First, the optimizer has to transform a query plan to a vector million times during plan enumeration incurring a very high overhead. Second, a lot of training data is required to effectively train the ML model.We overcome these challenges in Robopt, a novel vector-based optimizer we have built for Rheem, our cross-platform system. Robopt not only uses an ML model to prune the search space but also bases the entire plan enumeration on a set of algebraic operations that operate on vectors, which are a natural fit to the ML model. In this way, it avoids the high cost of transforming a query plan into a vector each time the ML model is invoked. This leads to both speed-up and scale-up of the enumeration process by exploiting modern CPUs via vectorization. In addition, Robopt is accompanied with a scalable training data generator for building its ML model. Our evaluation shows that (i)~the vector-based approach is more efficient and scalable than simply using an ML model and (ii)~Robopt matches and, in some cases, improves Rheem's cost-based optimizer in choosing good plans without requiring any tuning effort.
R27.2 858 "Automatic View Generation with Deep Learning and Reinforcement Learning" Haitao Yuan (Tsinghua University); Guoliang Li (Tsinghua University)*; Ling Feng (Tsinghua university); Ji Sun (Tsinghua University); Yue Han (Tsinghua University)
Materializing views is an important method to alleviate redundant computation in DBMS, especially for processing large scale analytical queries. However, many existing methods still need DBAs to manually generate materialized views, which is unfriendly for ordinary users. To address this issue, we propose an automatic view generation method which judiciously selects some ``highly beneficial'' subqueries to generate materialized views. However, we need to face two challenges: (1) how to estimate the benefit of using a materialized view for a query; (2) how to select optimal subqueries to generate materialized views. For resolving the first challenge, we propose a neural network based method to estimate the benefit of using a materialized view for a query. In particular, we extract significant features from different perspectives and design effective encoding models to transform these features into hidden representations. For resolving the second challenge, we rewrite it to an ILP (Integer Linear Programming) problem, whose target is to maximize the utility by selecting optimal subqueries to materialize. To efficiently solve the problem, we design an iterative optimization method, which selects subqueries to materialize by iteratively computing associated probabilities. However, this method cannot guarantee the convergence of the solution. To address this issue, we regard the iterative optimization process as an MDP (Markov Decision Process) and then use the deep reinforcement learning model DQN (Deep Q-Network) to solve the problem. Extensive experiments on real datasets show that our methods outperform existing solutions significantly.
R27.3 33 "ColumnSGD: A Column-oriented Framework for Distributed Stochastic Gradient Descent" Zhipeng Zhang (Peking University)*; Wentao Wu (Microsoft Research); Jiawei Jiang (Tencent); Lele Yu (Tencent Inc.); Bin Cui (Peking University); Ce Zhang (ETH)
Distributed machine learning (ML) has triggered tremendous research interest in recent years. Stochastic gradient descent (SGD) is one of the most popular algorithms for training ML models, and has been implemented in almost alldistributed ML systems, such as Spark MLlib, Petuum, MXNet, and TensorFlow. However, current implementations often incur huge communication and memory overheads when it comes to large models. One important reason for this inefficiency is the roworiented scheme (RowSGD) that existing systems use to partition the training data, which forces them to adopt a centralized model management strategy that leads to vast amount of data exchange over the network.We propose a novel, column-oriented scheme (ColumnSGD) that partitions training data by columns rather than by rows. As a result, ML model can be partitioned by columns as well, leading to a distributed configuration where individual data and model partitions can be collocated on the same machine. Following this locality property, we develop a simple yet powerful computation framework that significantly reduces communication overheads and memory footprints compared to RowSGD, for large-scale ML models such as generalized linear models (GLMs) and factorization machines (FMs). We implement ColumnSGD on top of Apache Spark, and study its performance both analyticallyand experimentally. Experimental results on both public and realworld datasets show that ColumnSGD is up to 930× faster than MLlib, 63× faster than Petuum, and 14× faster than MXNet.
R27.4 164 "In-database connected component analysis" Harald Bögeholz (Monash University)*; Michael Brand; Radu-Alexandru Todor
We describe a Big Data-practical, SQL-implementable algorithm for efficiently determining connected components for graph data stored in a Massively Parallel Processing (MPP) relational database. The algorithm described is a linear-space, randomised algorithm, always terminating with the correct answer but subject to a stochastic running time, such that for any $\epsilon>0$ and any input graph $G=\langle V, E \rangle$ the algorithm terminates after $\mathop{\text{O}}(\log |V|)$ SQL queries with probability of at least $1-\epsilon$, which we show empirically to translate to a quasi-linear runtime in practice.
R27.5 568 "Towards Factorized SVM with Gaussian Kernels over Normalized Data" Keyu Yang (Zhejiang University); Yunjun Gao ("Zhejiang University, China")*; Lei Liang (Zhejiang University); Bin Yao ("Shanghai Jiaotong University, China"); Shi-ting Wen (Zhejiang University); Gang Chen (Zhejiang University)
There is an emerging trend of integrating machine learning (ML) techniques into database systems (DB). Considering that almost all the ML toolkits assume that the input of ML algorithms is a single table even though many real-world datasets are stored as multiple tables due to normalization. Thus, data scientists have to perform joins before learning a ML model. This strategy is called learning after joins, which incurs redundancy avoided by normalization. In the area of ML, the Support Vector Machine (SVM) is one of the most standard classification tools. In this paper, we focus on the factorized SVM with gaussian kernels over normalized data. We present factorized learning approaches for two main SVM optimization methods, i.e., Gradient Descent (GD) and Sequential Minimal Optimization (SMO), by factorizing gaussian kernel function computation. Furthermore, we transform the normalized data into matrices, and boost the efficiency of SVM learning via linear algebra operations. Extensive experiments with nine real normalized data sets demonstrate the efficiency and scalability of our proposed approaches.