Collecting Telemetry Data Privately [article]

Bolin Ding, Janardhan Kulkarni, Sergey Yekhanin
2017 arXiv   pre-print
The collection and analysis of telemetry data from users' devices is routinely performed by many software companies. Telemetry collection leads to improved user experience but poses significant risks to users' privacy. Locally differentially private (LDP) algorithms have recently emerged as the main tool that allows data collectors to estimate various population statistics, while preserving privacy. The guarantees provided by such algorithms are typically very strong for a single round of telemetry collection, but degrade rapidly when telemetry is collected regularly. In particular, existing LDP algorithms are not suitable for repeated collection of counter data such as daily app usage statistics. In this paper, we develop new LDP mechanisms geared towards repeated collection of counter data, with formal privacy guarantees even after being executed for an arbitrarily long period of time. For two basic analytical tasks, mean estimation and histogram estimation, our LDP mechanisms for repeated data collection provide estimates with comparable or even the same accuracy as existing single-round LDP collection mechanisms. We conduct empirical evaluation on real-world counter datasets to verify our theoretical results. Our mechanisms have been deployed by Microsoft to collect telemetry across millions of devices.
arXiv:1712.01524v1 fatcat:n2aggferrrabhpcinxm6wagza4
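
To make the single-round setting concrete, here is a minimal Python sketch of a one-bit randomized-response mechanism for mean estimation under ε-LDP. The function names, the counter bound m, and the parameter eps are assumptions introduced for illustration. The sketch covers one round of collection only and is not the repeated-collection mechanism the abstract refers to.

```python
import math
import random

def report_probability(x, m, eps):
    """Probability that a device holding counter x in [0, m] reports bit 1."""
    e = math.exp(eps)
    return 1.0 / (e + 1.0) + (x / m) * (e - 1.0) / (e + 1.0)

def privatize(x, m, eps, rng):
    """Device side: release a single randomized bit instead of the raw counter."""
    return 1 if rng.random() < report_probability(x, m, eps) else 0

def estimate_mean(bits, m, eps):
    """Collector side: debias the reported bits into an unbiased mean estimate."""
    e = math.exp(eps)
    n = len(bits)
    return (m / n) * sum((b * (e + 1.0) - 1.0) / (e - 1.0) for b in bits)

if __name__ == "__main__":
    rng = random.Random(0)                                  # seeded for repeatability
    counters = [rng.uniform(0, 10) for _ in range(100000)]  # hypothetical data
    bits = [privatize(x, 10, 1.0, rng) for x in counters]
    print(sum(counters) / len(counters), estimate_mean(bits, 10, 1.0))
```

For any two counter values, the probabilities of reporting the same bit differ by a factor of at most e^eps, which is the ε-LDP condition for one report. Sending a fresh bit every day would spend that budget repeatedly, which is the degradation under regular collection that the abstract describes.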

Competitive Information Disclosure with Multiple Receivers [article]

Bolin Ding, Yiding Feng, Chien-Ju Ho, Wei Tang
2021 arXiv   pre-print
This paper analyzes a model of competition in Bayesian persuasion in which two symmetric senders vie for the patronage of multiple receivers by disclosing information about the qualities (i.e., binary state, high or low) of their respective proposals. Each sender is allowed to commit to a signaling policy where he sends a private (possibly correlated) signal to every receiver. The sender's utility is a monotone set function of the receivers who patronize this sender. We characterize the equilibrium structure and show that the equilibrium is not unique (even for simple utility functions). We then focus on the price of stability (PoS) in the game of two senders: the ratio between the best of senders' welfare (i.e., the sum of two senders' utilities) in one of its equilibria and that of an optimal outcome. When senders' utility function is anonymous submodular or anonymous supermodular, we analyze the relation between PoS and the ex ante quality λ (i.e., the probability of high quality) and the submodularity or supermodularity of the utility functions. In particular, for both families of utility functions, we show that PoS = 1 when the ex ante quality λ is weakly smaller than 1/2, that is, there exists an equilibrium that can achieve the welfare of the optimal outcome. On the other hand, we also prove that PoS > 1 when the ex ante quality λ is larger than 1/2, that is, there exists no equilibrium that can achieve the welfare of the optimal outcome. We also derive the upper bound of PoS as a function of λ and the properties of the value function. Our analysis indicates that the upper bound becomes worse as the ex ante quality λ increases or the utility function becomes more supermodular (resp. submodular).
arXiv:2103.03769v1 fatcat:ium6g2utrvch3bgigwvelvsxbe

Automated Relational Meta-learning [article]

Huaxiu Yao, Xian Wu, Zhiqiang Tao, Yaliang Li, Bolin Ding, Ruirui Li, Zhenhui Li
2020 arXiv   pre-print
In order to learn efficiently with a small amount of data on new tasks, meta-learning transfers knowledge learned from previous tasks to the new ones. However, a critical challenge in meta-learning is task heterogeneity, which cannot be well handled by traditional globally shared meta-learning methods. In addition, current task-specific meta-learning methods may either suffer from hand-crafted structure design or lack the capability to capture complex relations between tasks. In this paper, motivated by the way knowledge is organized in knowledge bases, we propose an automated relational meta-learning (ARML) framework that automatically extracts the cross-task relations and constructs the meta-knowledge graph. When a new task arrives, it can quickly find the most relevant structure and tailor the learned structure knowledge to the meta-learner. As a result, the proposed framework not only addresses the challenge of task heterogeneity by a learned meta-knowledge graph, but also increases the model interpretability. We conduct extensive experiments on 2D toy regression and few-shot image classification and the results demonstrate the superiority of ARML over state-of-the-art baselines.
arXiv:2001.00745v1 fatcat:ky673vwxsjfgrftut3aqwvyphy

Simple and Deep Graph Convolutional Networks [article]

Ming Chen, Zhewei Wei, Zengfeng Huang, Bolin Ding, Yaliang Li
2020 arXiv   pre-print
Graph convolutional networks (GCNs) are a powerful deep learning approach for graph-structured data. Recently, GCNs and subsequent variants have shown superior performance in various application areas on real-world datasets. Despite their success, most of the current GCN models are shallow, due to the over-smoothing problem. In this paper, we study the problem of designing and analyzing deep graph convolutional networks. We propose GCNII, an extension of the vanilla GCN model with two simple yet effective techniques: initial residual and identity mapping. We provide theoretical and empirical evidence that the two techniques effectively relieve the problem of over-smoothing. Our experiments show that the deep GCNII model outperforms the state-of-the-art methods on various semi- and full-supervised tasks. Code is available at https://github.com/chennnM/GCNII.
arXiv:2007.02133v1 fatcat:euxtywpbm5cb7jwwnolew6qtu4
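
The two techniques named in the abstract above are commonly written as a single layer update. The formula below is a sketch of that update as it is usually stated. The symbols are assumptions, since the abstract defines none of them: \tilde{P} is the normalized adjacency matrix with self-loops, H^{(0)} is the initial representation, W^{(\ell)} is the layer weight matrix, and \alpha_\ell, \beta_\ell are hyperparameters.

```latex
H^{(\ell+1)} = \sigma\Big( \big( (1-\alpha_\ell)\,\tilde{P}\,H^{(\ell)} + \alpha_\ell\,H^{(0)} \big)
               \big( (1-\beta_\ell)\,I + \beta_\ell\,W^{(\ell)} \big) \Big)
```

The \alpha_\ell H^{(0)} term is the initial residual: every layer keeps a fraction of the input features. The (1-\beta_\ell) I term is the identity mapping: it keeps the weight matrix close to the identity as depth grows.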

A Statistical Approach Towards Robust Progress Estimation [article]

Arnd Christian König, Bolin Ding, Surajit Chaudhuri, Vivek Narasayya
2011 arXiv   pre-print
The need for accurate SQL progress estimation in the context of decision support administration has led to a number of techniques proposed for this task. Unfortunately, no single one of these progress estimators behaves robustly across the variety of SQL queries encountered in practice, meaning that each technique performs poorly for a significant fraction of queries. This paper proposes a novel estimator selection framework that uses a statistical model to characterize the sets of conditions under which certain estimators outperform others, leading to a significant increase in estimation robustness. The generality of this framework also enables us to add a number of novel "special purpose" estimators which increase accuracy further. Most importantly, the resulting model generalizes well to queries very different from the ones used to train it. We validate our findings using a large number of industrial real-life and benchmark workloads.
arXiv:1201.0234v1 fatcat:tyeimoc3gfgvrc72lblscw2oty

Practical Data Poisoning Attack against Next-Item Recommendation [article]

Hengtong Zhang, Yaliang Li, Bolin Ding, Jing Gao
2020 arXiv   pre-print
Online recommendation systems make use of a variety of information sources to provide users with items they are potentially interested in. However, due to the openness of the online platform, recommendation systems are vulnerable to data poisoning attacks. Existing attack approaches are either based on simple heuristic rules or designed against specific recommendation approaches. The former often suffer from unsatisfactory performance, while the latter require strong knowledge of the target system. In this paper, we focus on a general next-item recommendation setting and propose a practical poisoning attack approach named LOKI against black-box recommendation systems. The proposed LOKI uses a reinforcement learning algorithm to train the attack agent, which can be used to generate user behavior samples for data poisoning. In real-world recommendation systems, the cost of retraining recommendation models is high, and the interaction frequency between users and a recommendation system is restricted. Given these real-world restrictions, we propose to let the agent interact with a recommender simulator instead of the target recommendation system and leverage the transferability of the generated adversarial samples to poison the target system. We also propose to use the influence function to efficiently estimate the influence of injected samples on the recommendation results, without re-training the models within the simulator. Extensive experiments on two datasets against four representative recommendation models show that the proposed LOKI achieves better attacking performance than existing methods.
arXiv:2004.03728v1 fatcat:xscx7fkeqjgppcqhoj4vmoqeja

Swarm

Zhenhui Li, Bolin Ding, Jiawei Han, Roland Kays
2010 Proceedings of the VLDB Endowment  
Recent improvements in positioning technology make massive moving object data widely available. One important analysis is to find the moving objects that travel together. Existing methods impose a strong constraint in defining a moving object cluster: they require the moving objects to stick together for consecutive timestamps. Our key observation is that the moving objects in a cluster may actually diverge temporarily and congregate at certain timestamps. Motivated by this, we propose the concept of swarm, which captures the moving objects that move within clusters of arbitrary shape for certain timestamps that are possibly nonconsecutive. The goal of our paper is to find all discriminative swarms, namely closed swarms. While the search space for closed swarms is prohibitively huge, we design a method, ObjectGrowth, to efficiently retrieve the answer. In ObjectGrowth, two effective pruning strategies are proposed to greatly reduce the search space and a novel closure checking rule is developed to report closed swarms on-the-fly. Empirical studies on the real data as well as large synthetic data demonstrate the effectiveness and efficiency of our methods.
doi:10.14778/1920841.1920934 fatcat:lo64j5suxzgzvoxrshdlwuzo64

Comparing Population Means under Local Differential Privacy: with Significance and Power [article]

Bolin Ding, Harsha Nori, Paul Li, Joshua Allen
2018 arXiv   pre-print
... mean/density estimations (Duchi, Wainwright, and Jordan 2016; Ding, Kulkarni, and Yekhanin 2017), and histogram estimations (Kairouz, Bonawitz, and Ramage 2016; Wang et al. 2016; Wang, Wu, and Hu 2016 ...
arXiv:1803.09027v1 fatcat:dllkflxrprhb5luiwtofgaej5e

Fast set intersection in memory

Bolin Ding, Arnd Christian König
2011 Proceedings of the VLDB Endowment  
Set intersection is a fundamental operation in information retrieval and database systems. This paper introduces linear space data structures to represent sets such that their intersection can be computed in a worst-case efficient way. In general, given k (preprocessed) sets, with n elements in total, we will show how to compute their intersection in expected time O(n/√w + kr), where r is the intersection size and w is the number of bits in a machine word. In addition, we introduce a very simple version of this algorithm that has weaker asymptotic guarantees but performs even better in practice; both algorithms outperform the state-of-the-art techniques for both synthetic and real data sets and workloads.
doi:10.14778/1938545.1938550 fatcat:rgb5c6lmivcurdl6xiqh6bwqna
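
As a rough illustration of how preprocessing and word-level operations can speed up intersection, the following Python sketch partitions each set into hash groups and keeps a word-sized bit signature per group. A pair of groups is compared only when the signatures share a bit. This is a simplified stand-in written for this listing, not the data structure from the paper, and it does not achieve the stated bound. The names preprocess and intersect and the constant W are assumptions.

```python
W = 64  # assumed machine-word width in bits

def preprocess(values, num_groups):
    """Partition a set into hash groups and keep one W-bit signature per group."""
    groups = [[] for _ in range(num_groups)]
    sigs = [0] * num_groups
    for v in values:
        h = hash(v)
        g = h % num_groups
        groups[g].append(v)
        # Set one bit of the group's signature, chosen from the remaining hash bits.
        sigs[g] |= 1 << ((h // num_groups) % W)
    return groups, sigs

def intersect(a, b):
    """Intersect two preprocessed sets built with the same num_groups."""
    (groups_a, sigs_a), (groups_b, sigs_b) = a, b
    out = []
    for ga, sa, gb, sb in zip(groups_a, sigs_a, groups_b, sigs_b):
        if sa & sb == 0:
            continue  # no shared signature bit, so the groups share no element
        members = set(gb)
        out.extend(v for v in ga if v in members)
    return out

if __name__ == "__main__":
    a = preprocess({1, 5, 9, 100, 257}, 8)
    b = preprocess({5, 100, 3, 8}, 8)
    print(sorted(intersect(a, b)))
```

A common element always lands in the same group and sets the same signature bit in both sets, so the bitwise AND never filters out a true match. It only skips group pairs that cannot intersect.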

S4

Fotis Psallidas, Bolin Ding, Kaushik Chakrabarti, Surajit Chaudhuri
2015 Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data - SIGMOD '15  
... hundreds to thousands of tables; ... tens to thousands of columns per table; ... and numerous relations. Background: An information worker has no exact knowledge of the database and its underlying schema, seeks queries to cover information needs, and spends a lot of time going over the database schema and manually discovering the queries needed. Q: How to help the information worker discover the queries of interest? Observation: The information worker knows a few example tuples that should be present in the output of the queries. Enter example tuples that should be in the output of the desired PJ query; the system replies with the set of PJ queries relevant to the example tuples. Contributions
doi:10.1145/2723372.2749452 dblp:conf/sigmod/PsallidasDCC15 fatcat:ypfirr5h7bfr3fkf3ak7h2co4i

Approximate Query Processing

Surajit Chaudhuri, Bolin Ding, Srikanth Kandula
2017 Proceedings of the 2017 ACM International Conference on Management of Data - SIGMOD '17  
In this paper, we reflect on the state of the art of Approximate Query Processing. Although much technical progress has been made in this area of research, we are yet to see its impact on products and services. We discuss two promising avenues to pursue towards integrating Approximate Query Processing into data platforms.
doi:10.1145/3035918.3056097 dblp:conf/sigmod/ChaudhuriDK17 fatcat:x2lmtvyhbfdfhpohqlyd23pffq

Contrastive Learning for Sequential Recommendation [article]

Xu Xie, Fei Sun, Zhaoyang Liu, Shiwen Wu, Jinyang Gao, Bolin Ding, Bin Cui
2021 arXiv   pre-print
Sequential recommendation methods play a crucial role in modern recommender systems because of their ability to capture a user's dynamic interest from her/his historical interactions. Despite their success, we argue that these approaches usually rely on the sequential prediction task to optimize their huge numbers of parameters. They usually suffer from the data sparsity problem, which makes it difficult for them to learn high-quality user representations. To tackle this, inspired by recent advances in contrastive learning techniques in computer vision, we propose a novel multi-task model called Contrastive Learning for Sequential Recommendation (CL4SRec). CL4SRec not only takes advantage of the traditional next item prediction task but also utilizes the contrastive learning framework to derive self-supervision signals from the original user behavior sequences. Therefore, it can extract more meaningful user patterns and further encode the user representation effectively. In addition, we propose three data augmentation approaches to construct self-supervision signals. Extensive experiments on four public datasets demonstrate that CL4SRec achieves state-of-the-art performance over existing baselines by inferring better user representations.
arXiv:2010.14395v2 fatcat:2pissecqs5dopo5nderaunptyq

Sample + Seek

Bolin Ding, Silu Huang, Surajit Chaudhuri, Kaushik Chakrabarti, Chi Wang
2016 Proceedings of the 2016 International Conference on Management of Data - SIGMOD '16  
Data volumes are growing exponentially for our decision-support systems, making it challenging to ensure interactive response time for ad-hoc queries without increasing the cost of hardware. Aggregation queries with Group By that produce an aggregate value for every combination of values in the grouping columns are the most important class of ad-hoc queries. As small errors are usually tolerable for such queries, approximate query processing (AQP) has the potential to answer them over very large datasets much faster. In many cases analysts require the distribution of (group, aggvalue) pairs in the estimated answer to be guaranteed within a certain error threshold of the exact distribution. Existing AQP techniques are inadequate for two main reasons. First, users cannot express such guarantees. Second, sampling techniques used in traditional AQP can produce arbitrarily large errors even for SUM queries. To address those limitations, we first introduce a new precision metric, called distribution precision, to express such error guarantees. We then study how to provide fast approximate answers to aggregation queries with distribution precision guaranteed within a user-specified error bound. The main challenges are to provide rigorous error guarantees and to handle arbitrary highly selective predicates without maintaining large-sized samples. We propose a novel sampling scheme called measure-biased sampling to address the former challenge. For the latter, we propose two new indexes to augment in-memory samples. Like other sampling-based AQP techniques, our solution supports any aggregate that can be estimated from random samples. In addition to deriving theoretical guarantees, we conduct an experimental study to compare our system with state-of-the-art AQP techniques and a commercial column-store database system on both synthetic and real enterprise datasets. Our system provides a median speed-up of more than 100x with around 5% distribution error compared with the commercial database.
doi:10.1145/2882903.2915249 dblp:conf/sigmod/DingHCC016 fatcat:ozu2b2nw2fg7pemm26nbv72oqu
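
One way to read "measure-biased sampling" for SUM queries is to draw rows with probability proportional to their measure value, so that each sampled row stands for an equal share of the total. The Python sketch below follows that reading. The function names, the (group, measure) row layout, and the with-replacement draw are assumptions for illustration, not details confirmed by the abstract.

```python
import random
from collections import Counter

def measure_biased_sample(rows, k, rng):
    """Draw k rows with replacement, each with probability proportional to its measure.

    rows is a list of (group, measure) pairs with non-negative measures.
    """
    weights = [measure for _, measure in rows]
    return rng.choices(rows, weights=weights, k=k)

def estimate_group_sums(sample, total_measure):
    """Estimate SUM(measure) per group: every sampled row represents total/k."""
    counts = Counter(group for group, _ in sample)
    k = len(sample)
    return {g: total_measure * c / k for g, c in counts.items()}

if __name__ == "__main__":
    rows = [("a", 90.0), ("b", 9.0), ("c", 1.0)]   # hypothetical table
    total = sum(m for _, m in rows)
    sample = measure_biased_sample(rows, 1000, random.Random(7))
    print(estimate_group_sums(sample, total))
```

Because the estimates are counts scaled by a constant, they always add up to the exact total, and a single very large row cannot throw off a group's estimate the way it can under uniform row sampling.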

Towards Differentially Private Truth Discovery for Crowd Sensing Systems [article]

Yaliang Li, Houping Xiao, Zhan Qin, Chenglin Miao, Lu Su, Jing Gao, Kui Ren, Bolin Ding
2018 arXiv   pre-print
Crowd sensing is becoming increasingly popular due to the ubiquitous usage of mobile devices. However, the quality of such human-generated sensory data varies significantly among different users. To better utilize sensory data, the problem of truth discovery, whose goal is to estimate user quality and infer reliable aggregated results through quality-aware data aggregation, has emerged as a hot topic. Although the existing truth discovery approaches can provide reliable aggregated results, they fail to protect the private information of individual users. Moreover, crowd sensing systems typically involve a large number of participants, making encryption or secure multi-party computation based solutions difficult to deploy. To address these challenges, in this paper, we propose an efficient privacy-preserving truth discovery mechanism with theoretical guarantees of both utility and privacy. The key idea of the proposed mechanism is to perturb data from each user independently and then conduct weighted aggregation among users' perturbed data. The proposed approach is able to assign user weights based on information quality, and thus the aggregated results will not deviate much from the true results even when large noise is added. We adapt the definition of local differential privacy to this privacy-preserving task and demonstrate that the proposed mechanism can satisfy local differential privacy while preserving high aggregation accuracy. We formally quantify the utility and privacy trade-off and further verify the claim by experiments on both synthetic data and a real-world crowd sensing system.
arXiv:1810.04760v1 fatcat:e477ktq5zjezfnhyyzcsjwijsm

Glue: Adaptively Merging Single Table Cardinality to Estimate Join Query Size [article]

Rong Zhu, Tianjing Zeng, Andreas Pfadler, Wei Chen, Bolin Ding, Jingren Zhou
2021 arXiv   pre-print
Cardinality estimation (CardEst), a central component of the query optimizer, plays a significant role in generating high-quality query plans in DBMS. The CardEst problem has been extensively studied in the last several decades, using both traditional and ML-enhanced methods. However, the hardest problem in CardEst, i.e., how to estimate the join query size on multiple tables, has not been adequately solved. Current methods either rely on independence assumptions or apply techniques that impose a heavy burden, and their performance is still far from satisfactory. Even worse, existing CardEst methods are often designed to optimize one goal, i.e., inference speed or estimation accuracy, and cannot adapt to different scenarios. In this paper, we propose a very general framework, called Glue, to tackle these challenges. Its key idea is to elegantly decouple the correlations across different tables and losslessly merge single table CardEst results to estimate the join query size. Glue supports obtaining the single table-wise CardEst results using any existing CardEst method and can process any complex join schema. Therefore, it easily adapts to different scenarios having different performance requirements, i.e., OLTP with fast estimation time or OLAP with high estimation accuracy. Meanwhile, we show that Glue can be seamlessly integrated into the plan search process and is able to support counting the number of distinct values. All these properties exhibit the potential advantages of deploying Glue in real-world DBMS.
arXiv:2112.03458v1 fatcat:zsw6tzgbafbbbkduj2tjoefcyq