Provenance for Interactive Visualizations [article]

Fotis Psallidas, Eugene Wu
2018 arXiv   pre-print
We highlight the connections between data provenance and interactive visualizations. To do so, we first incrementally add interactions to a visualization and show how these interactions are readily expressible in terms of provenance. We then describe how an interactive visualization system that natively supports provenance can be easily extended with novel interactions.
arXiv:1805.02622v1 fatcat:y4dm3b455fhdteq3q2c2aht2jy

Chinese Taoism in Eugene O'Neill's Marco Millions

Xin Li, Chengjun Wu
2021 OALib  
Eugene O'Neill, an American playwright, was deeply interested in Chinese Taoism. The profound influence of Taoism on him can be seen in many of his middle and late works.  ...  "O'Neill and Daoist Culture / Daoism and Eugene O'Neill's Plays (entirely in English): Blended MOOC Development" (X20305).  ... 
doi:10.4236/oalib.1107262 fatcat:j4mqbaiharbwdatfmj6ego4j6e

Precision Interfaces [article]

Haoci Zhang, Thibault Sellam, Eugene Wu
2017 arXiv   pre-print
Acknowledgements: We thank Yifan Wu for the initial inspiration, Anant Bhardwaj for data collection, Laura Rettig on early formulations of the problem, and the support of NSF 1527765 and 1564049.  ... 
arXiv:1704.03022v2 fatcat:n5h26hcizrh53doql7gwkbpvou

Indexing Cost Sensitive Prediction [article]

Leilani Battle, Edward Benson, Aditya Parameswaran, Eugene Wu
2014 arXiv   pre-print
Predictive models are often used for real-time decision making. However, typical machine learning techniques ignore feature evaluation cost and focus solely on the accuracy of the machine learning models obtained using all the features available. We develop algorithms and indexes to support cost-sensitive prediction, i.e., making decisions using machine learning models while taking feature evaluation cost into account. Given an item and an online computation cost (i.e., time) budget, we present two approaches to return an appropriately chosen machine learning model that will run within the specified time on the given item. The first approach returns the optimal machine learning model, i.e., the one with the highest accuracy that runs within the specified time, but requires significant up-front precomputation time. The second approach returns a possibly suboptimal machine learning model, but requires little up-front precomputation time. We study these two algorithms in detail and characterize the scenarios (using real and synthetic data) in which each performs well. Unlike prior work that focuses on a narrow domain or a specific algorithm, our techniques are very general: they apply to any cost-sensitive prediction scenario on any machine learning algorithm.
arXiv:1408.4072v1 fatcat:oorh5l2qknekbhkxa7um3gkjn4
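
The selection problem this abstract describes (return the most accurate model that still runs within a time budget) can be illustrated with a small sketch. This is not the paper's algorithm or index structure; it is a brute-force baseline, and the `Model` record, its field names, and the sample numbers are all hypothetical.

```python
from typing import NamedTuple, Optional, Sequence


class Model(NamedTuple):
    """Hypothetical record describing one candidate model."""
    name: str        # illustrative identifier
    cost_ms: float   # assumed time to evaluate the model's features on one item
    accuracy: float  # assumed held-out accuracy


def best_within_budget(models: Sequence[Model], budget_ms: float) -> Optional[Model]:
    """Return the most accurate model whose cost fits the budget, or None."""
    feasible = [m for m in models if m.cost_ms <= budget_ms]
    return max(feasible, key=lambda m: m.accuracy, default=None)


# Made-up candidates for illustration only.
models = [
    Model("cheap", 2.0, 0.71),
    Model("medium", 10.0, 0.83),
    Model("full", 50.0, 0.90),
]
```

A linear scan like this does no precomputation at all; the abstract's two approaches differ in how much up-front work they trade for faster or better answers at query time.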

View Composition Algebra for Ad Hoc Comparison [article]

Eugene Wu
2022 arXiv   pre-print
Eugene Wu is with Columbia University. Email: ewu@cs.columbia.edu.  ...  Eugene Wu is an Associate Professor at Columbia University. His research interests are in systems for human data management.  ... 
arXiv:2202.07836v1 fatcat:dx5rfoq375bvjmugzmpu4guwqi

Enabling SQL-based Training Data Debugging for Federated Learning [article]

Yejia Liu, Weiyuan Wu, Lampros Flokas, Jiannan Wang, Eugene Wu
2021 arXiv   pre-print
How can we debug a logistic regression model in a federated learning setting when the model behaves unexpectedly (e.g., the model rejects all high-income customers' loan applications)? The SQL-based training data debugging framework has proved effective at fixing this kind of issue in a non-federated learning setting. Given an unexpected query result over model predictions, this framework automatically removes the label errors from training data such that the unexpected behavior disappears in the retrained model. In this paper, we enable this powerful framework for federated learning. The key challenge is how to develop a security protocol for federated debugging that is provably secure, efficient, and accurate. Achieving this goal requires us to investigate how to seamlessly integrate techniques from multiple fields (Databases, Machine Learning, and Cybersecurity). We first propose FedRain, which extends Rain, the state-of-the-art SQL-based training data debugging framework, to our federated learning setting. We address several technical challenges to make FedRain work and analyze its security guarantee and time complexity. The analysis results show that FedRain falls short in terms of both efficiency and security. To overcome these limitations, we redesign our security protocol and propose Frog, a novel SQL-based training data debugging framework tailored for federated learning. Our theoretical analysis shows that Frog is more secure, more accurate, and more efficient than FedRain. We conduct extensive experiments using several real-world datasets and a case study. The experimental results are consistent with our theoretical analysis and validate the effectiveness of Frog in practice.
arXiv:2108.11884v1 fatcat:veq3cxlyajf5zbmgspy2csx6cm

Smoke: Fine-grained Lineage at Interactive Speed [article]

Fotis Psallidas, Eugene Wu
2018 arXiv   pre-print
Data lineage describes the relationship between individual input and output data items of a workflow, and has served as an integral ingredient for both traditional (e.g., debugging, auditing, data integration, and security) and emergent (e.g., interactive visualizations, iterative analytics, explanations, and cleaning) applications. The core, long-standing problem that lineage systems need to address, and the main focus of this paper, is to capture the relationships between input and output data items across a workflow with the goal of streamlining queries over lineage. Unfortunately, current lineage systems incur high lineage capture overheads, high lineage query processing costs, or both. As a result, applications that in principle can express their logic declaratively in lineage terms resort to hand-tuned implementations. To this end, we introduce Smoke, an in-memory database engine in which neither lineage capture overhead nor lineage query processing needs to be compromised. To do so, Smoke introduces tight integration of the lineage capture logic into physical database operators; efficient, write-optimized lineage representations for storage; and optimizations when future lineage queries are known up-front. Our experiments on microbenchmarks and realistic workloads show that Smoke reduces the lineage capture overhead and streamlines lineage queries by multiple orders of magnitude compared to state-of-the-art alternatives. Our experiments on real-world applications highlight that Smoke can meet the latency requirements of interactive visualizations (e.g., <150ms) and outperform hand-written implementations of data profiling primitives.
arXiv:1801.07237v1 fatcat:mgxaqqmfdrffdl4atpsfss72wu

Reptile: Aggregation-level Explanations for Hierarchical Data [article]

Zezhou Huang, Eugene Wu
2021 arXiv   pre-print
Recent query explanation systems help users understand anomalies in aggregation results by proposing predicates that describe input records that, if deleted, would resolve the anomalies. However, it can be difficult for users to understand how a predicate was chosen, and these approaches are limited to errors that can be resolved through deletion. In contrast, data errors may be due to group-wise errors, such as missing records or systematic value errors. This paper presents Reptile, an explanation system for hierarchical data. Given an anomalous aggregate query result, Reptile recommends the next drill-down attribute, and ranks the drill-down groups based on the extent to which repairing the group's statistics to its expected values resolves the anomaly. Reptile efficiently trains a multi-level model that leverages the data's hierarchy to estimate the expected values, and uses a factorised representation of the feature matrix to remove redundancies due to the data's hierarchical structure. We further extend model training to support factorised data, and develop a suite of optimizations that leverage the data's hierarchical structure. Reptile reduces end-to-end runtimes by more than 6 times compared to a Matlab-based implementation, correctly identifies 21/30 data errors in Johns Hopkins' COVID-19 data, and correctly resolves 20/22 complaints in a user study using data and researchers from Columbia University's Financial Instruments Sector Team.
arXiv:2103.07037v1 fatcat:fkr3nrceyfb77janwngnooexhm

Explaining Inference Queries with Bayesian Optimization [article]

Brandon Lockhart, Jinglin Peng, Weiyuan Wu, Jiannan Wang, Eugene Wu
2021 arXiv   pre-print
Obtaining an explanation for an SQL query result can enrich the analysis experience, reveal data errors, and provide deeper insight into the data. Inference query explanation seeks to explain unexpected aggregate query results on inference data; such queries are challenging to explain because an explanation may need to be derived from the source, training, or inference data in an ML pipeline. In this paper, we model an objective function as a black-box function and propose BOExplain, a novel framework for explaining inference queries using Bayesian optimization (BO). An explanation is a predicate defining the input tuples that should be removed so that the query result of interest is significantly affected. BO, a technique for finding the global optimum of a black-box function, is used to find the best predicate. We develop two new techniques (individual contribution encoding and warm start) to handle categorical variables. We perform experiments showing that the predicates found by BOExplain have a higher degree of explanation compared to those found by the state-of-the-art query explanation engines. We also show that BOExplain is effective at deriving explanations for inference queries from source and training data on a variety of real-world datasets. BOExplain is open-sourced as a Python package at https://github.com/sfu-db/BOExplain.
arXiv:2102.05308v2 fatcat:4jw2ulwmwfeofarlene5rviedq

Extending the View Composition Algebra to Hierarchical Data [article]

Eugene Wu
2022 arXiv   pre-print
Comparison is a core task in visual analysis. Although there are numerous guidelines to help users design effective visualizations to aid known comparison tasks, there are few formalisms that define the semantics of comparison operations in a way that can serve as the basis for a grammar of comparison interactions. Recent work proposed a formalism called View Composition Algebra (VCA) that enables ad hoc comparisons between any combination of marks, trends, or charts in a visualization ... . However, VCA limits comparisons to visual representations of data that have an identical schema, or where the schemas form a strict subset relationship (e.g., comparing price per state with price, but not with price per county). In contrast, the majority of real-world data (temporal, geographical, organizational) are hierarchical. To bridge this gap, this paper presents an extension to VCA (called VCAH) that enables ad hoc comparisons between visualizations of hierarchical data. VCAH leverages known hierarchical relationships to enable ad hoc comparison of data at different hierarchical granularities. We illustrate applications to hierarchical and Tableau visualizations.
arXiv:2205.01283v1 fatcat:rgobhlrzefgllbepscecoa7age

PopFactor: Live-Streamer Behavior and Popularity [article]

Robert Netzorg, Lauren Arnett, Augustin Chaintreau, Eugene Wu
2018 arXiv   pre-print
Live video-streaming platforms such as Twitch enable top content creators to reap significant profits and influence. To that effect, various behavioral norms are recommended to new entrants and those seeking to increase their popularity and success. Chief among them are to simply put in the effort and to promote on social media outlets such as Twitter, Instagram, and the like. But does following these behaviors indeed have a relationship with eventual popularity? In this paper, we collect a ... s of Twitch streamer popularity measures (spanning social and financial measures) and their behavior data on Twitch and third-party platforms. We also compile a set of community-defined behavioral norms. We then perform temporal analysis to identify the increased predictive value that a streamer's future behavior contributes to predicting future popularity. At the population level, we find that behavioral information improves the prediction of relative growth that exceeds the median streamer. At the individual level, we find that although it is difficult to quickly become successful in absolute terms, streamers that put in considerable effort are more successful than the rest, and that creating social media accounts to promote oneself is effective irrespective of when the accounts are created. Ultimately, we find that studying the popularity and success of content creators in the long term is a promising and rich research area.
arXiv:1812.03379v1 fatcat:h7fj5fvocreipafvrnfis5pntu

Continuous Prefetch for Interactive Data Applications [article]

Haneen Mohammed, Ziyun Wei, Eugene Wu, Ravi Netravali
2020 arXiv   pre-print
Interactive data visualization and exploration (DVE) applications are often network-bottlenecked due to bursty request patterns, large response sizes, and heterogeneous deployments over a range of networks and devices. This makes it difficult to ensure consistently low response times (< 100ms). Khameleon is a framework for DVE applications that uses a novel combination of prefetching and response tuning to dynamically trade off response quality for low latency. Khameleon exploits DVE's approximation tolerance: immediate lower-quality responses are preferable to waiting for complete results. To this end, Khameleon progressively encodes responses, and runs a server-side scheduler that proactively streams portions of responses using available bandwidth to maximize the user's perceived interactivity. The scheduler involves a complex optimization based on available resources, predicted user interactions, and response quality levels; yet, decisions must also be made in real time. To overcome this, Khameleon uses a fast greedy approximation that closely mimics the optimal approach. Using image exploration and visualization applications with real user interaction traces, we show that across a wide range of network and client resource conditions, Khameleon outperforms classic prefetching approaches that benefit from perfect prediction models: response latencies with Khameleon are never higher, and typically between 2 and 3 orders of magnitude lower, while response quality remains within 50%-80%.
arXiv:2007.07858v1 fatcat:2tyctxnj7vgivffeac3e34ldh4

Graphical Perception in Animated Bar Charts [article]

Eugene Wu, Lilong Jiang, Larry Xu, Arnab Nandi
2016 arXiv   pre-print
Eugene Wu is at Columbia University. E-mail: ewu@cs.columbia.edu • Lilong and Arnab are at Ohio State University. E-mail: {jianglil,arnab}@cse.osu.edu • Larry is at U.C. Berkeley.  ... 
arXiv:1604.00080v1 fatcat:f6at6o5jjrddjji3thdugfp64q

Mining Precision Interfaces From Query Logs [article]

Haoci Zhang, Thibault Sellam, Eugene Wu
2017 arXiv   pre-print
Acknowledgements: We thank Yifan Wu for the initial inspiration, Anant Bhardwaj for data collection, Laura Rettig on early formulations of the problem, and the support of NSF 1527765 and 1564049.  ... 
arXiv:1712.00078v1 fatcat:uurijgw7afac5ejqy7mwhv4qzm

Human-powered Sorts and Joins [article]

Adam Marcus, Eugene Wu, David Karger, Samuel Madden, Robert Miller
2011 arXiv   pre-print
Crowdsourcing markets like Amazon's Mechanical Turk (MTurk) make it possible to task people with small jobs, such as labeling images or looking up phone numbers, via a programmatic interface. MTurk tasks for processing datasets with humans are currently designed with significant reimplementation of common workflows and ad hoc selection of parameters such as price to pay per task. We describe how we have integrated crowds into a declarative workflow engine called Qurk to reduce the burden on workflow designers. In this paper, we focus on how to use humans to compare items for sorting and joining data, two of the most common operations in DBMSs. We describe our basic query interface and the user interface of the tasks we post to MTurk. We also propose a number of optimizations, including task batching, replacing pairwise comparisons with numerical ratings, and pre-filtering tables before joining them, which dramatically reduce the overall cost of running sorts and joins on the crowd. In an experiment joining two sets of images, we reduce the overall cost from $67 in a naive implementation to about $3, without substantially affecting accuracy or latency. In an end-to-end experiment, we reduced cost by a factor of 14.5.
arXiv:1109.6881v1 fatcat:qlkdzusob5dy3mbmplhl7dc3re
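
One optimization named in this abstract, replacing pairwise comparisons with numerical ratings, can be sketched as follows. This is only an illustration of why ratings need fewer crowd tasks, not Qurk's implementation: the function names, the one-task-per-pair and one-task-per-rating cost model, and the sample ratings are all assumptions.

```python
from statistics import mean
from typing import Dict, List


def pairwise_task_count(n_items: int) -> int:
    """Tasks needed if every pair of items is compared once (assumed cost model)."""
    return n_items * (n_items - 1) // 2


def rating_task_count(n_items: int, raters_per_item: int) -> int:
    """Tasks needed if each item is rated independently by several workers."""
    return n_items * raters_per_item


def sort_by_mean_rating(ratings: Dict[str, List[float]]) -> List[str]:
    """Order items from highest to lowest average crowd rating."""
    return sorted(ratings, key=lambda item: mean(ratings[item]), reverse=True)


# Made-up ratings on a 1-5 scale for illustration only.
example_ratings = {"a": [1, 2], "b": [5, 4], "c": [3, 3]}
ranking = sort_by_mean_rating(example_ratings)
```

Under this cost model the number of comparison tasks grows quadratically with the number of items while rating tasks grow linearly, which is the intuition behind the cost reduction; how much accuracy is retained is an empirical question that the sketch does not address.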