Measuring Crowdsourcing Effort with Error-Time Curves

Justin Cheng, Jaime Teevan, Michael S. Bernstein
Proceedings of the 33rd Annual ACM Conference on Human Factors in Computing Systems (CHI '15), 2015
After describing ETA, we explore the metric via four studies:

- Study 1: ETA vs. other measures of effort. For ten common microtasking primitives (e.g., multiple-choice questions, long-form text entry), we show that the ETA metric represents effort better than existing measures.
- Study 2: ETA vs. market price. We then compare ETA, as well as other measures, to the market prices of these primitives on a crowdsourcing platform.
- Study 3: Modeling perceptual costs. By augmenting ETA with measures of perceptual effort, we find we can better model a worker's perceived difficulty of a task.
- Study 4: Tasks without ground truth. To capture how well people do a task, ETA requires ground truth. We extend the metric to also work for subjective tasks.

We then demonstrate how ETA can be used for rapidly prototyping tasks. ETA makes it possible to characterize tasks in terms of their monetary cost and human effort, and paves the way for better task design, payment, and allocation.

RELATED WORK

Measures of task difficulty or mental workload can be roughly separated into two categories: subjective and objective measures. Subjective measures include multidimensional workload assessment tools such as the NASA Task Load Index (TLX) [10] and time estimates [5]. However, such measures tend to be inaccurate and hard to capture. It is difficult for requesters to accurately estimate task difficulty, as experts categorically underestimate novices' completion times and difficulty [12, 13]. Workers are also inaccurate: subjective metrics collected from workers tend to correlate poorly with each other (e.g., between self-reported effort, self-reported difficulty, and response time [7]), and workers exhibit large variance because they use different ranges of the rating scale [11]. Worker-driven subjective task judgments sometimes appear on websites such as Turkopticon and mTurk Grind. However, these reviews are limited in number, lag the marketplace by hours, and are not available for all tasks. Our own experiments reveal that many subjective metrics, while correlated with effort, are not directly interpretable and cannot differentiate between similar tasks.
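To make the error-time idea concrete, the following is a minimal sketch of how an area under an error-vs-time curve could be computed from sampled data points. This is an illustration only, not the paper's actual ETA procedure: the function name, the trapezoidal approximation, and the sample data are all assumptions introduced here.

```python
# Hypothetical sketch: approximate the area under an error-vs-time curve
# with the trapezoidal rule. Names and data are illustrative, not from
# the paper; the full ETA definition is given in the original work.

def error_time_area(times, errors):
    """Approximate the area under an error-time curve via trapezoids.

    times  -- increasing time limits (e.g., seconds allotted per task)
    errors -- observed error rate at each corresponding time limit
    """
    if len(times) != len(errors) or len(times) < 2:
        raise ValueError("need at least two (time, error) points")
    area = 0.0
    for (t0, e0), (t1, e1) in zip(
        zip(times, errors), zip(times[1:], errors[1:])
    ):
        # Area of the trapezoid between consecutive samples.
        area += (t1 - t0) * (e0 + e1) / 2.0
    return area


# Example: error rate falls as workers are given more time.
times = [1, 2, 4, 8]            # hypothetical time limits (seconds)
errors = [0.9, 0.5, 0.2, 0.1]   # hypothetical error rates
print(error_time_area(times, errors))  # → 2.0
```

Under this sketch, a task whose error rate stays high for longer accumulates more area, i.e., demands more effort before workers reach a given accuracy.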
doi:10.1145/2702123.2702145