Big Data: Pitfalls, Methods and Concepts for an Emergent Field

Zeynep Tufekci
2013 Social Science Research Network  
Big Data, large-scale aggregate databases of imprints of online and social media activity, has captured scientific and policy attention. However, this emergent field is challenged by inadequate attention to methodological and conceptual issues. I review key methodological and conceptual challenges, including:
1. Inadequate attention to the implicit and explicit structural biases of the platform(s) most frequently used to generate datasets (the model organism problem).
2. The common practice of selecting on the dependent variable without corresponding attention to the complications of this path.
3. Lack of clarity with regard to sampling, universe and representativeness (the denominator problem).
4. Most big data analyses come from a single platform (hence missing the ecology of information flows).
Conceptual issues include:
1. More research is needed to interpret aggregated mediated interactions. Clicks, status updates, links, retweets, etc. are complex social interactions.
2. Network methods imported from other fields need to be carefully reconsidered to evaluate their appropriateness for analyzing human social media imprints.
3. Most big datasets contain information only on "node-to-node" interaction. However, "field" effects (events that affect a society or a group in a wholesale fashion, either through shared experience or through broadcast media) are an important part of human sociocultural experience.
4. Human reflexivity (that humans will alter their behavior around metrics) needs to be assumed and built into the analysis.
5. Assuming additivity and counting interactions so that each new interaction is seen as (n+1), without regard to semantics or context, can be misleading.
6. The relationship between network structure and other attributes is complex and multi-faceted.

The dramatic proliferation of technologically mediated human interaction produces online imprints which are increasingly aggregated into large databases. Such large datasets, especially of social media imprints, commonly referred to as big data, have been analyzed by scholars, corporations, politicians, journalists, and governments (Lazer et al., 2009; boyd & Crawford, 2012). Although big data is variously touted as the key to rigor in social science and as an important basis for policy, this emergent field suffers from inadequate attention to methodological and conceptual issues. Methodological issues which will be examined in this paper include the following:
1. Inadequate attention to the implicit and explicit structural biases of the platform(s) most frequently used to generate datasets (the model organism problem);
2. The common practice of selecting on the dependent variable without corresponding attention to the complications of this path. (Most hashtag analyses, for example, involve selecting on the dependent variable.)
3. Lack of clarity with regard to sampling, universe and representativeness (the denominator problem);
4. Most big data analyses come from a single platform (hence missing the ecology and the natural setting of information flows and interaction).
The conceptual issues related to big data analysis that are examined in this paper include:
1. More research is needed to interpret aggregated mediated interactions. Clicks, status updates, links, retweets, etc. are complex social interactions with varying meanings, logics and implications.
2. Network methods imported from other fields need to be carefully and thoroughly reconsidered to evaluate their appropriateness for analyzing human social media imprints.

BIG DATA: PITFALLS, METHODS AND CONCEPTS FOR AN EMERGENT FIELD. Draft Paper by Zeynep Tufekci, zst@princeton.edu

(Haythornthwaite, 2002) suggests something different from that: that the stronger the tie, the more means of communication are employed, not that "one medium of communication implies communication by other means". In other words, there is no assumption that using one medium implies communication by other means as well. It depends on the context, the strength of the tie, the content of the message, the availability of the communication, and the suitability of the medium, among other factors. These challenges do not mean that nothing valuable can be learned from single-platform analyses. However, all such analyses must take into account that they are not examining a closed system, and that there is likely no justification for some of the broader claims. (The Onnela et al. (2007) study certainly has interesting results and was published in the prestigious PNAS.)
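
The problem of selecting on the dependent variable, listed among the methodological issues above, can be illustrated with a small simulation. This is a minimal sketch and not taken from the paper: the population of "campaigns", the probabilities, and the function names (`simulate_campaigns`, `feature_rate`) are all hypothetical assumptions chosen for illustration. Each simulated campaign either has some feature or not and either succeeds or not, and the two are generated independently, so the feature has no effect on success by construction.

```python
import random


def simulate_campaigns(n=10_000, p_feature=0.8, p_success=0.05, seed=0):
    """Return a list of (has_feature, succeeded) pairs.

    Hypothetical setup: the feature and the outcome are drawn
    independently, so the feature cannot explain success.
    """
    rng = random.Random(seed)
    return [
        (rng.random() < p_feature, rng.random() < p_success)
        for _ in range(n)
    ]


def feature_rate(campaigns, succeeded_value):
    """Share of campaigns with the feature, among those whose outcome
    equals succeeded_value."""
    subset = [
        has_feature
        for has_feature, succeeded in campaigns
        if succeeded == succeeded_value
    ]
    return sum(subset) / len(subset)


campaigns = simulate_campaigns()

# What an analysis restricted to successful cases would see.
rate_among_successes = feature_rate(campaigns, True)

# What that analysis never observes: the cases that did not succeed.
rate_among_failures = feature_rate(campaigns, False)

print(f"feature rate among successes: {rate_among_successes:.2f}")
print(f"feature rate among failures:  {rate_among_failures:.2f}")
```

An analyst who sees only the successful cases finds the feature in most of them and may be tempted to read it as a cause of success. The comparison with the unsuccessful cases, which a sample selected on the outcome does not contain, shows roughly the same rate in both groups.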
doi:10.2139/ssrn.2229952 fatcat:lwoqgonqsjhwbhpnpobn63duma