Visualizing bivariate long-tailed data

Justin S. Dyer, Art B. Owen
2011 Electronic Journal of Statistics  
Variables in large data sets in biology or e-commerce often have a head, made up of very frequent values, and a long tail of ever rarer values. Models such as the Zipf or Zipf-Mandelbrot distributions provide a good description. The problem we address here is the visualization of two such long-tailed variables, as one might see in a bivariate Zipf context. We introduce a copula plot to display these data. The plot uses the empirical rankings of the entities. We show that most of the data is accurately placed by these rankings, and so the plot is descriptive of the process generating the data, not just the data. In many ratings data sets we see strong head-to-tail affinities: popular entities of one type are strongly associated with rare entities of the opposite type. We present two generative models (saturation and bipartite preferential attachment) that show such qualitative behavior, and we characterize the power law behavior of the marginal distributions in these models.
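
As a rough illustration of the rank-based idea behind a copula plot, the sketch below maps each observation of a two-variable data set (for example, user and item in a ratings file) to the normalized frequency ranks of its two entities. The abstract does not give the authors' exact construction, so the function names `rank_fractions` and `copula_points`, the tie-breaking rule, and the normalization to (0, 1] are all assumptions made for this example, not the paper's method.

```python
from collections import Counter


def rank_fractions(labels):
    """Map each label to its normalized frequency rank in (0, 1].

    The most frequent label gets the smallest rank. Ties are broken by
    order of first appearance, an arbitrary choice made for this sketch.
    """
    counts = Counter(labels)
    # most_common() sorts by decreasing count; equal counts keep
    # the order in which the labels were first seen.
    ordered = [label for label, _ in counts.most_common()]
    n = len(ordered)
    return {label: (i + 1) / n for i, label in enumerate(ordered)}


def copula_points(pairs):
    """Turn (row_entity, column_entity) observations into rank coordinates.

    Each observation becomes a point (u, v), where u and v are the
    normalized frequency ranks of its row and column entities.
    """
    row_rank = rank_fractions([r for r, _ in pairs])
    col_rank = rank_fractions([c for _, c in pairs])
    return [(row_rank[r], col_rank[c]) for r, c in pairs]


# Hypothetical toy data: "a" and "x" are the head entities.
pairs = [("a", "x"), ("a", "y"), ("a", "x"), ("b", "x")]
points = copula_points(pairs)
```

Binning the resulting points into a two-dimensional histogram over the unit square would give one possible rendering of such a plot; in that layout, mass near the corners that pair a small rank with a large rank would correspond to the head-to-tail affinities mentioned above.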
doi:10.1214/11-ejs622 fatcat:zuts5pgqmrfpbml52e47zultgq