On issues concerning the assessment of information contained in aggregate data using the F-statistic
Piantadosi, J., Anderssen, R.S. and Boland, J. (eds) MODSIM2013, 20th International Congress on Modelling and Simulation
The analysis of aggregate data has been gaining momentum in statistics and allied disciplines (including public policy, political science and epidemiology) for more than 20 years. As a result, the issue has received an increasing amount of attention from categorical data analysts. Aggregate data analysis is quickly becoming unavoidable in many situations, especially when individual level data is unavailable. For example, the U.S. Justice Department uses aggregate data to formulate
public policies against racial discrimination; political scientists are interested in exploring the political or ideological preferences of different demographic groups; and social scientists use aggregate data to study the relationship between crime and unemployment. The availability of aggregate data has increased due to strict confidentiality restrictions imposed by government and corporate organisations that are reluctant to release individual level information. A wealth of contributions on this issue is available in the ecological inference (EI) literature, which considers the association structure between categorical variables (at the individual level) given only the aggregate information. The main difficulty in EI arises from the loss of information during aggregation, which results in aggregation bias. It is also a matter of concern for aggregate data analysts that the interpretation of the parameters from EI models might be entirely different to that of analogous parameters from the study of individual level data. An alternative strategy to EI is to consider the recently proposed Aggregate Association Index (AAI), which allows the analyst to quantify the overall extent of association between two dichotomous variables given only the aggregate, or marginal, information of a 2x2 table. Unlike EI, the AAI does not estimate, or model, the conditional proportions but focuses instead on gauging the extent of association between the variables. The AAI can also be further partitioned into positive and negative association terms, thus enabling the analyst to understand the more likely direction of the association given only the aggregate data. However, the major issue with the performance of the AAI is the impact that the sample size has on its magnitude. In this paper we investigate the informativeness of the aggregate data for inferring that an association exists between the variables of a 2x2 table.
This article introduces an F-test for assessing the statistical significance of the information contained in the aggregate data for inferring a statistically significant association between the variables. Unlike Pearson's chi-squared statistic, the F-statistic is robust to any change in the sample size and depends only on the aggregate information in the contingency table. Thus this statistic provides an opportunity to understand the structure of a 2x2 table without being influenced by sample size. The applicability of this test is demonstrated using Selikoff's (1981) asbestosis data, which was collected from 1117 insulation workers in New York City in 1963 to explore the link between asbestosis and occupational exposure to asbestos fibres. Such work was the key to establishing the link between asbestosis and mesothelioma. As a result of findings of this nature, many international government organisations have now banned the production, and importation, of goods that contain asbestos fibres.
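The sample size sensitivity of Pearson's chi-squared statistic referred to above can be illustrated with a minimal sketch. It is not the F-test or the AAI developed in the paper; it only shows the standard chi-squared statistic for a 2x2 table growing in proportion to the sample size when the cell proportions are held fixed. The cell counts and the function name `pearson_chi2_2x2` are hypothetical and chosen purely for illustration.

```python
def pearson_chi2_2x2(n11, n12, n21, n22):
    """Pearson's chi-squared statistic (no continuity correction) for a 2x2 table of counts."""
    n = n11 + n12 + n21 + n22          # total sample size
    r1, r2 = n11 + n12, n21 + n22      # row totals
    c1, c2 = n11 + n21, n12 + n22      # column totals
    return n * (n11 * n22 - n12 * n21) ** 2 / (r1 * r2 * c1 * c2)


# Hypothetical cell counts (not taken from the paper).
base = (30, 20, 10, 40)
# Same cell proportions, ten times the sample size.
scaled = tuple(10 * count for count in base)

x2_base = pearson_chi2_2x2(*base)
x2_scaled = pearson_chi2_2x2(*scaled)

# With proportions fixed, the statistic scales linearly with the sample size.
print(x2_base, x2_scaled, x2_scaled / x2_base)
```

Because the association structure (the cell proportions) is identical in both tables, the tenfold increase in the statistic is due to the sample size alone, which is the behaviour that motivates a measure depending only on the aggregate information.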