Ecological inference for 2 × 2 tables
A fundamental problem in many disciplines, including political science, sociology and epidemiology, is the examination of the association between two binary variables across a series of 2 × 2 tables, when only the margins are observed, and one of the margins is fixed. Two unobserved fractions are of interest, with only a single response per table, and this non-identifiability is the inherent difficulty at the heart of ecological inference. Many methods have been suggested for ecological inference, often without a probabilistic model; we clarify the form of the sampling distribution and critique previous approaches within a formal statistical framework, which allows the assumptions required under each approach to be stated and examined. A particularly difficult problem is choosing between models with and without contextual effects. Various Bayesian hierarchical modelling approaches are proposed to allow the formal inclusion of supplementary data, and/or prior information, without which ecological inference is unreliable. Careful choice of the prior within such models is required, however, since there may be considerable sensitivity to this choice, even when the model assumed is correct and there are no contextual effects. This sensitivity is shown to be a function of the number of areas and the distribution of the proportions in the fixed margin across areas. By explicitly providing a likelihood for each table, the combination of individual level survey data and aggregate level data is straightforward and we illustrate that survey data can be highly informative, particularly if these data are from a survey of the minority population within each area. This strategy is related to designs that are used in survey sampling and in epidemiology. An approximation to the suggested likelihood is discussed, and various computational approaches are described.
Some extensions are outlined including the consideration of multiway tables, spatial dependence and area-specific (contextual) variables. Voter registration–race data from 64 counties in the US state of Louisiana are used to illustrate the methods.
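As an illustration of the non-identifiability described above, the following minimal sketch computes the deterministic limits that the observed margins of a single table place on the two unobserved fractions, in the spirit of the method of bounds listed among the keywords. The notation is an assumption made for this sketch and is not taken from the article: `x` is the proportion of the area's population in the first group (the fixed margin), `q` is the observed overall proportion with the response, and `p0`, `p1` are the unobserved group-specific fractions, linked by the accounting identity q = x·p1 + (1 − x)·p0.

```python
def margin_bounds(x, q):
    """Deterministic bounds on the unobserved fractions of one 2 x 2 table.

    Hypothetical notation (not from the article):
      x : proportion of the population in group 1 (fixed margin), 0 < x < 1
      q : observed overall proportion with the response, 0 <= q <= 1
    The identity q = x * p1 + (1 - x) * p0, with p0 and p1 in [0, 1],
    restricts each fraction to an interval.
    Returns ((p0_low, p0_high), (p1_low, p1_high)).
    """
    if not 0.0 < x < 1.0:
        raise ValueError("x must lie strictly between 0 and 1")
    if not 0.0 <= q <= 1.0:
        raise ValueError("q must lie between 0 and 1")

    # Bounds on p1: set p0 to its extremes (1 and 0) in the identity.
    p1_low = max(0.0, (q - (1.0 - x)) / x)
    p1_high = min(1.0, q / x)

    # Bounds on p0: set p1 to its extremes (1 and 0) in the identity.
    p0_low = max(0.0, (q - x) / (1.0 - x))
    p0_high = min(1.0, q / (1.0 - x))

    return (p0_low, p0_high), (p1_low, p1_high)


# Example with made-up numbers: 30% of the area is in group 1 and
# 60% of the area shows the response.
p0_bounds, p1_bounds = margin_bounds(x=0.3, q=0.6)
print("p0 in", p0_bounds, "p1 in", p1_bounds)
```

With these made-up inputs the interval for `p1` is the whole of [0, 1], so the margins alone say nothing about the smaller group; this is one way to see why supplementary data or prior information is needed.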
Keywords: Auxiliary variables; Contextual effects; Ecological fallacy; Ecological regression; Extended hypergeometric distribution; Hierarchical models; Identifiability; Markov chain Monte Carlo methods; Method of bounds; Missing data; Neighbourhood model; Simpson's paradox; Spatial epidemiology; Survey sampling
Document Type: Research Article
Affiliations: University of Washington, Seattle, USA
Publication date: 2004-08-01