Analysis on binary responses with ordered covariates and missing data

Jeremy M. G. Taylor, Lu Wang, Zhiguo Li
2007 Statistics in Medicine  
We consider the situation of two ordered categorical variables and a binary outcome variable, where one or both of the categorical variables may have missing values. The goal is to estimate the probability of response of the outcome variable for each cell of the contingency table of categorical variables while incorporating the fact that the categorical variables are ordered. The probability of response is assumed to change monotonically as each of the categorical variables changes level. A probability model is used in which the response is binomial with parameters p_ij for each cell (i, j) and the number of observations in each cell is multinomial. Estimation approaches that incorporate Gibbs sampling with order restrictions on p_ij induced via a prior distribution, two-dimensional isotonic regression and multiple imputation to handle missing values are considered. The methods are compared in a simulation study. Using a fully Bayesian approach with a strong prior distribution to induce ordering can lead to large gains in efficiency, but can also induce bias. Utilizing isotonic regression can lead to modest gains in efficiency, while minimizing bias and guaranteeing that the order constraints are satisfied. A hybrid of isotonic regression and Gibbs sampling appears to work well across a variety of scenarios. The methods are applied to a pancreatic cancer case-control study with two biomarkers.
The particular research area that motivated this work comes from studies involving cancer biomarkers. There is considerable interest in discovering and assessing the molecular properties of tumors, normal tissues and serum from cancer patients and relating these properties to outcome variables, such as response to treatment or survival or case-control status. The researcher will frequently store specimens, such as a piece of the tumor or normal tissue after surgery, or a vial of serum for each patient. These specimens are later tested to determine specific molecular properties. The particular application we discuss later is from a case-control study of pancreatic cancer, with two serum biomarkers measured. The two biomarkers are CA-19-9 and CA-125, which are known to be relevant in the development and progression of pancreatic cancer. It is biologically reasonable to assume that the probability of being a case changes monotonically as the biomarker values change.
Furthermore, since these two biomarkers measure different aspects of the biology of cancer, it is plausible that a combination of them may be useful for predicting the outcome variable. The overall goal is to understand the relationship between the outcome variable and the combination of covariates while utilizing the fact that the covariates are ordered. By utilizing the ordering we hope to gain efficiency compared to ignoring the ordering; this may be particularly useful in small studies. In studies of this type, missing data in one or both of the biomarkers is common. Sometimes the assay does not work for biological reasons; sometimes the specimen is missing, too degraded, or of insufficient volume for the assay to run. Since the response is measured and one of the biomarkers may be measured, it would be inappropriate to discard the observation.
There is a considerable literature on statistical models and methods for ordered categorical variables and inference in the presence of monotonicity or order restrictions [1-8]. In this paper we will focus on the situation of a response variable Y and one or more ordered categorical explanatory variables X, and the general monotonicity constraint we are interested in is that if
Isotonic regression is a well-known approach for estimation in a regression model with a single explanatory variable and a continuous response. The pooled adjacent violators algorithm ensures that the response function is a monotonic function of the explanatory variable. The asymptotic convergence of the estimator does not follow the usual root-n rate, which presents a problem for calculating standard errors and confidence intervals, particularly in small samples. If there are two or more explanatory variables, the concept of isotonic regression generalizes quite naturally, although the algorithms to estimate the response surface are considerably more complex [9].
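To make the single-variable case concrete, the following is a minimal sketch of a weighted pool-adjacent-violators pass in Python. It is an illustration only, not the authors' implementation: the function name pava, the optional weights argument, and the idea of weighting cell proportions by cell counts are assumptions introduced here, and the sketch does not cover the more complex two-dimensional case.

```python
def pava(y, w=None):
    """Nondecreasing least-squares fit to y by pooling adjacent violators.

    y : observed values in the order of the explanatory variable
    w : optional positive weights (hypothetical use: cell sample sizes)
    """
    if w is None:
        w = [1.0] * len(y)
    # Each block stores [weighted mean, total weight, number of points pooled].
    blocks = []
    for yi, wi in zip(y, w):
        blocks.append([float(yi), float(wi), 1])
        # Pool backwards while the last two blocks violate monotonicity.
        while len(blocks) > 1 and blocks[-2][0] > blocks[-1][0]:
            m2, w2, c2 = blocks.pop()
            m1, w1, c1 = blocks.pop()
            total = w1 + w2
            blocks.append([(m1 * w1 + m2 * w2) / total, total, c1 + c2])
    # Expand each block back to one fitted value per original point.
    fit = []
    for mean, _, count in blocks:
        fit.extend([mean] * count)
    return fit
```

For example, pava([1, 3, 2, 4]) pools the violating pair (3, 2) into their mean and returns [1.0, 2.5, 2.5, 4.0]. Applied to observed response proportions with cell counts as weights, the same pass would give monotone estimates along one ordered variable.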
In a Bayesian approach, the ordering can in general be introduced through prior distributions. For example, if the order restriction is on the parameters of the model, say θ_1 < θ_2, then an appropriate prior would have P(θ_1 < θ_2) = 1. If it is possible to obtain draws of θ_1 and θ_2 from the posterior distribution without the order restriction that θ_1 < θ_2, then it is a simple matter to discard draws that violate the restriction to obtain draws from the desired posterior distribution. For example, in the Gibbs sampling scheme, the parameter θ_1 is drawn from its unconstrained conditional posterior distribution, but is discarded if it is greater than the current value of θ_2, and a new value of θ_1 is drawn until one satisfying the constraint θ_1 < θ_2 is obtained [4]. This is followed by a draw of θ_2, which must be larger than the latest value of θ_1, and so on.
In a recent article, Dunson and Neelon [10] developed a hybrid of isotonic regression and Gibbs sampling. In particular, they fit a model without order restrictions using Bayesian methods, but also apply isotonic regression within the Gibbs sampling algorithm. We will consider an adaptation of this as one of our approaches.
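The draw-and-discard Gibbs step described above can be sketched for the simplest possible case: two binomial response probabilities with the restriction p1 < p2. This is a hedged illustration under assumptions that are not taken from the paper: independent Beta(a, b) priors restricted to the ordered region, the function name constrained_gibbs, the starting values, and the default settings are all hypothetical.

```python
import random


def constrained_gibbs(y1, n1, y2, n2, n_iter=2000, a=1.0, b=1.0, seed=0):
    """Gibbs sampler for two binomial probabilities subject to p1 < p2.

    y1, n1 : responders and sample size in the lower-ordered group
    y2, n2 : responders and sample size in the higher-ordered group
    a, b   : Beta prior parameters (assumed here for illustration)
    """
    rng = random.Random(seed)
    p1, p2 = 0.25, 0.75  # starting values that satisfy the constraint
    draws = []
    for _ in range(n_iter):
        # Draw p1 from its unconstrained conditional posterior and
        # discard candidates that are not below the current p2.
        while True:
            cand = rng.betavariate(a + y1, b + n1 - y1)
            if cand < p2:
                p1 = cand
                break
        # Draw p2 the same way, keeping only values above the latest p1.
        while True:
            cand = rng.betavariate(a + y2, b + n2 - y2)
            if cand > p1:
                p2 = cand
                break
        draws.append((p1, p2))
    return draws
```

Every retained pair satisfies p1 < p2 by construction, even when the raw proportions are out of order (for instance y1/n1 = 6/10 and y2/n2 = 5/10). The redraw loops can become slow when the data strongly contradict the assumed ordering, which is one practical reason to consider alternatives such as the hybrid approach.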
doi:10.1002/sim.2815 pmid:17219376 fatcat:gznhe76e5rer5hetbyawveikea