General Data Analysis Facilities


The statistics package is a direct implementation of the survival analysis package ASURV, developed by Eric Feigelson and Takashi Isobe at Pennsylvania State University.

Observational astronomers frequently encounter the situation where they observe a particular property (e.g., far infrared emission, calcium line absorption, or CO emission) of a previously defined sample of objects, but fail to detect all of them. The data then contains upper limits as well as detections, preventing the use of simple and familiar statistical techniques in the analysis. Statistical quantities frequently sought include the mean value of the measured variable for the sample, its distribution function, comparison of this distribution with other samples, and the relation of the measured variable to other properties of the objects.

Astronomers have recently recognized that a variety of statistical methods exists to deal with these problems, or have derived similar methods on their own. The methods are frequently collectively called "survival analysis" or the "analysis of lifetime data," after their origins in actuarial and related fields. This revision of the survival analysis package provides all of the methods described in the Astrophysical Journal papers of Feigelson and Nelson (1985) and Isobe, Feigelson, and Nelson (1986).

No statistical procedure can magically recover information that was never measured at the telescope. However, there is frequently important information implicit in the failure to detect some objects which can be partially recovered under reasonable assumptions. We purposely provide several two-sample tests, correlation tests and linear regressions--each based on different models of where upper limits truly lie--so that the astronomer can judge the importance of the different assumptions.

There are reasons for confidence in the methods. First, the volume of scholarly effort devoted to the problem is impressive. A dozen books and scores of papers have been published during the last decade. Second, enterprises involving thousands of lives and billions of dollars (such as the insurance industry) appear to believe in the methods. Third, Monte Carlo simulations of a typical astronomical problem (flux-limited satellite observations of a magnitude limited population of uniformly distributed objects) showed several methods to be quite reliable in this context (Isobe et al., 1986). More simulations of astronomical problems are needed.

There are also reasons for not applying these methods. We describe some of their known limitations in the section on cautions and caveats below.

Overview of Survival Analysis

Statistical methods dealing with censored data have a long and confusing history. It began in the 17th century with the development of actuarial mathematics for the life insurance industry and demographic studies. Astronomer Edmund Halley was one of the founders. In the mid-20th century, this application grew, along with biomedical and clinical research, into a major field of biometrics. Similar (and sometimes identical) statistical methods were being developed in two other fields: the theory of reliability, with industrial and engineering applications; and econometrics, with attempts to understand the relations between economic forces in groups of people. Finally, we find the same mathematical problems and some of the same solutions arising in modern astronomy as discussed above.

Let us give an illustration of the use of survival analysis in three disparate fields. Consider a linear regression problem. First, an epidemiologist wishes to determine how the longevity or "survival time" (the dependent variable) depends on the number of cigarettes smoked per day (the independent variable). The experiment lasts 10 years, during which some individuals die and others do not. The survival time of the living individuals is only known to be greater than their age when the experiment ends. These values are therefore "right censored data points." Second, a company manufacturing engines wishes to know the average time between breakdowns, as a function of engine speed, to determine the optimal operating range. Test engines are set running until 20% of them fail; the average "lifetime" and its dependence on speed are then calculated with 80% of the data points censored. Third, an astronomer observes a sample of nearby galaxies in the far-infrared to determine the relation between dust and molecular gas. Half have detected infrared luminosities but half are upper limits ("left censored data points"). The infrared luminosities are compared to CO observations, which themselves may contain censored values.

Astronomers have dealt with their upper limits in a number of fashions. One is to consider only detections in the analysis; while possibly acceptable for some purposes (e.g., regression), this will clearly bias the results in others (e.g., estimating luminosity functions). A second procedure considers the ratio of detected objects to observed objects in a given sample. When plotted in a range of luminosity bins, this has been called the "fractional luminosity function" and has been frequently used in extragalactic radio astronomy. This is mathematically the same as actuarial life tables. But it is sometimes used incorrectly in comparing different samples: the detected fraction clearly depends on the (usually uninteresting) distance to the objects as well as their (interesting) luminosity. Also, meaningful error bars cannot be assigned to these fractional luminosity functions.

A third procedure is to take all of the data, including the exact values of the upper limits, and model the properties of the parent population under certain mathematical constraints, such as maximizing the likelihood that the data fits the model. This technique, which is at the root of much of modern survival analysis, was independently developed by astrophysicist Yoram Avni (Avni et al., 1980; Avni and Tananbaum, 1986) and is now commonly used in X-ray astronomy.

Survival methods have the advantage of having been widely used in a range of disparate fields, so that their mathematical validity has been tested in other disciplines. Issues of great social importance (e.g., establishing Social Security benefits, strategies for dealing with the AIDS epidemic, MIL-STD reliability standards) are based on them. In detail, however, astronomers must study the assumptions underlying each model and be aware that some methods at the forefront of survival analysis may not be fully understood.

Overview of the Software

The statistical methods for dealing with censored data might be divided into a 2 x 2 grid: parametric vs. nonparametric, and univariate vs. multivariate. Parametric methods assume that the censored survival times are drawn from a parent population with a known distribution function (e.g., Gaussian, exponential); this is the principal assumption of industrial reliability models. Nonparametric models make no assumption about the parent population, and frequently rely on maximum-likelihood techniques. Univariate methods are devoted to determining the characteristics of the population from which the censored sample is drawn, and to comparing such characteristics for two or more censored samples. Bivariate methods measure whether the censored property of the sample is correlated with another property of the objects and, if so, perform a regression that quantifies the relationship between the two variables. There are few truly multivariate methods for censored data.

We will concentrate on the nonparametric models because the underlying distribution of astronomical populations is usually unknown. Some methods (e.g., EM algorithm regression), however, are fully parametric while others (e.g., Cox regression, Buckley-James regression) are partially parametric. Most of the methods are derived and described in more detail in Schmitt (1985), Feigelson and Nelson (1985) and Isobe et al. (1986).

Two tasks are univariate in nature: kmestimate and twosampt. kmestimate gives the Kaplan-Meier estimator for the distribution function of a randomly censored sample. First derived in 1958, it lies at the root of nonparametric survival analysis. It is the unique, self-consistent, generalized maximum-likelihood estimator for the population from which the sample was drawn. When formulated in cumulative form, it has analytic asymptotic (for large N) error bars. The median is always well-defined (but the mean is not, when the lowest point is an upper limit). It is identical to the differential "redistribute-to-the-right" algorithm independently derived by Avni et al. (1980), but when formulated in integral form it has a simple analytic error analysis.
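To make the product-limit construction concrete, here is a minimal Python sketch of the Kaplan-Meier estimator for right-censored data (astronomical upper limits are left-censored; the usual trick is to negate the values before applying it). The function name and interface are illustrative and are not those of the kmestimate task.

```python
def kaplan_meier(values, detected):
    """Kaplan-Meier product-limit estimator for right-censored data.

    values   -- observed measurements
    detected -- True for a detection, False for a censored point
    Returns a list of (value, S) survival-curve steps at each detection.
    """
    data = sorted(zip(values, detected))
    at_risk = len(data)          # number of points still "at risk"
    surv = 1.0                   # current value of the survival function
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = 0               # detections tied at this value
        ties = 0                 # all points tied at this value
        while i < len(data) and data[i][0] == t:
            if data[i][1]:
                deaths += 1
            ties += 1
            i += 1
        if deaths:
            # multiply in the conditional survival factor at this step
            surv *= 1.0 - deaths / at_risk
            curve.append((t, surv))
        at_risk -= ties          # drop both detections and censored ties
    return curve
```

For example, with detections at 1, 2, and 4 and a censored point at 3, the curve steps down to 0.75, 0.5, and 0.0; the censored point contributes by shrinking the at-risk count between steps.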

The second univariate task is twosampt, giving the variety of ways to test whether two censored samples are drawn from the same parent population. They are mostly generalizations of standard tests for uncensored data, such as the Wilcoxon and logrank nonparametric two-sample tests. They differ in how the censored data are weighted or "scored" in calculating the statistic. Thus, each is more sensitive to differences at one end or the other of the distribution. The Gehan and logrank tests are widely used in biometrics, while some of the others are not.
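The pairwise scoring idea behind these tests can be illustrated with Gehan's generalized Wilcoxon statistic. In this Python sketch (for right-censored data; negate values for upper limits), each cross-sample pair contributes +1, -1, or 0 according to whether the ordering of the true values is unambiguous given the censoring. The names are illustrative and do not reflect the twosampt interface.

```python
def gehan_score(x, x_det, y, y_det):
    """+1 if x is definitely larger than y, -1 if definitely smaller, else 0.

    A censored point under right censoring is a lower bound on its true value.
    """
    if x > y and y_det:
        # y is exact; x, whether exact or a lower bound, exceeds it
        return 1
    if x < y and x_det:
        # x is exact and lies below y, whether y is exact or a lower bound
        return -1
    return 0                     # ties and censored-vs-censored: indeterminate

def gehan_statistic(sample1, sample2):
    """Gehan's W for two samples given as lists of (value, detected) pairs."""
    return sum(gehan_score(x, dx, y, dy)
               for x, dx in sample1
               for y, dy in sample2)
```

Under the null hypothesis W has mean zero; the different two-sample tests amount to different weightings of these pairwise scores, which is why each is more sensitive at one end of the distribution than the other.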

There are a number of bivariate methods, including two correlation tests and three linear regression analyses. The Cox hazard model (correlation test), the EM algorithm, and the Buckley-James method (linear regressions) can treat several independent variables if the dependent variable contains only one kind of censoring (i.e., upper or lower limits). The generalized Kendall's tau (correlation test) and Schmitt's binned linear regression can treat mixed censoring (including censoring in the independent variable), but can have only one independent variable. We currently do not provide error analysis for Schmitt's method.
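The generalized Kendall's tau illustrates how a correlation test accommodates censoring in either variable: a pair of points contributes to the statistic only when the ordering of both coordinates is unambiguous. This Python sketch (right-censored convention; negate values for upper limits) computes the un-normalized sum of concordance scores; it is an illustrative reimplementation, not the package's code.

```python
def definite_sign(u, u_det, v, v_det):
    """Sign of (true u - true v) when unambiguous under right censoring, else 0."""
    if u > v and v_det:
        return 1
    if u < v and u_det:
        return -1
    return 0

def kendall_tau_statistic(data):
    """Un-normalized generalized Kendall's tau.

    data -- list of (x, x_detected, y, y_detected) tuples.
    Each pair of points adds a*b, where a and b are the definite signs of
    the x and y comparisons; indeterminate comparisons contribute zero.
    """
    s = 0
    for i in range(len(data)):
        for j in range(i + 1, len(data)):
            xi, dxi, yi, dyi = data[i]
            xj, dxj, yj, dyj = data[j]
            a = definite_sign(xi, dxi, xj, dxj)
            b = definite_sign(yi, dyi, yj, dyj)
            s += a * b           # concordant pairs add +1, discordant -1
    return s
```

Because ties and censored-vs-censored comparisons all score zero, heavily tied data silently lose information here, which is the source of the inaccuracy noted in the cautions section.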

Cautions and Caveats

The Kaplan-Meier estimator works with any underlying distribution (e.g., Gaussian, power law, bimodal), but only if the censoring is "random." That is, the probability that the measurement of an object is censored cannot depend on the value of the censored variable. At first glance, this may seem to be inapplicable to most astronomical problems: we detect the brighter objects in a sample, so the distribution of upper limits always depends on brightness. However, two factors often serve to randomize the censoring distribution. First, the censored variable may not be correlated with the variable by which the sample was initially identified. Thus, infrared observations of a sample of radio bright objects will be randomly censored if the radio and infrared emission are unrelated to each other. Second, astronomical objects in a sample usually lie at different distances, so that the brighter objects are not always the most luminous. In cases where the variable of interest is censored at a particular value (e.g., flux, spectral line equivalent width, stellar rotation rate) rather than randomly censored, parametric methods appropriate to "Type 1" censoring should be used. These are described by Lawless (1982) and Schneider (1986), but are not included in this package.

Thus, the censoring mechanisms of each study must be understood individually to judge whether the censoring is or is not likely to be random. The appearance of the data, even if the upper limits are clustered at one end of the distribution, is not a reliable measure. A frequent (if philosophically distasteful) way to avoid having to determine the nature of the censoring is to define the population of interest to be the observed sample. The Kaplan-Meier estimator then always gives a valid redistribution of the upper limits, though the result may not be applicable in wider contexts.

The two-sample tests are somewhat less restrictive than the Kaplan-Meier estimator, since they seek only to compare two samples rather than determine the true underlying distribution. The Gehan test assumes that the censoring patterns of the two samples are the same, but does not require that they be random. The logrank test results appear to be correct even when the censoring patterns differ somewhat. There is little known about the limitations of the other two-sample tests given here or the bivariate generalized Kendall's tau rank statistic.

The two bivariate correlation coefficients, generalized Kendall's tau and Cox regression, are both known to be inaccurate when many tied values are present in the data. Ties are particularly common when the data is binned. Caution should be used in these cases.

Cox regression, though widely used in biostatistical applications, formally applies only if the "hazard rate" is constant; that is, the cumulative distribution function of the censored variable falls exponentially with increasing values. Astronomical luminosity functions, in contrast, are frequently modelled by power law distributions. It is not clear whether a Cox regression is significantly affected by this difference.

There are several limitations to the three linear regression methods--emmethod, buckleyjames, and schmittbin--presented here. First, only schmittbin works when censoring is present in both variables. Second, emmethod requires that the residuals about the fitted line follow a Gaussian distribution. buckleyjames and schmittbin are less restrictive, requiring only that the censoring distribution about the fitted line is random. Both we and Schneider (1986) find little difference in the regression coefficients derived from these two methods. Third, there is considerable uncertainty regarding the error analysis of the regression coefficients of all three models. emmethod gives analytic formulas based on asymptotic normality, while Avni and Tananbaum (1986) numerically calculate and examine the likelihood surface. buckleyjames gives an analytic formula for the slope only, but it lies on a weak theoretical foundation. We do not provide Schmitt's (1985) bootstrap error analysis for schmittbin, partly because of the high computational expense for large data sets. Users may thus wish to run all methods and evaluate the uncertainty with caution. Fourth, use of schmittbin is questionable in certain cases with heavy censoring. The user must arbitrarily choose a bin size. If it is too small, many censored points at the end of the distribution will be changed to detected points. If the bins are too large, accuracy in the regression calculation is reduced. Fifth, the Buckley-James method has a defect in that the final solution occasionally oscillates rather than converging smoothly.
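The E-step/M-step cycle underlying emmethod can be sketched compactly. The following Python code is a simplified, illustrative reimplementation under the stated Gaussian-residual assumption, not the task's actual algorithm: upper limits are treated as left-censored (the true y lies below the observed limit), each limit is replaced by its conditional expectation under the current fit, and the line is refit until convergence. The residual scale update here uses only the completed residuals, a simplification of the full M-step.

```python
import math

def _norm_pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def _norm_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def _ols(x, y):
    """Ordinary least-squares fit; returns (intercept, slope)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return my - slope * mx, slope

def em_regression(x, y, detected, iters=50):
    """EM linear regression; y[i] is an upper limit when detected[i] is False."""
    a, b = _ols(x, y)            # start from a naive fit to all points
    sigma = 1.0                  # arbitrary initial residual scale
    for _ in range(iters):
        # E-step: replace each upper limit by E[y | y < limit] under
        # the current model y ~ N(a + b*x, sigma); for a truncated
        # Gaussian this is mu - sigma * pdf(z) / cdf(z), z = (limit - mu)/sigma.
        yc = []
        for xi, yi, di in zip(x, y, detected):
            if di:
                yc.append(yi)
            else:
                mu = a + b * xi
                z = (yi - mu) / sigma
                cdf = max(_norm_cdf(z), 1e-12)   # guard against underflow
                yc.append(mu - sigma * _norm_pdf(z) / cdf)
        # M-step: refit the line and residual scale on the completed data
        a, b = _ols(x, yc)
        n = len(x)
        sigma = max(math.sqrt(sum((yi - a - b * xi) ** 2
                                  for xi, yi in zip(x, yc)) / n), 1e-6)
    return a, b
```

With no censored points the E-step is a no-op and the result reduces to ordinary least squares, which is a useful sanity check when comparing the three regression tasks on the same data.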

If we consider the field of survival analysis from a broader perspective, we note a number of deficiencies with respect to censored statistical problems in astronomy. Most importantly, survival analysis assumes the upper limits in a given experiment are precisely known, while in astronomy they frequently represent n times the RMS noise level in the experimental detector, where n = 2, 3, 5, or some similar factor. Methods adapted to this situation are clearly needed. A related deficiency is the absence of weighted means or regressions. Survival analysis also has not yet produced any true multivariate techniques, such as a Principal Components Analysis that permits censoring.

Finally, we note that this package is not unique. Nearly a dozen software packages for analyzing censored data have been developed (Wagner and Meeker 1985). Four are part of large multi-purpose commercial statistical software systems: SAS, SPSS, BMDP, and GLIM. These packages are available on many university mainframes. We have found BMDP to be the most useful for astronomers (see the appendixes in Feigelson and Nelson 1985; Isobe et al. 1986 for instructions on its use). It provides most of the functions in kmestimate and twosampt as well as a full Cox regression, but no linear regressions. Other packages (CENSOR, DASH, LIMDEP, STATPAC, STAR, SURVAN, SURVREG) were written at various universities, medical institutes and industrial labs, and have not been evaluated.
