General Data Analysis Facilities

The STSDAS statistics package is a direct implementation of the survival analysis package ASURV developed by Eric Feigelson and Takashi Isobe at Pennsylvania State University.

Observational astronomers frequently encounter the situation where they observe a particular property (e.g., far infrared emission, calcium line absorption, or CO emission) of a previously defined sample of objects, but fail to detect all of them. The data then contains upper limits as well as detections, preventing the use of simple and familiar statistical techniques in the analysis. Statistical quantities frequently sought include the mean value of the measured variable for the sample, its distribution function, comparison of this distribution with other samples, and the relation of the measured variable to other properties of the objects.

Astronomers have recently recognized the existence of a variety of statistical methods suited to these problems, or have derived similar methods on their own. The methods are frequently collectively called "survival analysis" or the "analysis of lifetime data," from their origins in actuarial and related fields. This revision of the survival analysis package provides all of the methods described in the papers in the Astrophysical Journal (Feigelson and Nelson, 1985; Isobe, Feigelson and Nelson, 1986).

No statistical procedure can magically recover information that was never measured at the telescope. However, there is frequently important information implicit in the failure to detect some objects which can be partially recovered under reasonable assumptions. We purposely provide several two-sample tests, correlation tests and linear regressions--each based on different models of where upper limits truly lie--so that the astronomer can judge the importance of the different assumptions.

There are reasons for confidence in the methods. First, the volume of scholarly effort devoted to the problem is impressive. A dozen books and scores of papers have been published during the last decade. Second, enterprises involving thousands of lives and billions of dollars (such as the insurance industry) appear to believe in the methods. Third, Monte Carlo simulations of a typical astronomical problem (flux-limited satellite observations of a magnitude-limited population of uniformly distributed objects) showed several methods to be quite reliable in this context (Isobe et al., 1986). More simulations of astronomical problems are needed.

There are also reasons for not applying these methods. We describe some of their known limitations in the section on cautions and caveats.

Let us illustrate the use of survival analysis with a linear regression problem drawn from each of three disparate fields. First, an epidemiologist wishes to determine how the longevity or "survival time" (the dependent variable) depends on the number of cigarettes smoked per day (the independent variable). The experiment lasts 10 years, during which some individuals die and others do not. The survival time of the living individuals is only known to be greater than their age when the experiment ends. These values are therefore "right censored data points." Second, a company manufacturing engines wishes to know the average time between breakdowns, as a function of engine speed, to determine the optimal operating range. Test engines are set running until 20% of them fail; the average "lifetime" and its dependence on speed are then calculated with 80% of the data points censored. Third, an astronomer observes a sample of nearby galaxies in the far-infrared to determine the relation between dust and molecular gas. Half have detected infrared luminosities, while half are only upper limits ("left censored data points"). The infrared luminosities are compared to CO observations, which themselves may contain censored values.

Astronomers have dealt with their upper limits in a number of ways. One is to consider only detections in the analysis; while possibly acceptable for some purposes (e.g., regression), this will clearly bias the results in others (e.g., estimating luminosity functions). A second procedure considers the ratio of detected objects to observed objects in a given sample. When plotted over a range of luminosity bins, this has been called the "fractional luminosity function" and has been frequently used in extragalactic radio astronomy. It is mathematically equivalent to an actuarial life table. But it is sometimes used incorrectly in comparing different samples: the detected fraction clearly depends on the (usually uninteresting) distance to the objects as well as on their (interesting) luminosity. Also, meaningful error bars cannot be assigned to these fractional luminosity functions.
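The detected-fraction calculation just described is only a few lines of code. The following Python sketch is an illustration of the idea, not part of the STSDAS package; the function name and interface are hypothetical:

```python
import numpy as np

def fractional_luminosity_function(lum, detected, bins):
    """Fraction of observed objects detected in each luminosity bin.

    lum      : luminosities (detections) or upper limits (non-detections)
    detected : boolean array, True where the object was detected
    bins     : bin edges
    """
    lum = np.asarray(lum, dtype=float)
    detected = np.asarray(detected, dtype=bool)
    n_obs, _ = np.histogram(lum, bins=bins)            # all observed objects per bin
    n_det, _ = np.histogram(lum[detected], bins=bins)  # detections only
    with np.errstate(invalid="ignore", divide="ignore"):
        frac = np.where(n_obs > 0, n_det / n_obs, np.nan)
    return frac
```

Note that the fraction in each bin depends on the sample's distance distribution as well as on the luminosities, which is exactly the pitfall described above.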

A third procedure is to take all of the data, including the exact values of the upper limits, and model the properties of the parent population under certain mathematical constraints, such as maximizing the likelihood that the data fits the model. This technique, which is at the root of much of modern survival analysis, was independently developed by astrophysicist Yoram Avni (Avni et al., 1980; Avni and Tananbaum, 1986) and is now commonly used in X-ray astronomy.

Survival methods have the advantage of having been widely used in a range of disparate fields, so that their mathematical validity has been tested in other disciplines. Issues of great social importance (e.g., establishing Social Security benefits, strategies for dealing with the AIDS epidemic, MIL-STD reliability standards) are based on them. In detail, however, astronomers must study the assumptions underlying each model and be aware that some methods at the forefront of survival analysis may not be fully understood.

We will concentrate on the nonparametric models because the underlying distribution of astronomical populations is usually unknown. Some methods (e.g., EM algorithm regression), however, are fully parametric while others (e.g., Cox regression, Buckley-James regression) are partially parametric. Most of the methods are derived and described in more detail in Schmitt (1985), Feigelson and Nelson (1985) and Isobe et al. (1986).

Two tasks are univariate in nature: kmestimate and twosampt. kmestimate gives the Kaplan-Meier estimator for the distribution function of a randomly censored sample. First derived in 1958, it lies at the root of nonparametric survival analysis. It is the unique, self-consistent, generalized maximum-likelihood estimator for the population from which the sample was drawn. When formulated in cumulative form, it has analytic asymptotic (for large N) error bars. The median is always well-defined (but the mean is not when the lowest point in the sample is an upper limit). It is identical to the differential "redistribute-to-the-right" algorithm, independently derived by Avni et al. (1980), but when formulated in integral form it has a simple analytic error analysis.
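The product-limit calculation itself is short. The following Python sketch is an illustration only, not the kmestimate implementation (the function name is hypothetical). It is written for right-censored data; astronomical upper limits, which are left censored, can be handled by negating the values before use:

```python
import numpy as np

def kaplan_meier(times, censored):
    """Product-limit (Kaplan-Meier) survival estimate for right-censored data.

    times    : observed values (event value, or limit value if censored)
    censored : True where the value is only a lower bound (right-censored)
    Returns the distinct event values and the survival probability just
    after each one.
    """
    times = np.asarray(times, dtype=float)
    events = ~np.asarray(censored, dtype=bool)
    n = len(times)
    surv = 1.0
    out_t, out_s = [], []
    for t in np.unique(times[events]):
        at_risk = np.sum(times >= t)           # objects still "at risk" at t
        d = np.sum((times == t) & events)      # events exactly at t
        surv *= 1.0 - d / at_risk              # product-limit update
        out_t.append(t)
        out_s.append(surv)
    return np.array(out_t), np.array(out_s)
```

The estimate changes only at detected values; the censored points enter through the at-risk counts, which is how the upper limits are "redistributed."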

The second univariate task is twosampt, which provides a variety of ways to test whether two censored samples are drawn from the same parent population. The tests are mostly generalizations of standard tests for uncensored data, such as the Wilcoxon and logrank nonparametric two-sample tests. They differ in how the censored data are weighted or "scored" in calculating the statistic; thus, each is more sensitive to differences at one end or the other of the distribution. The Gehan and logrank tests are widely used in biometrics, while some of the others are not.
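The pairwise scoring idea behind the Gehan test can be sketched as follows. This is a minimal illustration for right-censored data, not the twosampt implementation; upper limits fit this convention after negating the data, and the function name is hypothetical. Each observation is scored by the number of observations it definitely exceeds minus the number it is definitely below, and the statistic is the sum of scores over the first sample, normalized by its permutation variance:

```python
import numpy as np

def gehan_test(x1, c1, x2, c2):
    """Gehan generalized Wilcoxon statistic for two right-censored samples.

    x1, x2 : values; c1, c2 : True where the value is censored (a lower limit).
    Returns (W, Z): the Gehan statistic and its approximate normal deviate.
    """
    x = np.concatenate([np.asarray(x1, float), np.asarray(x2, float)])
    event = ~np.concatenate([np.asarray(c1, bool), np.asarray(c2, bool)])
    n1, N = len(x1), len(x)

    def gt(i, j):
        # is x[i] definitely greater than x[j], given the censoring?
        if event[j]:
            return x[i] > x[j] if event[i] else x[i] >= x[j]
        return False  # x[j] is only a lower bound: order unknown

    # Gehan score for each observation in the combined sample
    U = np.array([sum(gt(i, j) for j in range(N)) -
                  sum(gt(j, i) for j in range(N)) for i in range(N)])
    W = U[:n1].sum()
    var = n1 * (N - n1) * (U ** 2).sum() / (N * (N - 1))
    return W, W / np.sqrt(var)
```

A large |Z| indicates the two samples are unlikely to come from the same parent population.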

There are a number of bivariate methods, including two correlation tests and three linear regression analyses. The Cox hazard model (a correlation test) and the EM algorithm and Buckley-James method (linear regressions) can treat several independent variables if the dependent variable contains only one kind of censoring (i.e., upper or lower limits). The generalized Kendall's tau (a correlation test) and Schmitt's binned linear regression can treat mixed censoring (including censoring in the independent variable), but can have only one independent variable. We currently do not provide error analysis for Schmitt's method.
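The pair-counting idea behind the generalized Kendall's tau can be sketched as follows. This is a simplified illustration assuming upper limits in both variables; it computes only the concordance statistic, not its significance level, and the function name is hypothetical:

```python
import numpy as np

def kendall_S(x, xlim, y, ylim):
    """Kendall-style concordance statistic for data with upper limits.

    x, y       : measured values (detections) or upper limits
    xlim, ylim : True where the value is an upper limit (true value <= reported)
    Only pairs whose ordering is definite despite the censoring contribute.
    """
    def sgn(a, la, b, lb):
        if not la and not lb:
            return np.sign(a - b)   # both detected: ordinary comparison
        if not la and lb and a > b:
            return 1.0              # detection above an upper limit
        if la and not lb and a < b:
            return -1.0             # upper limit below a detection
        return 0.0                  # order indeterminate
    n = len(x)
    S = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            S += sgn(x[i], xlim[i], x[j], xlim[j]) * \
                 sgn(y[i], ylim[i], y[j], ylim[j])
    return S
```

Concordant pairs add +1 and discordant pairs add -1, so S near its maximum indicates a strong monotone correlation; pairs made ambiguous by the censoring simply drop out, which is why heavy censoring (or many ties) weakens the test.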

Thus, the censoring mechanisms of each study must be understood individually to judge whether the censoring is or is not likely to be random. The appearance of the data, even if the upper limits are clustered at one end of the distribution, is not a reliable measure. A frequent (if philosophically distasteful) way to avoid having to determine the nature of the censoring is to define the population of interest to be the observed sample. The Kaplan-Meier estimator then always gives a valid redistribution of the upper limits, though the result may not be applicable in wider contexts.

The two-sample tests are somewhat less restrictive than the Kaplan-Meier estimator, since they seek only to compare two samples rather than determine the true underlying distribution. The Gehan test assumes that the censoring patterns of the two samples are the same, but does not require that they be random. The logrank test results appear to be correct even when the censoring patterns differ somewhat. There is little known about the limitations of the other two-sample tests given here or the bivariate generalized Kendall's tau rank statistic.

The two bivariate correlation coefficients, generalized Kendall's tau and Cox regression, are both known to be inaccurate when many tied values are present in the data. Ties are particularly common when the data is binned. Caution should be used in these cases.

Cox regression, though widely used in biostatistical applications, formally applies only if the "hazard rate" is constant; that is, the cumulative distribution function of the censored variable falls exponentially with increasing values. Astronomical luminosity functions, in contrast, are frequently modelled by power law distributions. It is not clear whether a Cox regression is significantly affected by this difference.
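The equivalence between a constant hazard rate and an exponential distribution can be made explicit. In standard survival-analysis notation (a sketch of the usual definitions, not notation used elsewhere in this package), with f(t) the probability density and S(t) the survival function:

```latex
h(t) = \frac{f(t)}{S(t)} = -\frac{d}{dt}\ln S(t),
\qquad
h(t) = \lambda \ \text{(constant)} \;\Longrightarrow\; S(t) = e^{-\lambda t} .
```

A power-law luminosity function therefore corresponds to a non-constant hazard rate, which is the source of the concern above.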

There are several limitations to the three linear regression methods--emmethod, buckleyjames, and schmittbin--presented here. First, only schmittbin works when censoring is present in both variables. Second, emmethod requires that the residuals about the fitted line follow a Gaussian distribution. buckleyjames and schmittbin are less restrictive, requiring only that the censoring distribution about the fitted line is random. Both we and Schneider (1986) find little difference in the regression coefficients derived from these two methods. Third, there is considerable uncertainty regarding the error analysis of the regression coefficients of all three models. emmethod gives analytic formulas based on asymptotic normality, while Avni and Tananbaum (1986) numerically calculate and examine the likelihood surface. buckleyjames gives an analytic formula for the slope only, but it lies on a weak theoretical foundation. We do not provide Schmitt's (1985) bootstrap error analysis for schmittbin, partly because of the high computational expense for large data sets. Users may thus wish to run all methods and evaluate the uncertainty with caution. Fourth, use of schmittbin is questionable in certain cases with heavy censoring. The user must arbitrarily choose a bin size. If it is too small, many censored points at the end of the distribution will be changed to detected points. If the bins are too large, accuracy in the regression calculation is reduced. Fifth, the Buckley-James method has a defect in that the final solution occasionally oscillates rather than converging smoothly.
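The iteration underlying an EM-style regression with Gaussian residuals can be sketched as follows. This is a simplified illustration, not the emmethod implementation: the variance update omits the truncated second-moment correction of a full EM step, and the function name is hypothetical. Each upper limit is replaced by the conditional mean of a normal residual truncated below the limit, and the line is refit by least squares:

```python
import numpy as np
from scipy.stats import norm

def em_censored_regression(x, y, upper, n_iter=50):
    """Simplified EM-style linear fit with upper limits in y.

    upper : True where y is only an upper limit (true value <= y).
    Returns (slope, intercept).
    """
    x = np.asarray(x, float)
    y = np.asarray(y, float)
    upper = np.asarray(upper, bool)
    yw = y.copy()
    b, a = np.polyfit(x, yw, 1)                  # initial slope, intercept
    sigma = np.std(yw - (a + b * x)) or 1.0
    for _ in range(n_iter):
        mu = a + b * x
        alpha = (y[upper] - mu[upper]) / sigma
        # E[residual | residual < alpha*sigma] for an N(0, sigma^2) residual
        lam = norm.pdf(alpha) / np.clip(norm.cdf(alpha), 1e-12, None)
        yw[upper] = mu[upper] - sigma * lam      # E-step: fill in censored y
        b, a = np.polyfit(x, yw, 1)              # M-step: refit the line
        sigma = np.std(yw - (a + b * x)) or sigma
    return b, a
```

With no censored points the loop reduces to ordinary least squares; with censoring, the filled-in values pull the fit below the reported limits, which is the behavior the Gaussian-residual assumption buys.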

If we consider the field of survival analysis from a broader perspective, we note a number of deficiencies with respect to censored statistical problems in astronomy. Most importantly, survival analysis assumes the upper limits in a given experiment are precisely known, while in astronomy they frequently represent n times the RMS noise level of the experimental detector, where n = 2, 3, 5, or some other choice. Methods adapted to this situation are clearly needed. A related deficiency is the absence of weighted means or regressions. Survival analysis also has not yet produced any true multivariate techniques, such as a Principal Components Analysis that permits censoring.

Finally, we note that this package is not unique. Nearly a dozen software packages for analyzing censored data have been developed (Wagner and Meeker 1985). Four are part of large multi-purpose commercial statistical software systems: SAS, SPSS, BMDP, and GLIM. These packages are available on many university mainframes. We have found BMDP to be the most useful for astronomers (see the appendices of Feigelson and Nelson 1985 and Isobe et al. 1986 for instructions on its use). It provides most of the functions in kmestimate and twosampt as well as a full Cox regression, but no linear regressions. Other packages (CENSOR, DASH, LIMDEP, STATPAC, STAR, SURVAN, SURVREG) were written at various universities, medical institutes and industrial labs, and have not been evaluated.

- Overview of Survival Analysis
- Overview of the Software
- Cautions and Caveats
