Making the Most of Missing Values: Object Clustering with Partial Data in Astronomy


Paper:	Making the Most of Missing Values: Object Clustering with Partial Data in Astronomy
Volume:	347, Astronomical Data Analysis Software and Systems XIV
Page:	172
Authors:	Wagstaff, K.L.; Laidler, V.G.
Abstract:	Modern classification and clustering techniques analyze collections of objects that are described by a set of useful features or parameters. Clustering methods group the objects in that feature space to identify distinct, well-separated subsets of the data set. However, real observational data may contain missing values for some features. A “shape” feature may not be well defined for objects close to the detection limit, and objects of extreme color may be unobservable at some wavelengths. The usual methods for handling data with missing values, such as imputation (estimating the missing values) or marginalization (deleting all objects with missing values), rely on the assumption that missing values occur by random chance. While this is a reasonable assumption in other disciplines, the fact that a value is missing in an astronomical catalog may be physically meaningful. We demonstrate a clustering analysis algorithm, KSC, that a) does not impute values and b) does not discard the partially observed objects. KSC uses soft constraints defined by the fully observed objects to assist in the grouping of objects with missing values. We present an analysis of objects taken from the Sloan Digital Sky Survey to demonstrate how imputing the values can be misleading and why the KSC approach can produce more appropriate results.
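
The two conventional approaches that the abstract contrasts with KSC can be illustrated with a minimal sketch. This is not the KSC algorithm, whose details are not given here. The toy catalog, the function names `marginalize` and `impute_mean`, and the choice of column-mean imputation are all assumptions made for illustration only.

```python
import numpy as np

# Hypothetical toy catalog: rows are objects, columns are two features.
# NaN marks a feature that was not measured for that object.
catalog = np.array([
    [1.0, 2.0],
    [1.2, np.nan],
    [5.0, 6.0],
    [np.nan, 6.2],
])


def marginalize(data):
    """Drop every object (row) that has at least one missing value."""
    complete = ~np.isnan(data).any(axis=1)
    return data[complete]


def impute_mean(data):
    """Fill each missing value with the mean of the observed values in its column."""
    filled = data.copy()
    col_means = np.nanmean(data, axis=0)
    rows, cols = np.where(np.isnan(filled))
    filled[rows, cols] = col_means[cols]
    return filled


kept = marginalize(catalog)    # only the fully observed objects remain
filled = impute_mean(catalog)  # all objects remain, with estimated values
```

In this toy data, marginalization discards half of the objects, while mean imputation assigns the second object a value for its missing feature that lies between the two apparent groups. Both effects are examples of the kind of distortion the abstract warns about when a missing value is not simply the result of random chance.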