Parametric Statistical Classification by the Minimum Integrated Square Error Criterion

Eric Chi, Rice University


Maximizing the likelihood function is the classical approach to parametric estimation and classification. Under fairly general conditions it is asymptotically efficient, provided the model is specified correctly. Correctly specifying a model, however, is no trivial task, if it is possible at all. Even a few outliers in an otherwise clean sample can produce a very poor fit.

In contrast, minimizing the integrated square error, while less efficient, is robust to a fair amount of contamination. This approach is especially well suited to massive data sets, where manually inspecting the data for outliers is impractical.
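To make the criterion concrete, here is a minimal sketch of minimum integrated square error (L2E) estimation for a univariate Gaussian location problem with contaminated data. The criterion expands the integrated square error as ∫f(x|θ)² dx − (2/n)Σᵢ f(xᵢ|θ) up to a constant; for a normal density the first term has the closed form 1/(2σ√π). The data, seed, fixed robust scale, and grid search are all illustrative choices, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
# 95% clean observations from N(0, 1), 5% outliers from N(10, 1)
x = np.concatenate([rng.normal(0.0, 1.0, 190), rng.normal(10.0, 1.0, 10)])

# Fix the scale at a robust estimate (MAD) and fit only the location,
# to keep the sketch one-dimensional.
sigma = 1.4826 * np.median(np.abs(x - np.median(x)))

def l2e(mu):
    # L2E criterion for N(mu, sigma):
    #   integral of f^2 = 1 / (2 sigma sqrt(pi)), a closed form;
    #   the cross term is estimated by the sample mean of the density.
    dens = np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
    return 1.0 / (2.0 * sigma * np.sqrt(np.pi)) - 2.0 * dens.mean()

# Simple grid search over candidate locations (illustrative optimizer).
grid = np.linspace(x.min(), x.max(), 2001)
mu_l2e = grid[np.argmin([l2e(m) for m in grid])]
mu_mle = x.mean()  # the Gaussian MLE for the mean is the sample average
```

The sample mean (the MLE) is dragged toward the outlier cluster, while the L2E location stays near the clean component, illustrating the robustness traded for efficiency.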

We present the latter approach in the context of classification with logistic regression models where the training set size is much smaller than the number of covariates, the so-called small-n, large-p problem. We show that this optimization problem can be cast as an iterated least squares problem and compare its performance to likelihood methods on synthetic mixture data as well as some genomic data.
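For reference, the likelihood baseline mentioned above is also solved by an iterated least squares scheme: Newton's method for the logistic MLE reduces to iteratively reweighted least squares (IRLS). The sketch below shows that classical solver, not the authors' L2E variant; the simulated data, seed, and function name are illustrative.

```python
import numpy as np

def logistic_irls(X, y, n_iter=25):
    """Fit a logistic regression MLE by iteratively reweighted least squares."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))          # fitted probabilities
        w = np.clip(p * (1.0 - p), 1e-10, None)      # working weights
        z = X @ beta + (y - p) / w                   # working response
        # Weighted least squares step: solve (X' W X) beta = X' W z
        beta = np.linalg.solve(X.T @ (w[:, None] * X), X.T @ (w * z))
    return beta

# Illustrative simulated data: intercept plus two covariates.
rng = np.random.default_rng(1)
X = np.column_stack([np.ones(200), rng.normal(size=(200, 2))])
beta_true = np.array([0.5, 1.0, -1.0])
y = rng.binomial(1, 1.0 / (1.0 + np.exp(-X @ beta_true)))

beta_hat = logistic_irls(X, y)
accuracy = np.mean((1.0 / (1.0 + np.exp(-X @ beta_hat)) > 0.5) == y)
```

Each IRLS step is an ordinary weighted least squares solve, which is what makes least-squares reformulations of logistic-type objectives computationally attractive.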
 

Abstract Author(s): Eric Chi and David Scott