Dec 24, 2009

High-dimensional Data Processing

The three top-performing classes of algorithms for high-dimensional data sets are
  1. logistic regression,
  2. Random Forests and
  3. SVMs.
Although logistic regression can be inferior to non-linear algorithms such as kernel SVMs on low-dimensional data sets, it often performs equally well in high dimensions, once the number of features exceeds roughly 10,000, because most data sets become linearly separable when the number of features is very large.
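The separability claim can be checked on toy data. The sketch below is an illustration, not part of the original argument: it assigns random labels to 20 random points and uses a plain perceptron as the separability check. The helper names (perceptron_separates, random_problem), the sample sizes, and the Gaussian features are all assumptions chosen for the demo. With 200 features the 20 points can always be split by a hyperplane, whatever the labels; with 2 features, random labels are almost never separable.

```python
import random


def perceptron_separates(X, y, max_epochs=1000):
    """Return True if a perceptron reaches zero training errors.

    Labels must be -1 or +1. A full pass with no mistakes means the
    current weights define a separating hyperplane.
    """
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(max_epochs):
        mistakes = 0
        for xi, yi in zip(X, y):
            score = sum(wj * xj for wj, xj in zip(w, xi)) + b
            if yi * score <= 0:  # misclassified or on the boundary
                w = [wj + yi * xj for wj, xj in zip(w, xi)]
                b += yi
                mistakes += 1
        if mistakes == 0:
            return True
    return False


def random_problem(n_samples, n_features, rng):
    """Gaussian features with labels drawn at random (hypothetical toy data)."""
    X = [[rng.gauss(0.0, 1.0) for _ in range(n_features)]
         for _ in range(n_samples)]
    y = [rng.choice((-1, 1)) for _ in range(n_samples)]
    return X, y


rng = random.Random(0)
X_low, y_low = random_problem(20, 2, rng)      # 20 points, 2 features
X_high, y_high = random_problem(20, 200, rng)  # 20 points, 200 features

low_dim_separable = perceptron_separates(X_low, y_low)
high_dim_separable = perceptron_separates(X_high, y_high)
print("2 features, separable:", low_dim_separable)
print("200 features, separable:", high_dim_separable)
```

Real data sets have structured rather than random labels, so this only shows why a linear boundary becomes easy to find as features outnumber samples; it says nothing about how well that boundary generalizes.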

Because logistic regression is often faster to train than more complex models like Random Forests and SVMs, it is in many situations the preferable method for high-dimensional data sets.
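One way to see why training is cheap: a gradient step for logistic regression touches each feature of each example once, so the cost of a pass over the data grows in proportion to samples times features. The sketch below uses plain batch gradient descent, which is an assumption made for illustration (the post does not name a solver), and the function names, learning rate, and step count are hypothetical.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_logistic_regression(X, y, learning_rate=0.5, n_steps=200):
    """Fit weights and bias by batch gradient descent on the log loss.

    Labels must be 0 or 1. Each step costs O(n_samples * n_features).
    """
    n, d = len(X), len(X[0])
    w = [0.0] * d
    b = 0.0
    for _ in range(n_steps):
        grad_w = [0.0] * d
        grad_b = 0.0
        for xi, yi in zip(X, y):
            p = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b)
            err = p - yi  # derivative of the log loss w.r.t. the score
            for j in range(d):
                grad_w[j] += err * xi[j]
            grad_b += err
        w = [wj - learning_rate * gj / n for wj, gj in zip(w, grad_w)]
        b -= learning_rate * grad_b / n
    return w, b


def predict(w, b, x):
    """Return the predicted class (0 or 1) for one example."""
    score = sum(wj * xj for wj, xj in zip(w, x)) + b
    return 1 if sigmoid(score) >= 0.5 else 0


# Tiny one-feature example: negative inputs are class 0, positive are class 1.
X_toy = [[-2.0], [-1.0], [1.0], [2.0]]
y_toy = [0, 0, 1, 1]
w_toy, b_toy = train_logistic_regression(X_toy, y_toy)
predictions = [predict(w_toy, b_toy, x) for x in X_toy]
print(predictions)
```

In practice one would use a library solver with regularization rather than this loop; the point is only that the per-step work is a single linear pass over the data.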

1 comment:

Yi Wang said...

I guess that as the number of features gets larger and larger, it becomes very important to normalize the features. Otherwise, it is hard to get a linearly separable boundary. So a high dimension may not give us well-separated data, am I right?