Some intuition on Feature Scaling, Feature Selection and Feature Transformation




December 31, 2016

Notes on feature engineering in Machine Learning.

Feature Scaling

Feature scaling is a method used to standardize the range of independent variables or features of data. In data processing, it is also known as data normalization and is generally performed during the data pre-processing step.

The simplest method is rescaling the range of features to scale the range in [0, 1] or [−1, 1]. The general formula is given as:

X’ = (X – Xmin)/(Xmax – Xmin)

where, X’ is the value we want to rescale, X is the given value, Xmax is the largest value of X and Xmin the smallest.

Let us consider, old weights = [115,140,175] and we are going to scale for the value 140.

X’ = (140 – 115)/(175 – 115) = 0.41666

Therefore, the range is, [0,0.41666,1]

Feature Selection

Why do we have to perform feature selection?

There are two methods of Feature Selection :

Feature Transformation

Feature transformation is a group of methods that create new features (predictor variables). Feature selection is a subset of feature transformation.

Consider an ‘X’ space having ‘n’ features, using feature transformation we are going to transform X to have ‘m’ features, where m < n.

This is done by defining some matrix Px which is a subspace to which we are going to project ‘X’ space. The new features are combination of the old features.

There are many types explained below –

Principal Component Analysis (PCA)

A movie camera takes a 3-D information and flatten it to 2-D without too much loss of information.

What does all of this have to do with PCA?

PCA takes a dataset with a lot of dimensions and flatten it to two or three dimensions so we can look at it.

It tries to find a meaningful way to flatten the data by focusing on the things that are different between cells.

Here, the weights are termed Loadings. And array of loadings is called “Eigen Vector”.

PCA review :

Typical example of PCA is in eigenfaces.

PCA is a linear algebraic approach.

Independent Components Analysis (ICA)

It is a computational method for separating a multivariate signal into additive sub-components. This is done by assuming that the sub-components are non-Gaussian signals and that they are statistically independent from each other.

A common example application is the “cocktail party problem” of listening in on one person’s speech in a noisy room.

ICA is a probabilistic approach.

Random Component Analysis (RCA)

Uses random way to transform input features into principal components (PC)

Linear Discriminant Analysis (LDA)

Finds a projection that descriminates based on the label.

Fundamental assumption is different, although they do the same thing, which is to capture the original data in some new transform space that is somehow better.