“Feature engineering is another topic which doesn’t seem to merit any review papers or books, or even chapters in books, but it is absolutely vital to ML success. Much of the success of machine learning is actually success in engineering features.” — Scott Locklin, in “Neglected machine learning ideas”
When our goal is to get the best possible results from a model, we need to get the most from what we have. But how do we get the most out of our data for modeling? This is the problem that the process and practice of feature engineering solves.
Okay, what is feature engineering?
When we prepare a table for modeling, not all columns are useful in their raw form. In fact, some columns (or attributes) may be useless for model building; an ID-type attribute is one example.
Feature engineering, as a technique, has three subcategories: feature selection, dimension reduction, and feature generation.
Feature Selection:
This is the process of ranking the attributes by how much they contribute to the predictive ability of a model. Algorithms such as decision trees rank the attributes in the data set automatically as a side effect of training: the attributes chosen for the top few nodes of a tree are considered the most important features from a predictive standpoint. As part of a larger process, feature selection using entropy-based methods such as decision trees can be employed to filter out less valuable attributes before the reduced data set is fed to another modeling algorithm. Regression-type models usually employ methods such as forward selection or backward elimination to select the final set of attributes for a model.
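To make the entropy-based ranking concrete, here is a minimal sketch that scores each attribute by information gain, the kind of criterion a decision tree can use when choosing a split. It is a simplification and not a full tree: every attribute is scored once on a single-level split. The toy rows and the attribute names (region, claim_filed, renewed) are hypothetical and exist only for illustration.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy, in bits, of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(rows, attribute, target):
    """Drop in target entropy after splitting the rows on one attribute."""
    groups = {}
    for row in rows:
        groups.setdefault(row[attribute], []).append(row[target])
    # Weighted average entropy of the target inside each group.
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return entropy([row[target] for row in rows]) - remainder


# Hypothetical toy data set: one row per member.
rows = [
    {"region": "north", "claim_filed": "Y", "renewed": "N"},
    {"region": "north", "claim_filed": "Y", "renewed": "N"},
    {"region": "south", "claim_filed": "Y", "renewed": "N"},
    {"region": "south", "claim_filed": "Y", "renewed": "Y"},
    {"region": "north", "claim_filed": "N", "renewed": "Y"},
    {"region": "north", "claim_filed": "N", "renewed": "Y"},
    {"region": "south", "claim_filed": "N", "renewed": "Y"},
    {"region": "south", "claim_filed": "N", "renewed": "N"},
]

candidates = ["region", "claim_filed"]
gains = {name: information_gain(rows, name, "renewed") for name in candidates}

# Highest gain first: these are the attributes worth keeping.
ranked = sorted(candidates, key=gains.get, reverse=True)
for name in ranked:
    print(f"{name}: {gains[name]:.3f}")
```

In this toy data, claim_filed ranks above region because region tells us nothing about the response. Note that plain information gain favors attributes with many distinct values, so an ID-type column should be dropped before ranking.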
Dimension Reduction:
This is sometimes called feature extraction. The classic example of dimension reduction is principal component analysis, or PCA. PCA allows us to combine existing attributes into a new data table with a much smaller number of attributes by exploiting the variance in the data. Each principal component is a linear combination of the original attributes; the combinations that "explain" the highest amount of variance form the first few principal components, and we can discard the remaining components if data dimensionality is a problem from a computational standpoint. PCA results in a data table whose attributes do not look anything like the attributes of the raw data set.
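As a rough illustration of the idea, the sketch below runs PCA through a singular value decomposition in NumPy. The synthetic data (three attributes that are all noisy functions of one hidden variable) and the choice to keep a single component are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: 200 rows, 3 attributes driven by one hidden variable.
hidden = rng.normal(size=200)
X = np.column_stack([
    hidden,
    2 * hidden + rng.normal(scale=0.1, size=200),
    -hidden + rng.normal(scale=0.1, size=200),
])

# PCA works on centered data.
X_centered = X - X.mean(axis=0)

# The rows of Vt are the principal directions, ordered by variance explained.
U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
explained_variance_ratio = S**2 / np.sum(S**2)

# Keep only the first k components; the rest are discarded.
k = 1
X_reduced = X_centered @ Vt[:k].T

print(explained_variance_ratio.round(3))
print(X_reduced.shape)
```

The single column in X_reduced is a weighted mix of all three original attributes, which is why the reduced table no longer resembles the raw one.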
Feature Generation or Feature Construction:
This technique is the one most people are actually referring to when they talk about feature engineering. Quite simply, this is the process of manually constructing new attributes from raw data. It involves intelligently (that is, using domain knowledge) combining or splitting existing raw attributes into new ones which have higher predictive power. For example, a date stamp may be used to generate two new attributes, such as AM and PM, which may be useful in discriminating whether day or night has a higher propensity to influence the response variable. We may want to convert noisy numerical attributes into simpler nominal attributes by calculating the mean value and determining whether a given row is above or below that mean. We may generate a new attribute, such as the number of claims a member has filed in a given time period, by combining a date attribute with a nominal attribute such as claim_filed (Y/N). The possibilities are endless. Feature construction is essentially a data transformation process.
Here is a longer article on feature engineering, which provides some excellent links and further reading for those who are interested.
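The three constructions just described can be sketched with the Python standard library alone. Apart from claim_filed, every record, field name (member, timestamp, amount), and the time period used here is hypothetical.

```python
from collections import Counter
from datetime import datetime
from statistics import mean

# Hypothetical raw records, one per member interaction.
raw = [
    {"member": "A", "timestamp": "2023-01-05 09:30", "amount": 120.0, "claim_filed": "Y"},
    {"member": "A", "timestamp": "2023-01-20 22:15", "amount": 80.0, "claim_filed": "N"},
    {"member": "B", "timestamp": "2023-02-02 14:00", "amount": 300.0, "claim_filed": "Y"},
    {"member": "A", "timestamp": "2023-03-11 07:45", "amount": 50.0, "claim_filed": "Y"},
    {"member": "B", "timestamp": "2023-03-15 23:05", "amount": 150.0, "claim_filed": "N"},
]


def parse(stamp):
    return datetime.strptime(stamp, "%Y-%m-%d %H:%M")


# Mean of the noisy numerical attribute, used as the cut point.
amount_mean = mean(r["amount"] for r in raw)

# Assumed time period for counting claims (start inclusive, end exclusive).
period_start, period_end = datetime(2023, 1, 1), datetime(2023, 4, 1)
claims_in_period = Counter(
    r["member"]
    for r in raw
    if r["claim_filed"] == "Y" and period_start <= parse(r["timestamp"]) < period_end
)

features = []
for r in raw:
    hour = parse(r["timestamp"]).hour
    features.append({
        "member": r["member"],
        "is_am": hour < 12,                                  # from the date stamp
        "is_pm": hour >= 12,                                 # complement of is_am
        "amount_above_mean": r["amount"] > amount_mean,      # numeric to nominal
        "claims_in_period": claims_in_period[r["member"]],   # date + claim_filed
    })

for row in features:
    print(row)
```

Since is_am and is_pm are exact complements, one of them would normally be enough for a model; both are kept here only to mirror the AM and PM example.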