Feature engineering basic pipeline

Henrique Peixoto Machado
4 min read · Aug 24, 2020

If you are in the data world, the sentence you will hear the most is that 90% of the work is data and 10% is the model. Yet when you start looking at tutorials on the internet, 99% of them are about models. I have seen several data scientists who knew how to build a great model but were total newbies when they needed to check whether there were missing values in the columns. So this post will be about the most important part of data science: the data itself.

1- When exploring a new dataset, the first thing you should do is check whether the values are correctly filled in and whether their types are correct. I do this with the code below.

To check all the values in a column and count how many times each one repeats:

df['Column name'].value_counts()

In case there are some incorrect values, you can use replace() to correct them:

df.replace(to_replace="How it is", value="How I want it to be")

This is also important for checking whether your data is balanced: for a model to work properly you want roughly as many positive as negative examples, which can be very difficult depending on what you are analysing, so check step 7 if your data is not balanced (there is also a quick balance check in the sketch at the end of this step).

And if you just want to delete certain values, you can replace them with an empty string "" and they will effectively be removed.

If the column has too many values, you can add .head(10) to the end of value_counts() to show just the first 10.
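Putting step 1 together, here is a minimal sketch; the column names 'Status' and 'TARGET' and the corrected value are just hypothetical examples:

# Count how often each value appears, showing only the 10 most frequent
df['Status'].value_counts().head(10)

# Fix a value that was typed incorrectly (replace returns a new DataFrame)
df = df.replace(to_replace='aproved', value='approved')

# Quick balance check: share of each class in the target column
df['TARGET'].value_counts(normalize=True)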

2- To check if there are any missing values:

df.isnull().sum()

If I see that the rows with missing values are not that relevant, I normally just drop them with dropna:

df.dropna()
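Note that dropna() returns a new DataFrame rather than changing df in place, so assign the result back. A minimal sketch, assuming a hypothetical column 'important_column' where you only want to drop the rows missing that value:

# Keep only the rows where 'important_column' is filled in
df = df.dropna(subset=['important_column'])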

3- Another very important part of data engineering is checking the data types of the columns, because wrong types can cause all kinds of errors later:

df.info()

If there is a column that, let's say, is all numbers but is stored as object or string, you can change it with astype():

df['column name'] = df['column name'].astype('int64')

The most common types used are:

String = text

Float = number that is not an integer. Ex: 1.25

Int/ int64 = Integer number

datetime = for dates
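To round off this step, a small sketch with hypothetical column names; note that for dates pandas has its own conversion helper, pd.to_datetime:

import pandas as pd

# Numeric column stored as text -> integer
df['age'] = df['age'].astype('int64')

# Text column with dates -> datetime
df['created_at'] = pd.to_datetime(df['created_at'])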

4- If the data is mostly numerical, it is always good to use the describe() function:

df.describe()

It is also good to check the correlation between the values, but for correlation I always prefer to see it as a graph. Normally I use this code:

import seaborn as sns
import matplotlib.pyplot as plt

# Correlation matrix of the dataframe
ext_data_corrs = df.corr()

plt.figure(figsize=(8, 6))

# Heatmap of correlations
sns.heatmap(ext_data_corrs, cmap=plt.cm.RdYlBu_r, vmin=-0.25, annot=True, vmax=0.6)
plt.title('Correlation Heatmap');

If you are dealing with historical data, it is always good to plot it as a graph and check visually whether there are outliers or seasonality:

import seaborn as sns
import matplotlib.pyplot as plt

# Plot the values over time and label the plot
sns.lineplot(data=df, x='datetime', y='Values')
plt.xlabel('datetime'); plt.ylabel('Values'); plt.title('My title')
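A common extra check here (not from the snippet above, just a standard pandas pattern) is to plot a rolling mean on top of the raw values, which makes trend and seasonality much easier to spot; the window of 30 periods is only an example:

# Sort by time and add a smoothed version of the series
df = df.sort_values('datetime')
df['Values_smooth'] = df['Values'].rolling(window=30).mean()

plt.plot(df['datetime'], df['Values'], alpha=0.4, label='raw')
plt.plot(df['datetime'], df['Values_smooth'], label='30-period rolling mean')
plt.legend(); plt.title('Raw values vs rolling mean')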

5- If you plan to build a machine learning model, you will need to deal with categorical data. There are a lot of ways to do this, and there is one post I really like about it here, but if you don't want to go deep into that, here is how I usually deal with it:

df['column name'] = pd.get_dummies(df['column name'], drop_first=True)
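The one-liner above only works as written when the column has just two categories, since drop_first=True then leaves a single 0/1 column. For columns with more categories, a common pattern (sketched here with a hypothetical 'city' column) is to let get_dummies expand the whole DataFrame:

# One new column per category (minus the first), joined back into the dataframe
df = pd.get_dummies(df, columns=['city'], drop_first=True)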

6- And last, many ML models deal better with data that has been normalized. There are many ways to do that, but I always end up using the sklearn library because it is very convenient, and with train_test_split it also splits your data into train and test sets:

from sklearn.preprocessing import MinMaxScaler
from sklearn.impute import SimpleImputer  # the old Imputer was removed from sklearn.preprocessing
from sklearn.model_selection import train_test_split

# Drop the target from the features and split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(df.drop('TARGET', axis=1),
                                                    df['TARGET'], test_size=0.20)

If you are doing a time series analysis, I would recommend passing shuffle=False in the code above. I made a post explaining my process for time series here; for more details go there :)
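For reference, that variant of the split would look like this (same columns as above, just with shuffling turned off so the time order is preserved):

X_train, X_test, y_train, y_test = train_test_split(df.drop('TARGET', axis=1),
                                                    df['TARGET'], test_size=0.20,
                                                    shuffle=False)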

# Feature names
features = list(X_train.columns)

# Copy of the testing data
test = X_test.copy()

# Median imputation of missing values
imputer = SimpleImputer(strategy='median')

# Scale each feature to 0-1
scaler = MinMaxScaler(feature_range=(0, 1))

# Fit on the training data
imputer.fit(X_train)

# Transform both training and testing data
X_train = imputer.transform(X_train)
X_test = imputer.transform(X_test)

# Repeat with the scaler
scaler.fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)

print('Training data shape: ', X_train.shape)
print('Testing data shape: ', X_test.shape)

7- If your data is not balanced, you have two options. You can cut down the examples of the class that has more cases, so the training data is close to 50/50; this is called undersampling. Or you can repeat the examples of the class with few cases until the training data is also close to 50/50; this is called oversampling. The best way to do this is to remove or repeat random examples. Here is how you do it with imblearn:

# Apply the random under-sampling
from imblearn.under_sampling import RandomUnderSampler

rus = RandomUnderSampler()
X_rus, y_rus = rus.fit_resample(X_train, y_train)

# Apply the random over-sampling
from imblearn.over_sampling import RandomOverSampler

ros = RandomOverSampler()
X_ros, y_ros = ros.fit_resample(X_train, y_train)
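To confirm the resampling worked, it is easy to compare the class counts before and after; a quick sketch using collections.Counter (the variable names match the snippet above):

from collections import Counter

print('Original:', Counter(y_train))
print('Undersampled:', Counter(y_rus))
print('Oversampled:', Counter(y_ros))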

8- One last piece of advice: most of the time when cleaning the data, your best friend will be the business person who got you the data. If you look at the winning solutions on Kaggle today, you will see that the winners really got to know the business rules of the competition. So never be afraid to reach out to the business area for help and clarification when doing feature engineering.

If you follow this pipeline, I guarantee that your analysis will improve a lot. There is always more you can do when we talk about data, but these steps will be useful with all kinds of data. There is also one thing that always comes in handy when cleaning data, which is regex, but because of its importance I'll be making a whole post just to talk about it. And if you liked this kind of data post, please follow me! ❤
