Data Preprocessing: Crucial Steps in Your ML Projects!

Haidour Asmaa
9 min read · Jan 14, 2021

The first time I tried to train an ML model, I took the dataset as it was, wrote the instructions, and ran them in R. Of course, an error message was displayed, because I hadn't inspected the dataset properly.

That's why today I want to give you the first steps to follow before training a model.

But first of all, I will answer the question that always comes up: "Why data preprocessing in Machine Learning?" Simply because raw data is incomplete, inconsistent, inaccurate (it contains errors or outliers), and often lacks specific attribute values or trends. In this step, we clean, format, and organize the raw data, thereby making it ready to go for Machine Learning models.

In the following sections, I will list the steps to follow during pre-processing:

1. Get Your Dataset:

First, you need to acquire your data. A dataset is composed of data gathered from multiple, disparate sources, which are then combined into a proper format.

You can find thousands of datasets in different sources, but I highly recommend Kaggle, https://www.kaggle.com (which I will write about next time). Why Kaggle? Because I like the way it presents datasets:

On Kaggle, you can find a description of each dataset; I take MovieLens as an example.

You will also find the different CSV files that compose the dataset:

And details of each column, which help you know the number of null values in a feature column along with other important statistical measurements (mean, min, max…).

You can also create a dataset yourself by collecting data via different Python APIs. Once the dataset is ready, save it in a CSV, HTML, or XLSX file format.
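For instance, here is a minimal sketch with pandas (the values, columns, and file name are made up for illustration):

```python
import pandas as pd

# A tiny illustrative table mirroring the startup dataset used later in this article.
data = pd.DataFrame({
    "R&D Spend": [100000.0, 120000.0],
    "State": ["New York", "California"],
    "Profit": [150000.0, 180000.0],
})
data.to_csv("my_dataset.csv", index=False)  # save it as a CSV file
```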

2. Import all the crucial libraries:

You will generally need four of the most popular Python libraries, which are as follows:

  • NumPy — NumPy is the fundamental library for data science in Python. It's written in optimized C and offers efficient functions for working with arrays and matrices.
  • Pandas — Pandas is a major library for the preprocessing of data frames in Python. It offers Excel-like table structures that allow the efficient handling of datasets.
  • Matplotlib — Matplotlib is widely used for visualization in Python. It offers an easy way to plot bar charts, pie charts, histograms, and more.
  • Scikit-learn — Scikit-learn provides the preprocessing tools (imputers, encoders, scalers) and the train/test splitting function used throughout this article.

You can use Jupyter, Spyder, or Google Colab as you like, but first you need to install all the dependencies and then import them:
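A typical import cell looks like this (the aliases np, pd, and plt are the usual conventions):

```python
# Install once if needed:
#   pip install numpy pandas matplotlib scikit-learn
import numpy as np               # efficient arrays and matrices
import pandas as pd              # data frames
import matplotlib.pyplot as plt  # plotting
```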

3. Import your dataset:

In this step, you need to import the dataset(s) that you have gathered for the ML project.

However, before you can import the dataset(s), you must set the directory that contains them as the working directory.

You can import the dataset using the “read_csv()” function of the Pandas library.
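A minimal sketch (the directory path and file name here are placeholders):

```python
import os
import pandas as pd

os.chdir("/path/to/your/project")         # placeholder: the folder that holds the CSV
dataset = pd.read_csv("50_Startups.csv")  # assumed file name for the startup dataset below
```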

And here is your data frame: calling dataset.head() displays the first rows so you can check that everything loaded correctly.

As you already know, in ML some algorithms require an independent variable X (the matrix of features) and a dependent variable Y. To distinguish between X and Y, you need to understand your problem and decide what X and Y are for you. In the general case, the independent variables are features such as age, country, or state, while the dependent variable Y is one whose value depends on those features, such as purchased, profit, or salary.

My dataset is the following one: it contains the profit of a startup according to the spending of each department in each region:

You can extract X and Y from your dataset using the iloc[] indexer of the Pandas library:

iloc[:, :-1]: the first ":" indicates that we want to keep all rows, and ":-1" drops the last column, so X contains 4 columns: R&D Spend, Marketing Spend, Administration, and State.

iloc[:, 4]: the first ":" indicates that we want to keep all rows, and "4" selects the column at index 4, the last one, which is the profit Y.
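Putting the two together (assuming the data frame is called dataset, as above):

```python
X = dataset.iloc[:, :-1].values  # all rows, every column except the last -> features
y = dataset.iloc[:, 4].values    # all rows, column index 4 (the last one) -> Profit
```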

4. Identifying and handling the missing values:

Firstly, understand that there is no perfect way to deal with missing data.

Check for missing data, understand which features and observations have missing values, and then decide how to handle them.

Basically, there are two ways to handle missing data:

  • Deleting a particular row:

In this method, you remove a specific row that has a null value for a feature, or a particular column where more than 75% of the values are missing. However, this method is not 100% efficient, and it is recommended only when the dataset has enough samples. You must also ensure that deleting the data does not introduce bias.

  • Calculating the mean:

This method is useful for features with numeric data such as age, salary, or year. Here, you calculate the mean, median, or mode of the feature that contains a missing value and replace the missing value with the result. This method can add variance to the dataset, but it avoids the data loss that comes with deletion. Hence, it generally yields better results than the first method (omission of rows/columns).

I am going to illustrate an example of handling null values with the second method, using scikit-learn. Here is a sketch with SimpleImputer (the column indices match where the missing values sit in my dataset):
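```python
import numpy as np
from sklearn.impute import SimpleImputer

# Replace each NaN with the mean of its column.
imputer = SimpleImputer(missing_values=np.nan, strategy="mean")
# Columns at index 1 and 2 are the numeric ones with missing values here.
X[:, 1:3] = imputer.fit_transform(X[:, 1:3])
```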

In this case, the null values were in columns 2 and 3, and each "NaN" in the table is replaced with the mean of its column.

5. Encoding the categorical data:

Categorical data refers to information that has specific categories within the dataset. In the dataset cited above, there is one categorical variable, "State".

As you know, ML models are based purely on mathematics, more precisely probability, statistics, and linear algebra. Leaving categories as text will cause problems, so we need to replace these categories with numbers.

There are many ways to convert categorical values into numerical ones, and each approach has its own trade-offs and impact on the feature set. I will focus on two main methods: One-Hot Encoding and Label Encoding. Both encoders are part of the scikit-learn library and are used to convert text or categorical data into the numerical form that models expect and perform better with.

LabelEncoder converts each value in a column to a number, but this introduces a new problem: the numbers imply an order. Taking New York, Florida, and California as an example, assigning 0, 1, and 2 to each state (New York = 0, Florida = 1, California = 2) will confuse the model, because it will understand that California > Florida and Florida > New York. Label encoding should therefore be used when the categories genuinely have an order, for example: large, medium, small. See the sketch below.
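A quick sketch of LabelEncoder on such an ordinal feature (the size values are made up for illustration):

```python
from sklearn.preprocessing import LabelEncoder

sizes = ["small", "medium", "large", "small"]
encoder = LabelEncoder()
print(encoder.fit_transform(sizes))
# [2 1 0 2] -- classes are sorted alphabetically (large=0, medium=1, small=2),
# so always check the mapping in encoder.classes_
```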

Though label encoding is straightforward, it has the disadvantage that the numeric values can be misinterpreted by algorithms as carrying some sort of hierarchy or order. This ordering issue is resolved in another common approach called One-Hot Encoding (or dummy encoding). In this strategy, each category value is converted into a new column and assigned a 1 or 0 to indicate the presence or absence of that category: the value 1 marks the rows belonging to the category, while the columns of all other categories get 0.

In our example, we need to replace "State" with numerical values using OneHotEncoder and ColumnTransformer from scikit-learn. A sketch of the input code (assuming "State" sits at index 3 of X) is as follows:
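```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# One-hot encode the "State" column (index 3 in X) and keep the other columns.
ct = ColumnTransformer(
    transformers=[("encoder", OneHotEncoder(), [3])],
    remainder="passthrough",
)
X = np.array(ct.fit_transform(X))
```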

In the output, the single "State" column is replaced by three binary 0/1 columns, one per state, placed in front of the remaining features.

6. Splitting the dataset:

To train any machine learning model, irrespective of the dataset being used, you have to split the dataset into training data and testing data.

Usually, the dataset is split in a 70:30 or 80:20 ratio. This means that you take 70% or 80% of the data for training the model and leave out the remaining 30% or 20% for testing.

We use the train_test_split() function to split the arrays of the dataset into random train and test subsets. We pass it four arguments: the first two are the data arrays (X and y), the test_size parameter specifies the size of the test set, and the last parameter, random_state, sets the seed of the random generator so that the output is always the same.

To split the dataset, you would write something like the following lines of code:
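```python
from sklearn.model_selection import train_test_split

# 80:20 split; random_state fixes the seed so the split is reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)
```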

7. Feature Scaling:

This is the last step of your pre-processing. After it, you will have your data ready to use.

Most of the time, your dataset will contain features that vary widely in magnitude, units, and range. Since many machine learning algorithms use the Euclidean distance between two data points in their computations, this is a problem: if one feature has a much bigger scale than the others, it dominates the distance computation and needs to be normalized.

You can perform feature scaling in Machine Learning in three ways:

  • Standardisation: x' = (x − mean) / standard deviation
  • Min-Max Scaling: x' = (x − min) / (max − min)
  • Mean Normalization: x' = (x − mean) / (max − min)

For our dataset, we will use the standardisation method. To do so, we import the StandardScaler class of the scikit-learn library using the following line of code:
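```python
from sklearn.preprocessing import StandardScaler
```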

And after that, we create an object of the StandardScaler class for the feature matrix X, fit it on the training set, and use the same fitted scaler to transform the test set (fitting only on the training data keeps information from the test set from leaking in):
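```python
sc = StandardScaler()
X_train = sc.fit_transform(X_train)  # fit on the training set, then scale it
X_test = sc.transform(X_test)        # reuse the training statistics on the test set
```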

And we will have all the variables on a common scale, centred around 0 with unit standard deviation.

Now you can start training your model without any issues or errors, because your data is ready to be explored.

Whether you apply all 7 steps depends on your dataset. I highly recommend taking a look at your dataset, analysing the different variables, and visualizing them to get an idea before moving into action.

I will keep incorporating new data science concepts as I master them. This learning journey is worth sharing and collaborating on, so share your ideas and let's make this article the best starting point for every beginner…

If you're curious to learn more, you can find the twelve most popular Python libraries here: https://www.upgrad.com/blog/python-libraries-for-data-science/

You can find a preprocessing template on my GitHub:
