Mar 22, 2024
Tackling Imbalanced Datasets with SMOTE (Synthetic Minority Over-sampling Technique)
Introduction
In the world of machine learning, we often encounter datasets where one class has significantly more instances than the other. These are known as imbalanced datasets, and they can pose a challenge for many machine learning algorithms. In this article, we’ll explore a technique called Synthetic Minority Over-sampling Technique (SMOTE) that can help us deal with such datasets.
What is SMOTE?
Before we begin, it's important to understand what SMOTE is.
SMOTE stands for Synthetic Minority Over-sampling Technique. It's a statistical technique for increasing the number of cases in your dataset in a balanced way. Rather than duplicating existing records, it works by synthesizing new instances from the existing minority-class cases that you supply as input.
SMOTE does this by selecting two or more similar instances (using a distance measure) and perturbing an instance one attribute at a time by a random amount within the difference to the neighboring instances.
1. Finding the k-nearest neighbors for minority class observations (finding similar instances): For each observation in the minority class, SMOTE calculates the distance between that observation and every other observation of the same class. The k instances with the smallest distances are selected as the k-nearest neighbors.
2. Creating synthetic examples: Once the k-nearest neighbors are found, SMOTE creates synthetic examples to augment the data. For each minority class observation, a new instance is created between that observation and one of its k-nearest neighbors. This is done by choosing one of the neighbors at random and calculating the vector between the observation and the chosen neighbor. The vector is multiplied by a random number x (0 ≤ x ≤ 1) and added to the original observation to create the new, synthetic one.
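The interpolation described above can be sketched in a few lines of NumPy and scikit-learn. This is an illustrative toy implementation, not the article's actual code; the function name `smote_sample`, the choice of k=3, and the toy data are all assumptions for demonstration:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def smote_sample(X_minority, k=5, seed=0):
    """Create one synthetic sample per minority observation via k-NN interpolation."""
    rng = np.random.default_rng(seed)
    # +1 because each point is its own nearest neighbor
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_minority)
    _, idx = nn.kneighbors(X_minority)
    synthetic = []
    for i, neighbors in enumerate(idx):
        j = rng.choice(neighbors[1:])         # pick a neighbor at random (skip self)
        x = rng.random()                      # random factor in [0, 1)
        diff = X_minority[j] - X_minority[i]  # vector toward the chosen neighbor
        synthetic.append(X_minority[i] + x * diff)
    return np.array(synthetic)

# toy minority class: 10 points in 2-D
X_min = np.arange(20, dtype=float).reshape(10, 2)
new_points = smote_sample(X_min, k=3)
print(new_points.shape)  # one synthetic point per original observation
```

Because each synthetic point lies on the line segment between an observation and one of its neighbors, the new points always stay within the region already occupied by the minority class.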
This process effectively increases the number of minority class observations, helping to balance the class distribution. It’s important to note that while SMOTE helps balance the classes, it doesn’t guarantee improved performance. It’s crucial to use it in conjunction with other techniques for handling imbalanced datasets and to validate the results using appropriate metrics.
In the next sections, we’ll see how to apply SMOTE with a real-world dataset and evaluate its impact.
HealthCare Dataset
We’ll be using a Stroke Prediction Dataset from Kaggle for our tutorial. This dataset contains information about patients, including their age, gender, and BMI, among other health metrics, and whether or not they’ve had a stroke. The target variable is ‘stroke’, which is binary. The dataset is imbalanced, with far more instances of ‘no stroke’ than ‘stroke’, as we can see below:
[Figure: Distribution of Target Variable]
Let’s look at a sample of the data frame we’ll be working with:
[Figure: Sample df with the target variable ‘stroke’]
Preprocessing and Feature Engineering
Before we could apply SMOTE, we needed to preprocess our data and perform some feature engineering. We filled missing values in the ‘bmi’ column with the median, dropped duplicates, excluded the ‘Other’ category in the ‘gender’ column, and encoded the categorical features.
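A minimal pandas sketch of these preprocessing steps might look like the following. A tiny made-up frame stands in for the real Kaggle file here; the column names match the dataset, but the values and the use of `pd.get_dummies` for encoding are illustrative assumptions:

```python
import numpy as np
import pandas as pd

# toy frame mimicking the relevant columns of the stroke dataset;
# in the real tutorial this would come from pd.read_csv on the Kaggle file
df = pd.DataFrame({
    "gender": ["Male", "Female", "Other", "Female", "Female"],
    "age": [67, 61, 45, 61, 80],
    "bmi": [36.6, np.nan, 28.0, np.nan, 32.5],
    "smoking_status": ["formerly smoked", "never smoked", "smokes",
                       "never smoked", "never smoked"],
    "stroke": [1, 0, 0, 0, 1],
})

df["bmi"] = df["bmi"].fillna(df["bmi"].median())  # median imputation for missing BMI
df = df.drop_duplicates()                         # remove exact duplicate rows
df = df[df["gender"] != "Other"]                  # exclude the rare 'Other' category
df = pd.get_dummies(df, columns=["gender", "smoking_status"],
                    drop_first=True)              # one-hot encode categoricals
print(df.shape)
```

Encoding is required because SMOTE interpolates numerically between observations, so every feature it sees must be numeric.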
Applying SMOTE
With our data preprocessed and ready, we moved on to applying SMOTE. We used the imblearn library’s SMOTE function to oversample the minority class. After applying SMOTE, we visualized the distribution of the target variable again to confirm that the classes were now balanced.
[Figure: Distribution of Stroke before SMOTE]
[Figure: Distribution of Stroke after SMOTE]
Model Training and Evaluation
We then trained our model, an AdaBoostClassifier, on the dataset before and after applying SMOTE. We evaluated the model’s performance using a classification report, which includes precision, recall, and F1-score.
[Figure: Confusion Matrix before SMOTE]
Before SMOTE: The classification report shows that the model had a high precision (0.94) for the majority class (label 0), indicating that when the model predicted an instance to be of the majority class, it was correct 94% of the time. However, the precision for the minority class (label 1) was quite low (0.33), meaning that when the model predicted an instance to be of the minority class, it was correct only 33% of the time.
The recall for the majority class was perfect (1.00), meaning the model identified all instances of the majority class correctly. However, the recall for the minority class was very low (0.02), indicating that the model was able to correctly identify only 2% of the actual instances of the minority class.
The F1-score, which is the harmonic mean of precision and recall, was high for the majority class (0.97) but very low for the minority class (0.03). This suggests that the model was biased towards the majority class and performed poorly in identifying the minority class.
[Figure: Confusion Matrix after SMOTE]
After SMOTE: After applying SMOTE, there was a significant improvement in the recall for the minority class (from 0.02 to 0.58), meaning the model was now able to correctly identify 58% of the actual instances of the minority class. However, this came at the cost of precision, which decreased from 0.33 to 0.16. This means that when the model predicted an instance to be of the minority class, it was correct only 16% of the time.
The precision for the majority class remained high (0.97), but the recall decreased from 1.00 to 0.81. This means the model was now missing 19% of the actual instances of the majority class.
The F1-score for the minority class improved significantly (from 0.03 to 0.25), indicating a better balance between precision and recall for the minority class. However, the F1-score for the majority class decreased slightly (from 0.97 to 0.88).
Conclusion
In this article, we explored how to use SMOTE to handle imbalanced datasets. We walked through a step-by-step tutorial using a healthcare dataset from Kaggle, performed feature engineering, and evaluated our model’s performance before and after applying SMOTE.
While SMOTE improved the model’s ability to identify the minority class, it also increased the number of false positives (lower precision) for the minority class and false negatives (lower recall) for the majority class. This is a common trade-off in machine learning when dealing with imbalanced datasets. The choice between precision and recall often depends on the specific requirements of your application. For example, in a medical diagnosis scenario, a higher recall might be more important to ensure all positive cases are identified, even at the cost of some false positives. On the other hand, in a spam detection scenario, a higher precision might be more desirable to avoid misclassifying legitimate emails as spam.
SMOTE can be a powerful tool for dealing with imbalanced datasets, but it’s not a silver bullet. It’s important to use it in conjunction with other techniques and to validate the results using appropriate metrics. As always, understanding your data and the problem you’re trying to solve is key.



