Hitaya_OneAPI: Machine Learning Model Training Using Intel oneAPI

Jayita Bhattacharyya
6 min read · Mar 27, 2023

In the previous article, we gave an overview of our solution for healthcare in underserved communities. This article continues that series and walks through how we make our predictions using machine learning.

As an example, let us take one of the diseases we cover. We demonstrate the steps we used to detect diabetes, a condition marked by elevated blood sugar levels. We follow classical machine-learning techniques for tabular data and then optimize them using Intel’s oneAPI AI Analytics Toolkit and libraries, along with Intel DevCloud, which gives developers access to a wide range of ready-to-use tools.

Data Collection & Preparation

Any machine learning use case starts with data gathering. We use open-source, readily available data from Kaggle datasets. This particular Diabetes Dataset was collected at a hospital in Frankfurt, Germany.

Data Dictionary — The dataset consists of several medical predictor variables and one target variable, Outcome. This is a classification problem. The parameters are the following:

Pregnancies - Number of times pregnant

Glucose - Plasma glucose concentration at 2 hours in an oral glucose tolerance test

BloodPressure - Diastolic blood pressure (mm Hg)

SkinThickness - Triceps skin fold thickness (mm)

Insulin - 2-Hour serum insulin (mu U/ml)

BMI - Body mass index (weight in kg/(height in m)^2)

DiabetesPedigreeFunction - Diabetes pedigree function

Age - Age (years)

Outcome - Class variable (0 or 1)

Now let us proceed with some Exploratory Data Analysis. Here we use Modin (shipped in Intel’s AI Analytics Toolkit as the Intel Distribution of Modin), which aims to give faster results than traditional pandas data frames, especially on larger datasets. Modin exposes the same API as pandas, so only the import line changes. Along with that, we import other commonly used data science libraries: numpy for n-dimensional array calculations, and matplotlib and seaborn for visualization.

Installation

pip install modin
# importing libraries 
import modin.pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

It’s time to use another of Intel’s performance-enhancing products: scikit-learn-intelex (Intel Extension for Scikit-learn). It takes only two lines of code and requires no changes to existing code. The following snippet patches scikit-learn so that supported algorithms run on Intel’s accelerated implementations. The patch is applied first, and after that the sklearn library is imported as usual.

Patching

from sklearnex import patch_sklearn
patch_sklearn()
# importing ML libraries
from sklearn.metrics import (confusion_matrix, classification_report,
                             accuracy_score)
from sklearn.ensemble import RandomForestClassifier
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
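
Because the patch has to run before scikit-learn is imported, it can help to guard it so the notebook still runs on machines where scikit-learn-intelex is not installed. A minimal sketch, assuming a hypothetical helper name `enable_intel_acceleration` (it is not part of either library):

```python
def enable_intel_acceleration():
    """Patch scikit-learn if scikit-learn-intelex is installed.

    Returns True when the patch was applied, and False when the extension
    is missing and stock scikit-learn will be used instead.
    """
    try:
        from sklearnex import patch_sklearn
    except ImportError:
        return False
    patch_sklearn()
    return True


# Call this before any `from sklearn... import ...` line.
accelerated = enable_intel_acceleration()
print("Intel acceleration enabled:", accelerated)
```

With this guard the rest of the notebook stays unchanged; only execution time differs between the two cases.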

For EDA, we first read the data and try to get some insights out of it. We look at the types of data it holds (strings, floats, integers, dates, etc.), the basic statistics, correlated parameters, null values, and other checks.

# Reading data (the %time line magic times the statement on the same line)
%time df = pd.read_csv("../input/diabetes/diabetes.csv")
CPU times: user 7.6 ms, sys: 347 µs, total: 9.3 ms
Wall time: 4.37 ms
df.head()

The dataset holds 2000 rows and 9 columns.

# determining dataset size
>>> df.shape
(2000, 9)
# determining types of data
df.info()
# basic statistics
df.describe()

Now let’s look at the distribution of the target values. This is a binary classification problem in which 1 represents a person with diabetes and 0 a person without diabetes.

# checking dataset is balanced or not
>>> diabetes_true_count = len(df.loc[df['Outcome'] == 1])
>>> diabetes_false_count = len(df.loc[df['Outcome'] == 0])
>>> (diabetes_true_count,diabetes_false_count)
(684, 1316)
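
The counts are easier to interpret as proportions. The short sketch below reuses the two numbers printed above (typed in by hand) to compute the share of positive cases and the accuracy that a trivial model would reach by always predicting the majority class, which is a useful baseline for the scores reported later.

```python
# Class counts reported above: 684 positive (1) and 1316 negative (0).
class_counts = {1: 684, 0: 1316}

total = sum(class_counts.values())
positive_share = class_counts[1] / total

# Accuracy of a trivial model that always predicts the majority class.
majority_baseline = max(class_counts.values()) / total

print(f"Positive share: {positive_share:.1%}")
print(f"Majority-class baseline accuracy: {majority_baseline:.1%}")
```

So roughly a third of the rows are positive: the dataset is moderately imbalanced, and any useful model should clearly beat the 65.8% baseline.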

Data Visualization

Here’s a visual representation of the same.

# plotting graph for output classes counts
sns.countplot(x = 'Outcome',data = df)

The next step is one of the most important in ML problems: checking for null or missing values. If any parameter is found to have null values, there are several techniques to impute them for better model performance.

# checking for missing values
df.isnull().sum()
Pregnancies                 0
Glucose                     0
BloodPressure               0
SkinThickness               0
Insulin                     0
BMI                         0
DiabetesPedigreeFunction    0
Age                         0
Outcome                     0
dtype: int64

No null values are present in the data. However, there could be 0 values standing in for missing measurements, and these tend to increase bias if they occur in large numbers.

# Checking if data has 0 values present
print("Pregnancies: {0}".format(len(df.loc[df['Pregnancies'] == 0])))
print("Glucose: {0}".format(len(df.loc[df['Glucose'] == 0])))
print("bp: {0}".format(len(df.loc[df['BloodPressure'] == 0])))
print("SkinThickness: {0}".format(len(df.loc[df['SkinThickness'] == 0])))
print("Insulin: {0}".format(len(df.loc[df['Insulin'] == 0])))
print("BMI: {0}".format(len(df.loc[df['BMI'] == 0])))
print("DiabetesPedigreeFunction: {0}".format(len(
df.loc[df['DiabetesPedigreeFunction'] == 0])))
print("Age: {0}".format(len(df.loc[df['Age'] == 0])))
Pregnancies: 301
Glucose: 13
bp: 90
SkinThickness: 573
Insulin: 956
BMI: 28
DiabetesPedigreeFunction: 0
Age: 0
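
Raw counts are easier to judge as a share of the dataset. A small sketch that reuses the counts printed above (the dictionary is simply those numbers typed in by hand, together with the 2000-row total):

```python
# Zero counts printed above, out of 2000 rows in total.
n_rows = 2000
zero_counts = {
    "Pregnancies": 301,
    "Glucose": 13,
    "BloodPressure": 90,
    "SkinThickness": 573,
    "Insulin": 956,
    "BMI": 28,
    "DiabetesPedigreeFunction": 0,
    "Age": 0,
}

# Share of rows that are 0 in each column, largest first.
zero_share = {name: count / n_rows for name, count in zero_counts.items()}
for name, share in sorted(zero_share.items(), key=lambda item: item[1], reverse=True):
    print(f"{name:<26}{share:6.1%}")
```

Nearly half of the Insulin values are 0, so the way these zeros are handled has a real effect on that feature.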

Model Building

Clearly, some of the parameters contain a large number of 0 values. We shall impute these values in one go after we create the training features.

# Preparing the data for model building
feature_columns = ['Pregnancies', 'Glucose', 'BloodPressure',
'SkinThickness', 'Insulin', 'BMI',
'DiabetesPedigreeFunction', 'Age']
predicted_class = ['Outcome']

We split the data into 70% training data and 30% test data.

# Splitting dataset into train & test set
X = df[feature_columns]
y = df[predicted_class]
X_train, X_test, y_train, y_test = train_test_split(X, y, 
test_size = 0.30, random_state=42)
X_train.shape, X_test.shape, y_train.shape, y_test.shape
((1400, 8), (600, 8), (1400, 1), (600, 1))
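
To make the 70/30 arithmetic concrete, here is a minimal pure-Python sketch of a seeded hold-out split over row indices. The helper `split_indices` is hypothetical and is not how `train_test_split` is implemented; the rows it selects will differ from scikit-learn's, but the sizes match the shapes shown above.

```python
import math
import random


def split_indices(n_rows, test_size=0.30, seed=42):
    """Return (train_indices, test_indices) for a shuffled hold-out split."""
    indices = list(range(n_rows))
    random.Random(seed).shuffle(indices)     # seeded, so the split is repeatable
    n_test = math.ceil(n_rows * test_size)   # 30% of the rows go to the test set
    return indices[n_test:], indices[:n_test]


train_idx, test_idx = split_indices(2000)
print(len(train_idx), len(test_idx))  # prints: 1400 600
```

Fixing the seed plays the same role as `random_state=42` in the call above: rerunning the notebook yields the same split.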

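As discussed earlier, we’ll replace the 0 values with the mean of the respective parameter. Note that this treats every 0 as missing, including legitimate zeros such as a Pregnancies count of 0.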

# Filling in the 0 values present with the mean of that particular property
fill_values = SimpleImputer(missing_values=0, strategy="mean")
# Learn the column means on the training set only, then reuse them on the test set
X_train = fill_values.fit_transform(X_train)
X_test = fill_values.transform(X_test)
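
Imputation statistics are normally learned on the training split and then reused on the test split, so that the test data does not influence them. A minimal pure-Python sketch of that idea, with made-up toy rows and hypothetical helper names (`fit_nonzero_means`, `impute_zeros`); it illustrates the principle and is not how `SimpleImputer` works internally:

```python
def fit_nonzero_means(rows):
    """Column means computed from the non-zero entries only."""
    n_cols = len(rows[0])
    means = []
    for col in range(n_cols):
        values = [row[col] for row in rows if row[col] != 0]
        # Assumption for this sketch: fall back to 0.0 if a column is all zeros.
        means.append(sum(values) / len(values) if values else 0.0)
    return means


def impute_zeros(rows, means):
    """Replace each 0 with the stored mean of its column."""
    return [[means[c] if v == 0 else v for c, v in enumerate(row)] for row in rows]


# Toy data (made up for illustration) with two feature columns.
train_rows = [[0, 10], [4, 0], [2, 30]]
test_rows = [[0, 0], [5, 7]]

train_means = fit_nonzero_means(train_rows)         # learned on training data only
train_filled = impute_zeros(train_rows, train_means)
test_filled = impute_zeros(test_rows, train_means)  # reuses the training means

print(train_means)   # [3.0, 20.0]
print(test_filled)   # [[3.0, 20.0], [5, 7]]
```

In scikit-learn terms this corresponds to calling `fit_transform` on the training features and `transform` on the test features.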

It is now time to fit the machine learning algorithm to our training data. After trying classifiers such as Logistic Regression, SVM, and Decision Trees, we found that Random Forest gave good results.

Random Forest Classifier is an ensemble technique: it builds a number of decision tree classifiers on subsamples of the dataset and aggregates their predictions to give the final result, which improves accuracy over a single tree.

# Fitting the training data into RandomForest Classifier
random_forest_model = RandomForestClassifier(random_state=10)
model = random_forest_model.fit(X_train, y_train)
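
The two ingredients described above, subsampling and aggregation, can be sketched in a few lines of pure Python. The helper names are hypothetical and the tree training itself is left out. Note also that scikit-learn's `RandomForestClassifier` averages the trees' predicted probabilities rather than counting hard votes, so this is an illustration of the idea and not the library's exact procedure.

```python
import random
from collections import Counter


def bootstrap_sample(rows, rng):
    """Draw len(rows) rows with replacement, as one tree's training sample."""
    return [rng.choice(rows) for _ in rows]


def majority_vote(tree_predictions):
    """Combine the class predicted by each tree into one final class."""
    return Counter(tree_predictions).most_common(1)[0][0]


rng = random.Random(10)
rows = list(range(8))                  # stand-in for 8 training rows
sample = bootstrap_sample(rows, rng)   # rows may repeat, others may be left out
print(sample)

# Three hypothetical trees vote on one patient: two say 1, one says 0.
print(majority_vote([1, 0, 1]))  # prints: 1
```

Because each tree sees a slightly different sample, their individual errors tend to differ, and aggregating them smooths those errors out.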

Prediction

Let us now see how our model performs by making predictions on the held-out test set. Along with that, we also compute the accuracy score.

# Predicting on the test set & computing the accuracy achieved
predict_train_data = model.predict(X_test)
print("Accuracy using Intel Extension for Sklearn = {0:.3f}".format(
    accuracy_score(y_test, predict_train_data)))
Accuracy using Intel Extension for Sklearn = 0.968
CPU times: user 2 µs, sys: 1 µs, total: 3 µs
Wall time: 5.72 µs

Evaluation Metrics

As the score shows, the model performs very well on the test set, with training accelerated by the Intel oneAPI AI Analytics Toolkit and libraries. We also report the other available metrics: precision, recall, F1-score and the confusion matrix.

print(classification_report(y_test, predict_train_data))
              precision    recall  f1-score   support

           0       0.97      0.98      0.98       398
           1       0.96      0.95      0.95       202

    accuracy                           0.97       600
   macro avg       0.97      0.96      0.96       600
weighted avg       0.97      0.97      0.97       600
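
For readers less familiar with these metrics, the sketch below shows how precision, recall, F1-score and accuracy follow from the four cells of a binary confusion matrix. The counts are hypothetical, chosen only to illustrate the formulas; they are not the confusion matrix of the model above, and `binary_metrics` is an illustrative helper.

```python
def binary_metrics(tp, fp, fn, tn):
    """Precision, recall, F1 and accuracy for the positive class.

    Assumes at least one predicted positive and one actual positive,
    so none of the denominators is zero.
    """
    precision = tp / (tp + fp)            # of predicted positives, how many are right
    recall = tp / (tp + fn)               # of actual positives, how many are found
    f1 = 2 * precision * recall / (precision + recall)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return {"precision": precision, "recall": recall, "f1": f1, "accuracy": accuracy}


# Hypothetical counts, chosen only to illustrate the formulas.
scores = binary_metrics(tp=8, fp=2, fn=2, tn=88)
for name, value in scores.items():
    print(f"{name}: {value:.2f}")
```

In a screening setting, recall for class 1 is usually the number to watch, since a false negative means a diabetic patient goes undetected.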

Unpatching

from sklearnex import unpatch_sklearn
unpatch_sklearn()

Performance Measure
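
One way to measure the effect of patching is to time the same call with and without the extension. A minimal sketch, assuming a hypothetical helper `best_time`; the workload here is a stand-in so the example runs on its own, and no timings from the original experiments are implied.

```python
import time


def best_time(fn, *args, repeats=3, **kwargs):
    """Run fn(*args, **kwargs) `repeats` times; return the fastest wall time in seconds."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn(*args, **kwargs)
        best = min(best, time.perf_counter() - start)
    return best


# Stand-in workload so the sketch is self-contained. In the notebook this would
# be something like best_time(random_forest_model.fit, X_train, y_train),
# measured once with the patch applied and once after unpatching.
elapsed = best_time(sorted, list(range(100_000, 0, -1)))
print(f"Best of 3: {elapsed * 1000:.2f} ms")
```

Taking the best of several runs reduces noise from other processes. When comparing, keep in mind that scikit-learn estimators need to be re-imported after patching or unpatching for the change to take effect.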

Conclusion

Once the model is ready, it is saved for deployment, and machine learning pipelines are created and connected to API endpoints. Using Intel’s AI libraries has been a boon for performance and execution times. A link to our previous article covering the overview of our problem statement and API structure is given below. Attached is another article on breast cancer detection.

