Tutorial: Support Vector Machines
Support Vector Machines (SVMs) are one of the many tools used within the data science community. An SVM is a powerful supervised machine learning algorithm that can be used for both classification and regression problems, though it is most often used for classification. This tutorial shows how SVMs work with the scikit-learn library, provides code examples, and offers some insight into using grid search to find the best parameters for your SVM.
To start, we will import our dataset. For this tutorial we will be using the load_breast_cancer dataset from sklearn.datasets.
# import dataset and pandas
from sklearn.datasets import load_breast_cancer
import pandas as pd

# load the dataset (a sklearn Bunch object)
df = load_breast_cancer()

# Separate the target and predictor variables
X = pd.DataFrame(df.data, columns=df.feature_names)
y = pd.DataFrame(df.target)
Towards the bottom of the cell we split our data into two variables: X, our predictors, and y, our target. In this dataset our goal is to predict whether a tumor is malignant or benign.
For the purposes of this tutorial, we will skip over our EDA code. Based on my findings, this dataset has zero N/A values in both our X and y dataframes. We do have a small class imbalance between benign and malignant tumors, with 357 benign and 212 malignant. I don’t foresee this being a problem with our small dataset, but we can revisit it later if our model can’t produce satisfactory results.
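If you’d like to reproduce those findings yourself, a minimal sketch of the checks (using the X and y dataframes from above) could look like this:

# Count N/A values in our predictors and target
print(X.isna().sum().sum())   # 0
print(y.isna().sum().sum())   # 0

# Check the class balance (0 = malignant, 1 = benign)
print(y.value_counts())       # 357 benign, 212 malignant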
To start, we will build a baseline model to use as a starting point. From there we will create a pipeline and build a grid search to find our optimal parameters.
Our baseline model will be an SVC (Support Vector Classification) model from scikit-learn. We will start with a standard train/test split, move on to scaling our training/testing data, fit our baseline model, and see how it performs.
# Import needed libraries
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

# Perform train_test_split (default 75/25 split)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Instantiate StandardScaler() and fit/transform variables
ss = StandardScaler()
X_train_scaled = ss.fit_transform(X_train)
X_test_scaled = ss.transform(X_test)

# Instantiate baseline model
svc = SVC()
svc.fit(X_train_scaled, y_train)

# Find our training data score on baseline model
svc.score(X_train_scaled, y_train)
# Output: 0.9882629107981221

# Find our testing data score on baseline model
svc.score(X_test_scaled, y_test)
# Output: 0.982027972027972
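With our class imbalance in mind, accuracy alone can hide class-level mistakes. As an optional sanity check (not part of the original walkthrough), a quick sketch of a per-class classification report might look like this:

# Per-class precision/recall for the baseline model
from sklearn.metrics import classification_report

y_pred = svc.predict(X_test_scaled)
print(classification_report(y_test, y_pred,
                            target_names=['malignant', 'benign']))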
Now for the fun stuff! Model tuning! We will now start with building our pipeline.
# Import needed libraries
from sklearn.pipeline import Pipeline

# Instantiate pipeline and set SVC to random_state = 42
pipe = Pipeline([('scaler', StandardScaler()),
                 ('svc', SVC(random_state=42))])
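Before tuning, it can help to see that the pipeline behaves just like a regular model. As a quick sketch, we can fit and score it directly on the unscaled data, since the scaler step handles scaling internally:

# The pipeline scales then fits in one step, so we pass unscaled data
pipe.fit(X_train, y_train)
pipe.score(X_test, y_test)   # should match our baseline test score

This is also the big win of using a pipeline inside a grid search: the scaler is re-fit on each training fold only, so no information leaks from the validation folds.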
Now that we have our pipeline built, we can search over hyperparameter combinations using GridSearchCV from sklearn.model_selection. We’ll start by creating a parameter dictionary holding the values we want our grid search to iterate over. The svc__ prefix tells the grid search which pipeline step each parameter belongs to.
# Set parameter grid for our GridSearch
param_grid = {
    'svc__C': [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9],
    'svc__kernel': ('linear', 'poly', 'rbf', 'sigmoid'),
    'svc__degree': [1, 2, 3, 4, 5],
    'svc__gamma': ('scale', 'auto')}
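One thing worth checking before launching a grid search is how many fits it will run. A small sketch using sklearn’s ParameterGrid (not part of the original code) shows the size of our grid:

# Count the candidate combinations in our grid
from sklearn.model_selection import ParameterGrid

n_candidates = len(ParameterGrid(param_grid))
print(n_candidates)       # 9 * 4 * 5 * 2 = 360 combinations
print(n_candidates * 5)   # 1800 total fits with 5-fold CV

Note that degree is only used by the 'poly' kernel, so many of these combinations are effectively duplicates; the search will still run them all.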
Using our built pipeline and param_grid, we can feed these into the GridSearchCV function and find the best results for our model.
# Import needed libraries
from sklearn.model_selection import GridSearchCV

# Instantiate GridSearchCV and pass the pipeline and param_grid
gs = GridSearchCV(pipe, param_grid, n_jobs=-1,
                  cv=5, return_train_score=True)
As we do with all other sklearn models, we need to fit our training data to the gs model we just created.
# Fit model
gs.fit(X_train, y_train)
Once run, you should see the following output:
GridSearchCV(cv=5,
             estimator=Pipeline(steps=[('scaler', StandardScaler()),
                                       ('svc', SVC(random_state=42))]),
             n_jobs=-1,
             param_grid={'svc__C': [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8,
                                    0.9],
                         'svc__degree': [1, 2, 3, 4, 5],
                         'svc__gamma': ('scale', 'auto'),
                         'svc__kernel': ('linear', 'poly', 'rbf', 'sigmoid')},
             return_train_score=True)
Awesome! One last line of code and we will have the best parameters our grid search found for our model.
# Run the below code to find the best parameters
gs.best_params_

# Command output:
{'svc__C': 0.2,
 'svc__degree': 1,
 'svc__gamma': 'scale',
 'svc__kernel': 'linear'}

# We can also see our highest mean cross-validated score by running:
gs.best_score_

# Command output:
0.9788508891928865
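If you want more than just the winner, the fitted search object also keeps the full results. As an optional sketch, we can load them into a dataframe and look at the top candidates:

# Inspect the full search results
results = pd.DataFrame(gs.cv_results_)
print(results[['params', 'mean_test_score', 'rank_test_score']]
      .sort_values('rank_test_score')
      .head())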
You’ll notice that this score is lower than our initial model’s, but remember that best_score_ is the mean cross-validated score on our training data only. We still need to plug the ‘best’ parameters above into a newly instantiated model, fit our training data, and score on the test set.
# Instantiate new model with 'best' parameters
svc_2 = SVC(C=0.2, degree=1, gamma='scale', kernel='linear')

# Fit svc_2
svc_2.fit(X_train_scaled, y_train)

# Score test data
svc_2.score(X_test_scaled, y_test)
# Output: 0.986013986013986
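As a side note, re-instantiating isn’t strictly required: with the default refit=True, GridSearchCV refits the best pipeline on the full training set, so we could score the search object directly. A quick sketch (note that the pipeline scales internally, so we pass the unscaled test data):

# Score the refit best pipeline straight from the search object
gs.score(X_test, y_test)
# or equivalently:
gs.best_estimator_.score(X_test, y_test)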
So, with our small dataset we were able to improve our model’s test accuracy by ~0.004 using GridSearch and pipelines…not much, but it’s an increase!
Finally, there are advantages and disadvantages to SVMs, as with any machine learning technique. I found this link to have a solid breakdown of SVMs along with other machine learning models.