Sparse regression with sklearn

Predicting Breast Cancer

Sklearn includes the Wisconsin breast cancer dataset. It associates the medical outcome of each tumor observation (benign or malignant) with several characteristics of the tumor. Can a machine learn to predict whether a tumor is benign or malignant?

Import the Breast Cancer Dataset from sklearn. Describe it.

import sklearn
import sklearn.datasets
# the as_frame option makes the function return a dataframe
dataset = sklearn.datasets.load_breast_cancer(as_frame=True)
data = dataset['data']
target = dataset['target']
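
The prompt also asks for a description of the dataset. A minimal sketch of one way to do that, using only attributes of the object returned by `load_breast_cancer` (the variable name `class_counts` is introduced here for illustration):

```python
import sklearn.datasets

dataset = sklearn.datasets.load_breast_cancer(as_frame=True)
data = dataset["data"]
target = dataset["target"]

# number of observations and number of features
print(data.shape)
# label names: target value 0 is the first name, 1 is the second
print(dataset["target_names"])
# number of observations in each class
class_counts = target.value_counts()
print(class_counts)
# beginning of the text description shipped with the dataset
print(dataset["DESCR"][:500])
```

The class counts are worth checking before training: if one class dominates, accuracy alone can be a misleading measure of performance.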

Properly train a linear logistic regression to predict whether a tumor is malignant or benign.

# separate the training set and the testset
import sklearn.model_selection
data_train, data_test, target_train, target_test = sklearn.model_selection.train_test_split(data, target)
# quickly check that the sizes of the samples correspond to what we want:
[e.shape for e in [data_train, data_test, target_train, target_test]]
[(426, 30), (143, 30), (426,), (143,)]
import sklearn.linear_model
model = sklearn.linear_model.LogisticRegression()
model.fit(data_train, target_train)
/opt/conda/envs/escpython/lib/python3.10/site-packages/sklearn/linear_model/_logistic.py:458: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
LogisticRegression()
# We can check the performance out of sample:
model.score(data_test, target_test)
0.8951048951048951
# to know what the scores represent, we can read the doc
# it shows that score is measured by mean accuracy
# i.e. number of correct predictions divided by total number of predictions
model.score?
Signature: model.score(X, y, sample_weight=None)
Docstring:
Return the mean accuracy on the given test data and labels.
In multi-label classification, this is the subset accuracy
which is a harsh metric since you require for each sample that
each label set be correctly predicted.
Parameters
----------
X : array-like of shape (n_samples, n_features)
    Test samples.
y : array-like of shape (n_samples,) or (n_samples, n_outputs)
    True labels for `X`.
sample_weight : array-like of shape (n_samples,), default=None
    Sample weights.
Returns
-------
score : float
    Mean accuracy of ``self.predict(X)`` wrt. `y`.
File:      /opt/conda/envs/escpython/lib/python3.10/site-packages/sklearn/base.py
Type:      method
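
To make the definition of mean accuracy concrete, here is a small self-contained sketch that recomputes it by hand and compares it with `model.score`. The `random_state=0` and `max_iter=10000` values are assumptions chosen for this illustration only (a fixed split, and more room for the solver on unscaled data); they are not part of the exercise above.

```python
import numpy
import sklearn.datasets
import sklearn.linear_model
import sklearn.model_selection

data, target = sklearn.datasets.load_breast_cancer(return_X_y=True, as_frame=True)
data_train, data_test, target_train, target_test = sklearn.model_selection.train_test_split(
    data, target, random_state=0  # assumption: fixed seed so the split is reproducible
)

# assumption: a larger max_iter gives the solver more iterations on unscaled data
model = sklearn.linear_model.LogisticRegression(max_iter=10000)
model.fit(data_train, target_train)

# accuracy = number of correct predictions / total number of predictions
predictions = model.predict(data_test)
manual_accuracy = numpy.mean(predictions == target_test.to_numpy())
builtin_accuracy = model.score(data_test, target_test)
print(manual_accuracy, builtin_accuracy)
```

Both numbers should coincide, since `score` applies the same formula to the predictions of the fitted model.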

Bonus: the warning message suggests scaling the data. Let’s redo the last few steps accordingly.

import sklearn.preprocessing
scaler = sklearn.preprocessing.StandardScaler()
scaler.fit(data)
scaled_data = scaler.transform(data)
# let's repackage in a dataframe
import pandas
scaled_data = pandas.DataFrame(scaled_data, columns=data.columns)
# and check the result has zero mean and unit standard deviation
# (describe() reports 1.00088 rather than 1 because pandas uses the sample standard deviation, ddof=1)
scaled_data.describe()
mean radius mean texture mean perimeter mean area mean smoothness mean compactness mean concavity mean concave points mean symmetry mean fractal dimension ... worst radius worst texture worst perimeter worst area worst smoothness worst compactness worst concavity worst concave points worst symmetry worst fractal dimension
count 5.690000e+02 5.690000e+02 5.690000e+02 5.690000e+02 5.690000e+02 5.690000e+02 5.690000e+02 5.690000e+02 5.690000e+02 5.690000e+02 ... 5.690000e+02 5.690000e+02 5.690000e+02 569.000000 5.690000e+02 5.690000e+02 5.690000e+02 5.690000e+02 5.690000e+02 5.690000e+02
mean -1.373633e-16 6.868164e-17 -1.248757e-16 -2.185325e-16 -8.366672e-16 1.873136e-16 4.995028e-17 -4.995028e-17 1.748260e-16 4.745277e-16 ... -8.241796e-16 1.248757e-17 -3.746271e-16 0.000000 -2.372638e-16 -3.371644e-16 7.492542e-17 2.247763e-16 2.622390e-16 -5.744282e-16
std 1.000880e+00 1.000880e+00 1.000880e+00 1.000880e+00 1.000880e+00 1.000880e+00 1.000880e+00 1.000880e+00 1.000880e+00 1.000880e+00 ... 1.000880e+00 1.000880e+00 1.000880e+00 1.000880 1.000880e+00 1.000880e+00 1.000880e+00 1.000880e+00 1.000880e+00 1.000880e+00
min -2.029648e+00 -2.229249e+00 -1.984504e+00 -1.454443e+00 -3.112085e+00 -1.610136e+00 -1.114873e+00 -1.261820e+00 -2.744117e+00 -1.819865e+00 ... -1.726901e+00 -2.223994e+00 -1.693361e+00 -1.222423 -2.682695e+00 -1.443878e+00 -1.305831e+00 -1.745063e+00 -2.160960e+00 -1.601839e+00
25% -6.893853e-01 -7.259631e-01 -6.919555e-01 -6.671955e-01 -7.109628e-01 -7.470860e-01 -7.437479e-01 -7.379438e-01 -7.032397e-01 -7.226392e-01 ... -6.749213e-01 -7.486293e-01 -6.895783e-01 -0.642136 -6.912304e-01 -6.810833e-01 -7.565142e-01 -7.563999e-01 -6.418637e-01 -6.919118e-01
50% -2.150816e-01 -1.046362e-01 -2.359800e-01 -2.951869e-01 -3.489108e-02 -2.219405e-01 -3.422399e-01 -3.977212e-01 -7.162650e-02 -1.782793e-01 ... -2.690395e-01 -4.351564e-02 -2.859802e-01 -0.341181 -4.684277e-02 -2.695009e-01 -2.182321e-01 -2.234689e-01 -1.274095e-01 -2.164441e-01
75% 4.693926e-01 5.841756e-01 4.996769e-01 3.635073e-01 6.361990e-01 4.938569e-01 5.260619e-01 6.469351e-01 5.307792e-01 4.709834e-01 ... 5.220158e-01 6.583411e-01 5.402790e-01 0.357589 5.975448e-01 5.396688e-01 5.311411e-01 7.125100e-01 4.501382e-01 4.507624e-01
max 3.971288e+00 4.651889e+00 3.976130e+00 5.250529e+00 4.770911e+00 4.568425e+00 4.243589e+00 3.927930e+00 4.484751e+00 4.910919e+00 ... 4.094189e+00 3.885905e+00 4.287337e+00 5.930172 3.955374e+00 5.112877e+00 4.700669e+00 2.685877e+00 6.046041e+00 6.846856e+00

8 rows × 30 columns

# for compatibility with the code above, we save the scaled dataframe as data
data = scaled_data
# and redo the same training
# separate the training set and the test set
data_train, data_test, target_train, target_test = sklearn.model_selection.train_test_split(data, target)
import sklearn.linear_model
model = sklearn.linear_model.LogisticRegression()
model.fit(data_train, target_train) # this time, we don't get any warning
LogisticRegression()
# and the prediction actually improves (which might just be chance, since the split is random)
model.score(data_test, target_test)
0.972027972027972
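
Note that the scaler above is fitted on the whole dataset, so the test observations influence the scaling. A possible variant, sketched here as an assumption rather than as the method used in this exercise, chains the scaler and the model in a pipeline so that the scaling parameters are estimated from the training set only. The `random_state=0` seed is also an assumption, added for reproducibility.

```python
import sklearn.datasets
import sklearn.linear_model
import sklearn.model_selection
import sklearn.pipeline
import sklearn.preprocessing

data, target = sklearn.datasets.load_breast_cancer(return_X_y=True, as_frame=True)
data_train, data_test, target_train, target_test = sklearn.model_selection.train_test_split(
    data, target, random_state=0  # assumption: fixed seed so the split is reproducible
)

# the pipeline fits the scaler on the training data, then fits the model on the scaled data
pipeline = sklearn.pipeline.make_pipeline(
    sklearn.preprocessing.StandardScaler(),
    sklearn.linear_model.LogisticRegression(),
)
pipeline.fit(data_train, target_train)

# at scoring time the test data is scaled with the training means and standard deviations
pipeline_score = pipeline.score(data_test, target_test)
print(pipeline_score)
```

On a dataset of this size the effect on the score is likely small, but the pipeline keeps the test set strictly out of the fitting step.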

Use k-fold cross-validation to validate the model

# because the dataset is relatively small we didn't set aside a validation set
# instead we rely on cross-validation

# we split the dataset in 5 folds
# each fold (20% of the observations) serves in turn as a test set for a model trained on the remaining 80%
kf = sklearn.model_selection.KFold(n_splits=5)
scores = []

for i_train, i_test in kf.split(data):
    
    # i_train and i_test are indices of observations belonging to one of the two datasets
    kf_data_train = data.iloc[i_train,:]
    kf_target_train = target.iloc[i_train]
    
    kf_data_test = data.iloc[i_test,:]
    kf_target_test = target.iloc[i_test]
    
    model_kf = sklearn.linear_model.LogisticRegression()
    
    # we train the model
    model_kf.fit(kf_data_train, kf_target_train)
    
    # and test it
    sc = model_kf.score(kf_data_test, kf_target_test)
    
    scores.append(sc)
    
    print(f"Score: {sc}")
Score: 0.9736842105263158
Score: 0.956140350877193
Score: 0.9824561403508771
Score: 0.9824561403508771
Score: 0.9911504424778761

There is some volatility in the scores, but accuracy stays reliably above 95%.

# to get an estimate of accuracy we can compute the mean:
print(f"KFold validation: mean accuracy {sum(scores)/len(scores)}")
KFold validation: mean accuracy 0.9771774569166279
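
The manual loop above can also be written more compactly with `sklearn.model_selection.cross_val_score`. The following is a sketch under one assumption that differs from the loop: the scaler is placed inside a pipeline, so it is refitted on the training part of each fold instead of once on the full dataset. The scores may therefore differ slightly from those printed above.

```python
import sklearn.datasets
import sklearn.linear_model
import sklearn.model_selection
import sklearn.pipeline
import sklearn.preprocessing

data, target = sklearn.datasets.load_breast_cancer(return_X_y=True, as_frame=True)

# assumption: scaling is done inside the pipeline, fold by fold
pipeline = sklearn.pipeline.make_pipeline(
    sklearn.preprocessing.StandardScaler(),
    sklearn.linear_model.LogisticRegression(),
)

# passing a KFold object reproduces the same unshuffled 5-fold split as the loop
kf = sklearn.model_selection.KFold(n_splits=5)
cv_scores = sklearn.model_selection.cross_val_score(pipeline, data, target, cv=kf)

print(cv_scores)
print(f"mean accuracy {cv_scores.mean()}")
```

Passing `cv=5` instead of a `KFold` object would make sklearn use stratified folds for a classifier, which keeps the class proportions similar across folds.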

Try with other classifiers. Which one is best?

The dataset being relatively small we can try Support Vector Machines, which are known to generalize well (see discussion here).

We perform a k-fold cross-validation exactly as above.

import sklearn.svm

kf = sklearn.model_selection.KFold(n_splits=5)
scores_svc = []

for i_train, i_test in kf.split(data):
    
    # i_train and i_test are indices of observations belonging to one of the two datasets
    kf_data_train = data.iloc[i_train,:]
    kf_target_train = target.iloc[i_train]
    
    kf_data_test = data.iloc[i_test,:]
    kf_target_test = target.iloc[i_test]
    
    # we just change the following line
    model_kf = sklearn.svm.SVC()
    
    # we train the model
    model_kf.fit(kf_data_train, kf_target_train)
    
    # and test it
    sc = model_kf.score(kf_data_test, kf_target_test)
    
    scores_svc.append(sc)
    
    print(f"Score: {sc}")
Score: 0.9473684210526315
Score: 0.9649122807017544
Score: 0.9736842105263158
Score: 0.9912280701754386
Score: 0.9734513274336283
# to get an estimate of accuracy we can compute the mean:
print(f"KFold validation: mean accuracy {sum(scores_svc)/len(scores_svc)}")
KFold validation: mean accuracy 0.9701288619779538
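
Since only one line changes between classifiers, the comparison can be extended with a loop. The sketch below is one way to do it; the choice of extra classifiers (k-nearest neighbors and a random forest), the dictionary name `candidates`, and the `random_state=0` seed are assumptions made for illustration, and scaling is again done inside a pipeline.

```python
import sklearn.datasets
import sklearn.ensemble
import sklearn.linear_model
import sklearn.model_selection
import sklearn.neighbors
import sklearn.pipeline
import sklearn.preprocessing
import sklearn.svm

data, target = sklearn.datasets.load_breast_cancer(return_X_y=True, as_frame=True)

# hypothetical selection of classifiers to compare
candidates = {
    "logistic regression": sklearn.linear_model.LogisticRegression(),
    "support vector machine": sklearn.svm.SVC(),
    "k-nearest neighbors": sklearn.neighbors.KNeighborsClassifier(),
    "random forest": sklearn.ensemble.RandomForestClassifier(random_state=0),
}

kf = sklearn.model_selection.KFold(n_splits=5)
mean_scores = {}
for name, classifier in candidates.items():
    # each classifier is evaluated on the same folds, with scaling refitted per fold
    pipeline = sklearn.pipeline.make_pipeline(
        sklearn.preprocessing.StandardScaler(), classifier
    )
    fold_scores = sklearn.model_selection.cross_val_score(pipeline, data, target, cv=kf)
    mean_scores[name] = fold_scores.mean()
    print(f"{name}: mean accuracy {fold_scores.mean():.4f}, std {fold_scores.std():.4f}")
```

Printing the standard deviation next to each mean helps judge whether the gap between two classifiers is larger than the fold-to-fold noise.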

Comment: the performance of the support vector machine is similar to that of logistic regression. To assess the gain, we can compare the difference between the two mean accuracies (0.007) to the standard deviation of the scores of either model. Both standard deviations are greater than 0.01, meaning that the difference between the two models is probably not significant.

# we can compute the standard deviation as follows (found by googling "standard deviation python")

import numpy
print( numpy.std(scores) )
print( numpy.std(scores_svc) )
0.01188053806820839
0.0142415326274357
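
The comparison described in the comment can be checked numerically. This sketch copies the fold scores printed above into two lists and compares the gap between the mean accuracies with the two standard deviations; the names `mean_gap` and `gap_is_small` are introduced here for illustration, and the rule of thumb is the informal one used in the comment, not a formal statistical test.

```python
import numpy

# fold scores copied from the outputs above
scores = [0.9736842105263158, 0.956140350877193, 0.9824561403508771,
          0.9824561403508771, 0.9911504424778761]
scores_svc = [0.9473684210526315, 0.9649122807017544, 0.9736842105263158,
              0.9912280701754386, 0.9734513274336283]

# difference between the mean accuracies of the two models
mean_gap = numpy.mean(scores) - numpy.mean(scores_svc)
std_logistic = numpy.std(scores)
std_svc = numpy.std(scores_svc)

print(f"gap between means: {mean_gap:.4f}")
print(f"std logistic regression: {std_logistic:.4f}, std SVC: {std_svc:.4f}")

# informal check: the gap is smaller than the fold-to-fold variation of either model
gap_is_small = abs(mean_gap) < min(std_logistic, std_svc)
print(gap_is_small)
```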