18. Supervised Learning with scikit-learn

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
plt.style.use('ggplot')

import sklearn
from sklearn import datasets

18.1. Classification

18.1.1. KNN (binary)

18.1.1.1. Load the dataset

vote_raw = pd.read_csv("data/house-votes-84.csv")

vote = vote_raw.copy()
col_names = ['party', 'infants', 'water', 'budget', 'physician', 'salvador',
       'religious', 'satellite', 'aid', 'missile', 'immigration', 'synfuels',
       'education', 'superfund', 'crime', 'duty_free_exports', 'eaa_rsa']
vote.columns = col_names
vote[vote == "?"] = np.nan # replace "?" with NaN
vote = vote.dropna()
for i in col_names[1:]:
    vote[i] = vote[i].replace({"y": 1, "n": 0})
vote
party infants water budget physician salvador religious satellite aid missile immigration synfuels education superfund crime duty_free_exports eaa_rsa
4 democrat 0 1 1 0 1 1 0 0 0 0 0 0 1 1 1 1
7 republican 0 1 0 1 1 1 0 0 0 0 0 1 1 1 0 1
18 democrat 1 1 1 0 0 0 1 1 1 0 1 0 0 0 1 1
22 democrat 1 1 1 0 0 0 1 1 1 0 0 0 0 0 1 1
24 democrat 1 0 1 0 0 0 1 1 1 1 0 0 0 0 1 1
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
422 democrat 0 1 1 0 0 1 1 1 1 0 1 0 0 1 1 1
425 democrat 1 0 1 0 0 0 1 1 1 1 0 0 0 0 1 1
426 republican 0 0 0 1 1 1 1 1 0 1 0 1 1 1 0 1
429 republican 0 0 1 1 1 1 0 0 1 1 0 1 1 1 0 1
430 democrat 0 0 1 0 0 0 1 1 1 1 0 0 0 0 0 1

232 rows × 17 columns

vote.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 232 entries, 4 to 430
Data columns (total 17 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   party              232 non-null    object
 1   infants            232 non-null    int64 
 2   water              232 non-null    int64 
 3   budget             232 non-null    int64 
 4   physician          232 non-null    int64 
 5   salvador           232 non-null    int64 
 6   religious          232 non-null    int64 
 7   satellite          232 non-null    int64 
 8   aid                232 non-null    int64 
 9   missile            232 non-null    int64 
 10  immigration        232 non-null    int64 
 11  synfuels           232 non-null    int64 
 12  education          232 non-null    int64 
 13  superfund          232 non-null    int64 
 14  crime              232 non-null    int64 
 15  duty_free_exports  232 non-null    int64 
 16  eaa_rsa            232 non-null    int64 
dtypes: int64(16), object(1)
memory usage: 32.6+ KB
  • The data has 232 rows; each row is one member of Congress.

  • The target y is party (the legislator's party: Democrat or Republican).

  • All remaining columns are features x, each recording the legislator's vote on a major issue. Take infants, for example: it records whether the legislator voted yes (1) or no (0) on the infants issue.

  • The task is to predict a legislator's party from these votes.
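
  • Before modeling, it is worth a quick look at the class balance (a minimal sketch):

vote["party"].value_counts()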

18.1.1.2. The simplest workflow

  • Let's start with the simplest workflow:

    • Split into train/test.

    • ~~Define the pipeline.~~

      • ~~Define the preprocessing steps.~~

      • Define the classifier.

    • ~~Hyper-parameter tuning~~

      • ~~grid search~~

      • ~~random search~~

    • Fit on the whole training set.

    • Predict on the testing set.

    • Evaluate model performance

      • threshold

      • non-threshold

    • Explore the fitted details (e.g. parameters after fitting, …)

18.1.1.2.1. Split train/test
# split train/test
X = vote.drop("party", axis = 1)
y = vote["party"]

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    X, 
    y, 
    test_size = 0.3, 
    random_state = 21, 
    stratify = y
)
  • Check how many rows end up in X_train and X_test:

print(f"The shape of X_train is: {X_train.shape}")
print(f"The shape of X_test is: {X_test.shape}")
The shape of X_train is: (162, 16)
The shape of X_test is: (70, 16)
  • Check that y_train and y_test have the same class distribution (they should, since we stratified):

print(f"republican% in whole data set is: {(y == 'republican').sum()/y.size}")
print(f"republican% in training set is: {(y_train == 'republican').sum()/y_train.size}")
print(f"republican% in testing set is: {(y_test == 'republican').sum()/y_test.size}")
republican% in whole data set is: 0.46551724137931033
republican% in training set is: 0.46296296296296297
republican% in testing set is: 0.4714285714285714
18.1.1.2.2. Build the pipeline
  • A pipeline is preprocessing + model; here we only do the model part.

# define the pipeline
## preprocessing steps (skipped)
## classifier
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors = 5)
  • That is all it takes to define the model (KNN, with n_neighbors set to 5).

  • We can read its docstring for the details of this classifier:

KNeighborsClassifier?
Init signature:
KNeighborsClassifier(
    n_neighbors=5,
    *,
    weights='uniform',
    algorithm='auto',
    leaf_size=30,
    p=2,
    metric='minkowski',
    metric_params=None,
    n_jobs=None,
)
Docstring:     
Classifier implementing the k-nearest neighbors vote.

Read more in the :ref:`User Guide <classification>`.

Parameters
----------
n_neighbors : int, default=5
    Number of neighbors to use by default for :meth:`kneighbors` queries.

weights : {'uniform', 'distance'} or callable, default='uniform'
    Weight function used in prediction.  Possible values:

    - 'uniform' : uniform weights.  All points in each neighborhood
      are weighted equally.
    - 'distance' : weight points by the inverse of their distance.
      in this case, closer neighbors of a query point will have a
      greater influence than neighbors which are further away.
    - [callable] : a user-defined function which accepts an
      array of distances, and returns an array of the same shape
      containing the weights.

algorithm : {'auto', 'ball_tree', 'kd_tree', 'brute'}, default='auto'
    Algorithm used to compute the nearest neighbors:

    - 'ball_tree' will use :class:`BallTree`
    - 'kd_tree' will use :class:`KDTree`
    - 'brute' will use a brute-force search.
    - 'auto' will attempt to decide the most appropriate algorithm
      based on the values passed to :meth:`fit` method.

    Note: fitting on sparse input will override the setting of
    this parameter, using brute force.

leaf_size : int, default=30
    Leaf size passed to BallTree or KDTree.  This can affect the
    speed of the construction and query, as well as the memory
    required to store the tree.  The optimal value depends on the
    nature of the problem.

p : int, default=2
    Power parameter for the Minkowski metric. When p = 1, this is
    equivalent to using manhattan_distance (l1), and euclidean_distance
    (l2) for p = 2. For arbitrary p, minkowski_distance (l_p) is used.

metric : str or callable, default='minkowski'
    The distance metric to use for the tree.  The default metric is
    minkowski, and with p=2 is equivalent to the standard Euclidean
    metric. For a list of available metrics, see the documentation of
    :class:`~sklearn.metrics.DistanceMetric`.
    If metric is "precomputed", X is assumed to be a distance matrix and
    must be square during fit. X may be a :term:`sparse graph`,
    in which case only "nonzero" elements may be considered neighbors.

metric_params : dict, default=None
    Additional keyword arguments for the metric function.

n_jobs : int, default=None
    The number of parallel jobs to run for neighbors search.
    ``None`` means 1 unless in a :obj:`joblib.parallel_backend` context.
    ``-1`` means using all processors. See :term:`Glossary <n_jobs>`
    for more details.
    Doesn't affect :meth:`fit` method.

Attributes
----------
classes_ : array of shape (n_classes,)
    Class labels known to the classifier

effective_metric_ : str or callble
    The distance metric used. It will be same as the `metric` parameter
    or a synonym of it, e.g. 'euclidean' if the `metric` parameter set to
    'minkowski' and `p` parameter set to 2.

effective_metric_params_ : dict
    Additional keyword arguments for the metric function. For most metrics
    will be same with `metric_params` parameter, but may also contain the
    `p` parameter value if the `effective_metric_` attribute is set to
    'minkowski'.

n_features_in_ : int
    Number of features seen during :term:`fit`.

    .. versionadded:: 0.24

feature_names_in_ : ndarray of shape (`n_features_in_`,)
    Names of features seen during :term:`fit`. Defined only when `X`
    has feature names that are all strings.

    .. versionadded:: 1.0

n_samples_fit_ : int
    Number of samples in the fitted data.

outputs_2d_ : bool
    False when `y`'s shape is (n_samples, ) or (n_samples, 1) during fit
    otherwise True.

See Also
--------
RadiusNeighborsClassifier: Classifier based on neighbors within a fixed radius.
KNeighborsRegressor: Regression based on k-nearest neighbors.
RadiusNeighborsRegressor: Regression based on neighbors within a fixed radius.
NearestNeighbors: Unsupervised learner for implementing neighbor searches.

Notes
-----
See :ref:`Nearest Neighbors <neighbors>` in the online documentation
for a discussion of the choice of ``algorithm`` and ``leaf_size``.

.. warning::

   Regarding the Nearest Neighbors algorithms, if it is found that two
   neighbors, neighbor `k+1` and `k`, have identical distances
   but different labels, the results will depend on the ordering of the
   training data.

https://en.wikipedia.org/wiki/K-nearest_neighbor_algorithm

Examples
--------
>>> X = [[0], [1], [2], [3]]
>>> y = [0, 0, 1, 1]
>>> from sklearn.neighbors import KNeighborsClassifier
>>> neigh = KNeighborsClassifier(n_neighbors=3)
>>> neigh.fit(X, y)
KNeighborsClassifier(...)
>>> print(neigh.predict([[1.1]]))
[0]
>>> print(neigh.predict_proba([[0.9]]))
[[0.666... 0.333...]]
File:           /Volumes/GoogleDrive/我的雲端硬碟/0. codepool_python/python_ds/python_ds_env/lib/python3.8/site-packages/sklearn/neighbors/_classification.py
Type:           ABCMeta
Subclasses:     
18.1.1.2.3. Fit on the whole training set
knn.fit(X_train, y_train)
KNeighborsClassifier()
  • After fitting, we can take a quick look at what it learned; for example, the distance metric actually in use (the default minkowski metric with p = 2 is reported as 'euclidean'):

knn.effective_metric_
'euclidean'
18.1.1.2.4. Predict on the testing set
y_pred = knn.predict(X_test)
y_pred_prob = knn.predict_proba(X_test)
  • Look at the predicted labels:

pd.DataFrame(y_pred, columns= ["pred_label"]).head()
pred_label
0 democrat
1 democrat
2 republican
3 democrat
4 republican
  • The predicted label is simply whichever class has the larger predicted probability, democrat or republican. Since there are only two classes, this amounts to a threshold of 0.5: if P(republican) > 0.5, predict republican (taking republican as the positive class). A short sanity check of this appears below.

  • So let's look at the predicted probabilities:

pd.DataFrame(y_pred_prob, columns=knn.classes_).head() # predict_proba columns follow knn.classes_, not y_test.unique()
democrat republican
0 1.0 0.0
1 1.0 0.0
2 0.0 1.0
3 1.0 0.0
4 0.0 1.0
  • The probabilities come from the k nearest neighbors: the fraction of them that are Democrats, and the fraction that are Republicans.

  • So for the row at index = 0, all k nearest neighbors of that legislator are Democrats.
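
  • As promised, a minimal sanity check of the 0.5-threshold claim (a sketch; column 1 of y_pred_prob is P(republican), following knn.classes_):

manual_pred = np.where(y_pred_prob[:, 1] > 0.5, "republican", "democrat")
print((manual_pred == y_pred).all())  # True: thresholding reproduces knn.predict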

18.1.1.2.5. Evaluate performance
from sklearn.metrics import confusion_matrix, classification_report, roc_curve, roc_auc_score
# evaluate
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
[[32  5]
 [ 2 31]]
              precision    recall  f1-score   support

    democrat       0.94      0.86      0.90        37
  republican       0.86      0.94      0.90        33

    accuracy                           0.90        70
   macro avg       0.90      0.90      0.90        70
weighted avg       0.90      0.90      0.90        70
  • If you take republican as the positive class, read the second row:

    • precision: 0.86 means that, of the people predicted to be republican, 86% really are.

    • recall: 0.94 means that, of the people who actually are republican, 94% were caught.

    • f1-score: the harmonic mean of precision and recall.

  • The accuracy row shows an overall accuracy of 0.90. These numbers can be recomputed from the confusion matrix above, as sketched below.
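
  • A quick recomputation from the printed confusion matrix (republican as positive):

tn, fp, fn, tp = 32, 5, 2, 31                       # read off confusion_matrix(y_test, y_pred)
precision = tp / (tp + fp)                          # 31/36 ≈ 0.86
recall = tp / (tp + fn)                             # 31/33 ≈ 0.94
f1 = 2 * precision * recall / (precision + recall)  # ≈ 0.90
accuracy = (tp + tn) / (tn + fp + fn + tp)          # 63/70 = 0.90
print(precision, recall, f1, accuracy)

  • For threshold-free evaluation, look at the predicted probabilities again: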

y_pred_prob
array([[1. , 0. ],
       [1. , 0. ],
       [0. , 1. ],
       [1. , 0. ],
       [0. , 1. ],
       [1. , 0. ],
       [1. , 0. ],
       [0. , 1. ],
       [1. , 0. ],
       [1. , 0. ],
       [0. , 1. ],
       [0. , 1. ],
       [1. , 0. ],
       [1. , 0. ],
       [1. , 0. ],
       [0.4, 0.6],
       [1. , 0. ],
       [1. , 0. ],
       [1. , 0. ],
       [0. , 1. ],
       [1. , 0. ],
       [0. , 1. ],
       [0. , 1. ],
       [1. , 0. ],
       [0. , 1. ],
       [0. , 1. ],
       [1. , 0. ],
       [1. , 0. ],
       [0.8, 0.2],
       [1. , 0. ],
       [0.4, 0.6],
       [1. , 0. ],
       [0. , 1. ],
       [1. , 0. ],
       [1. , 0. ],
       [0. , 1. ],
       [0. , 1. ],
       [0. , 1. ],
       [1. , 0. ],
       [1. , 0. ],
       [0. , 1. ],
       [1. , 0. ],
       [0. , 1. ],
       [0. , 1. ],
       [0. , 1. ],
       [1. , 0. ],
       [0. , 1. ],
       [0. , 1. ],
       [0. , 1. ],
       [0. , 1. ],
       [1. , 0. ],
       [0. , 1. ],
       [0. , 1. ],
       [1. , 0. ],
       [0.8, 0.2],
       [0. , 1. ],
       [0. , 1. ],
       [0.2, 0.8],
       [0.8, 0.2],
       [0. , 1. ],
       [1. , 0. ],
       [0.4, 0.6],
       [1. , 0. ],
       [0. , 1. ],
       [0. , 1. ],
       [1. , 0. ],
       [0. , 1. ],
       [0. , 1. ],
       [1. , 0. ],
       [0. , 1. ]])
y_test_trans = (y_test=="republican") # republican = 1, democrat = 0
y_pred_prob_trans = y_pred_prob[:,1] # predicted probability of republican

auc = roc_auc_score(y_test_trans, y_pred_prob_trans)

print(f"auc: {auc}")

fpr, tpr, thresholds = roc_curve(y_test_trans, y_pred_prob_trans)
plt.plot([0,1],[0, 1], "k--")
plt.plot(fpr, tpr, label = "KNN (K = 5)")
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("KNN ROC Curve")
plt.show()
auc: 0.9492219492219492
../_images/datacamp_su_36_1.png
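
  • Note the ROC curve has only a few bends: with k = 5 and uniform weights, KNN's predicted probabilities can only take values in {0, 0.2, 0.4, 0.6, 0.8, 1}, so roc_curve finds just a handful of distinct thresholds. You can confirm with the thresholds returned above:

print(thresholds)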

18.1.1.3. The full workflow

  • Now the full workflow:

    • Split into train/test.

    • Define the pipeline.

      • Define the preprocessing steps.

      • Define the classifier.

    • Hyper-parameter tuning

      • grid search

      • random search

    • Fit on the whole training set.

    • Predict on the testing set.

    • Evaluate model performance

      • threshold

      • non-threshold

    • Explore the fitted details (e.g. parameters after fitting, …)

18.1.1.3.1. Split train/test
# split train/test
X = vote.drop("party", axis = 1)
y = vote["party"]

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    X, 
    y, 
    test_size = 0.3, 
    random_state = 21, 
    stratify = y
)
18.1.1.3.2. Build the pipeline
  • For an algorithm like KNN that relies on Euclidean distance, the features must be standardized before any distances are computed, so the pipeline looks like this:

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier

# preprocess
scaler = StandardScaler()

# model
knn = KNeighborsClassifier()

# pipeline
my_pipe = Pipeline([
    ("scaler", scaler),
    ("knn", knn)
])
18.1.1.3.3. Hyper-parameter tuning
  • First, grid search:

from sklearn.model_selection import GridSearchCV

parameters = {"knn__n_neighbors": np.arange(1, 50)}
grid_cv = GridSearchCV(my_pipe, 
                       param_grid = parameters, 
                       cv = 5,
                       scoring='roc_auc');
grid_cv.fit(X_train, y_train);
  • Note the key in parameters: the "knn" in "knn__n_neighbors" is the step name given in my_pipe ("knn"), and "n_neighbors" is the hyper-parameter name.

  • After fitting, grid_cv holds the search results; we can look at the best parameter:

grid_cv.best_params_
{'knn__n_neighbors': 17}
  • So the best parameter is n_neighbors = 17. The workflow also lists random search; a sketch with RandomizedSearchCV follows.
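
  • A minimal RandomizedSearchCV sketch (n_iter and random_state here are illustrative choices, not from the original run):

from sklearn.model_selection import RandomizedSearchCV

rand_cv = RandomizedSearchCV(
    my_pipe,
    param_distributions = {"knn__n_neighbors": np.arange(1, 50)},
    n_iter = 20,          # randomly sample 20 of the 49 candidates
    cv = 5,
    scoring = 'roc_auc',
    random_state = 42
)
rand_cv.fit(X_train, y_train)
print(rand_cv.best_params_)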

?GridSearchCV
Init signature:
GridSearchCV(
    estimator,
    param_grid,
    *,
    scoring=None,
    n_jobs=None,
    refit=True,
    cv=None,
    verbose=0,
    pre_dispatch='2*n_jobs',
    error_score=nan,
    return_train_score=False,
)
Docstring:     
Exhaustive search over specified parameter values for an estimator.

Important members are fit, predict.

GridSearchCV implements a "fit" and a "score" method.
It also implements "score_samples", "predict", "predict_proba",
"decision_function", "transform" and "inverse_transform" if they are
implemented in the estimator used.

The parameters of the estimator used to apply these methods are optimized
by cross-validated grid-search over a parameter grid.

Read more in the :ref:`User Guide <grid_search>`.

Parameters
----------
estimator : estimator object
    This is assumed to implement the scikit-learn estimator interface.
    Either estimator needs to provide a ``score`` function,
    or ``scoring`` must be passed.

param_grid : dict or list of dictionaries
    Dictionary with parameters names (`str`) as keys and lists of
    parameter settings to try as values, or a list of such
    dictionaries, in which case the grids spanned by each dictionary
    in the list are explored. This enables searching over any sequence
    of parameter settings.

scoring : str, callable, list, tuple or dict, default=None
    Strategy to evaluate the performance of the cross-validated model on
    the test set.

    If `scoring` represents a single score, one can use:

    - a single string (see :ref:`scoring_parameter`);
    - a callable (see :ref:`scoring`) that returns a single value.

    If `scoring` represents multiple scores, one can use:

    - a list or tuple of unique strings;
    - a callable returning a dictionary where the keys are the metric
      names and the values are the metric scores;
    - a dictionary with metric names as keys and callables a values.

    See :ref:`multimetric_grid_search` for an example.

n_jobs : int, default=None
    Number of jobs to run in parallel.
    ``None`` means 1 unless in a :obj:`joblib.parallel_backend` context.
    ``-1`` means using all processors. See :term:`Glossary <n_jobs>`
    for more details.

    .. versionchanged:: v0.20
       `n_jobs` default changed from 1 to None

refit : bool, str, or callable, default=True
    Refit an estimator using the best found parameters on the whole
    dataset.

    For multiple metric evaluation, this needs to be a `str` denoting the
    scorer that would be used to find the best parameters for refitting
    the estimator at the end.

    Where there are considerations other than maximum score in
    choosing a best estimator, ``refit`` can be set to a function which
    returns the selected ``best_index_`` given ``cv_results_``. In that
    case, the ``best_estimator_`` and ``best_params_`` will be set
    according to the returned ``best_index_`` while the ``best_score_``
    attribute will not be available.

    The refitted estimator is made available at the ``best_estimator_``
    attribute and permits using ``predict`` directly on this
    ``GridSearchCV`` instance.

    Also for multiple metric evaluation, the attributes ``best_index_``,
    ``best_score_`` and ``best_params_`` will only be available if
    ``refit`` is set and all of them will be determined w.r.t this specific
    scorer.

    See ``scoring`` parameter to know more about multiple metric
    evaluation.

    .. versionchanged:: 0.20
        Support for callable added.

cv : int, cross-validation generator or an iterable, default=None
    Determines the cross-validation splitting strategy.
    Possible inputs for cv are:

    - None, to use the default 5-fold cross validation,
    - integer, to specify the number of folds in a `(Stratified)KFold`,
    - :term:`CV splitter`,
    - An iterable yielding (train, test) splits as arrays of indices.

    For integer/None inputs, if the estimator is a classifier and ``y`` is
    either binary or multiclass, :class:`StratifiedKFold` is used. In all
    other cases, :class:`KFold` is used. These splitters are instantiated
    with `shuffle=False` so the splits will be the same across calls.

    Refer :ref:`User Guide <cross_validation>` for the various
    cross-validation strategies that can be used here.

    .. versionchanged:: 0.22
        ``cv`` default value if None changed from 3-fold to 5-fold.

verbose : int
    Controls the verbosity: the higher, the more messages.

    - >1 : the computation time for each fold and parameter candidate is
      displayed;
    - >2 : the score is also displayed;
    - >3 : the fold and candidate parameter indexes are also displayed
      together with the starting time of the computation.

pre_dispatch : int, or str, default='2*n_jobs'
    Controls the number of jobs that get dispatched during parallel
    execution. Reducing this number can be useful to avoid an
    explosion of memory consumption when more jobs get dispatched
    than CPUs can process. This parameter can be:

        - None, in which case all the jobs are immediately
          created and spawned. Use this for lightweight and
          fast-running jobs, to avoid delays due to on-demand
          spawning of the jobs

        - An int, giving the exact number of total jobs that are
          spawned

        - A str, giving an expression as a function of n_jobs,
          as in '2*n_jobs'

error_score : 'raise' or numeric, default=np.nan
    Value to assign to the score if an error occurs in estimator fitting.
    If set to 'raise', the error is raised. If a numeric value is given,
    FitFailedWarning is raised. This parameter does not affect the refit
    step, which will always raise the error.

return_train_score : bool, default=False
    If ``False``, the ``cv_results_`` attribute will not include training
    scores.
    Computing training scores is used to get insights on how different
    parameter settings impact the overfitting/underfitting trade-off.
    However computing the scores on the training set can be computationally
    expensive and is not strictly required to select the parameters that
    yield the best generalization performance.

    .. versionadded:: 0.19

    .. versionchanged:: 0.21
        Default value was changed from ``True`` to ``False``

Attributes
----------
cv_results_ : dict of numpy (masked) ndarrays
    A dict with keys as column headers and values as columns, that can be
    imported into a pandas ``DataFrame``.

    For instance the below given table

    +------------+-----------+------------+-----------------+---+---------+
    |param_kernel|param_gamma|param_degree|split0_test_score|...|rank_t...|
    +============+===========+============+=================+===+=========+
    |  'poly'    |     --    |      2     |       0.80      |...|    2    |
    +------------+-----------+------------+-----------------+---+---------+
    |  'poly'    |     --    |      3     |       0.70      |...|    4    |
    +------------+-----------+------------+-----------------+---+---------+
    |  'rbf'     |     0.1   |     --     |       0.80      |...|    3    |
    +------------+-----------+------------+-----------------+---+---------+
    |  'rbf'     |     0.2   |     --     |       0.93      |...|    1    |
    +------------+-----------+------------+-----------------+---+---------+

    will be represented by a ``cv_results_`` dict of::

        {
        'param_kernel': masked_array(data = ['poly', 'poly', 'rbf', 'rbf'],
                                     mask = [False False False False]...)
        'param_gamma': masked_array(data = [-- -- 0.1 0.2],
                                    mask = [ True  True False False]...),
        'param_degree': masked_array(data = [2.0 3.0 -- --],
                                     mask = [False False  True  True]...),
        'split0_test_score'  : [0.80, 0.70, 0.80, 0.93],
        'split1_test_score'  : [0.82, 0.50, 0.70, 0.78],
        'mean_test_score'    : [0.81, 0.60, 0.75, 0.85],
        'std_test_score'     : [0.01, 0.10, 0.05, 0.08],
        'rank_test_score'    : [2, 4, 3, 1],
        'split0_train_score' : [0.80, 0.92, 0.70, 0.93],
        'split1_train_score' : [0.82, 0.55, 0.70, 0.87],
        'mean_train_score'   : [0.81, 0.74, 0.70, 0.90],
        'std_train_score'    : [0.01, 0.19, 0.00, 0.03],
        'mean_fit_time'      : [0.73, 0.63, 0.43, 0.49],
        'std_fit_time'       : [0.01, 0.02, 0.01, 0.01],
        'mean_score_time'    : [0.01, 0.06, 0.04, 0.04],
        'std_score_time'     : [0.00, 0.00, 0.00, 0.01],
        'params'             : [{'kernel': 'poly', 'degree': 2}, ...],
        }

    NOTE

    The key ``'params'`` is used to store a list of parameter
    settings dicts for all the parameter candidates.

    The ``mean_fit_time``, ``std_fit_time``, ``mean_score_time`` and
    ``std_score_time`` are all in seconds.

    For multi-metric evaluation, the scores for all the scorers are
    available in the ``cv_results_`` dict at the keys ending with that
    scorer's name (``'_<scorer_name>'``) instead of ``'_score'`` shown
    above. ('split0_test_precision', 'mean_train_precision' etc.)

best_estimator_ : estimator
    Estimator that was chosen by the search, i.e. estimator
    which gave highest score (or smallest loss if specified)
    on the left out data. Not available if ``refit=False``.

    See ``refit`` parameter for more information on allowed values.

best_score_ : float
    Mean cross-validated score of the best_estimator

    For multi-metric evaluation, this is present only if ``refit`` is
    specified.

    This attribute is not available if ``refit`` is a function.

best_params_ : dict
    Parameter setting that gave the best results on the hold out data.

    For multi-metric evaluation, this is present only if ``refit`` is
    specified.

best_index_ : int
    The index (of the ``cv_results_`` arrays) which corresponds to the best
    candidate parameter setting.

    The dict at ``search.cv_results_['params'][search.best_index_]`` gives
    the parameter setting for the best model, that gives the highest
    mean score (``search.best_score_``).

    For multi-metric evaluation, this is present only if ``refit`` is
    specified.

scorer_ : function or a dict
    Scorer function used on the held out data to choose the best
    parameters for the model.

    For multi-metric evaluation, this attribute holds the validated
    ``scoring`` dict which maps the scorer key to the scorer callable.

n_splits_ : int
    The number of cross-validation splits (folds/iterations).

refit_time_ : float
    Seconds used for refitting the best model on the whole dataset.

    This is present only if ``refit`` is not False.

    .. versionadded:: 0.20

multimetric_ : bool
    Whether or not the scorers compute several metrics.

classes_ : ndarray of shape (n_classes,)
    The classes labels. This is present only if ``refit`` is specified and
    the underlying estimator is a classifier.

n_features_in_ : int
    Number of features seen during :term:`fit`. Only defined if
    `best_estimator_` is defined (see the documentation for the `refit`
    parameter for more details) and that `best_estimator_` exposes
    `n_features_in_` when fit.

    .. versionadded:: 0.24

feature_names_in_ : ndarray of shape (`n_features_in_`,)
    Names of features seen during :term:`fit`. Only defined if
    `best_estimator_` is defined (see the documentation for the `refit`
    parameter for more details) and that `best_estimator_` exposes
    `feature_names_in_` when fit.

    .. versionadded:: 1.0

Notes
-----
The parameters selected are those that maximize the score of the left out
data, unless an explicit score is passed in which case it is used instead.

If `n_jobs` was set to a value higher than one, the data is copied for each
point in the grid (and not `n_jobs` times). This is done for efficiency
reasons if individual jobs take very little time, but may raise errors if
the dataset is large and not enough memory is available.  A workaround in
this case is to set `pre_dispatch`. Then, the memory is copied only
`pre_dispatch` many times. A reasonable value for `pre_dispatch` is `2 *
n_jobs`.

See Also
---------
ParameterGrid : Generates all the combinations of a hyperparameter grid.
train_test_split : Utility function to split the data into a development
    set usable for fitting a GridSearchCV instance and an evaluation set
    for its final evaluation.
sklearn.metrics.make_scorer : Make a scorer from a performance metric or
    loss function.

Examples
--------
>>> from sklearn import svm, datasets
>>> from sklearn.model_selection import GridSearchCV
>>> iris = datasets.load_iris()
>>> parameters = {'kernel':('linear', 'rbf'), 'C':[1, 10]}
>>> svc = svm.SVC()
>>> clf = GridSearchCV(svc, parameters)
>>> clf.fit(iris.data, iris.target)
GridSearchCV(estimator=SVC(),
             param_grid={'C': [1, 10], 'kernel': ('linear', 'rbf')})
>>> sorted(clf.cv_results_.keys())
['mean_fit_time', 'mean_score_time', 'mean_test_score',...
 'param_C', 'param_kernel', 'params',...
 'rank_test_score', 'split0_test_score',...
 'split2_test_score', ...
 'std_fit_time', 'std_score_time', 'std_test_score']
File:           /Volumes/GoogleDrive/我的雲端硬碟/0. codepool_python/python_ds/python_ds_env/lib/python3.8/site-packages/sklearn/model_selection/_search.py
Type:           ABCMeta
Subclasses:     
grid_cv.cv_results_
{'mean_fit_time': array([0.00410557, 0.00349936, 0.00289578, 0.00251188, 0.00247269,
        0.00252709, 0.00272069, 0.00246129, 0.00306287, 0.00371184,
        0.00294638, 0.00308042, 0.00299649, 0.00300522, 0.00273948,
        0.00361261, 0.00310259, 0.00313864, 0.00295277, 0.00384784,
        0.00304828, 0.00281844, 0.00319686, 0.00302219, 0.00305853,
        0.00290966, 0.00289521, 0.00281544, 0.00308471, 0.00267277,
        0.00285001, 0.0030632 , 0.00333405, 0.00302234, 0.00311818,
        0.00305524, 0.00284081, 0.00307274, 0.00295916, 0.00310979,
        0.00310645, 0.00305629, 0.00277839, 0.00244412, 0.00253983,
        0.00268669, 0.00238819, 0.00241308, 0.00232444]),
 'std_fit_time': array([7.88619106e-04, 9.42372814e-04, 5.09592826e-04, 2.05940145e-04,
        1.30329147e-04, 1.39555875e-04, 4.71658495e-04, 9.38250994e-05,
        7.01362537e-04, 9.40648512e-04, 5.46725004e-04, 6.70972481e-04,
        6.49284812e-04, 6.08038812e-04, 4.14937119e-04, 5.44774122e-04,
        4.02125639e-04, 7.85894654e-04, 4.22827063e-04, 3.05221040e-04,
        4.93310674e-04, 4.91984687e-04, 6.07253808e-04, 3.68679192e-04,
        7.83753444e-04, 3.84013546e-04, 5.97170735e-04, 3.48093033e-04,
        5.59792474e-04, 1.83740436e-04, 5.59706703e-04, 6.29626715e-04,
        6.91515668e-04, 5.43012236e-04, 4.23518716e-04, 4.46913336e-04,
        2.66436123e-04, 5.09237350e-04, 2.87175098e-04, 5.41384761e-04,
        3.48689486e-04, 4.61566522e-04, 4.62282743e-05, 1.11534301e-04,
        2.85003914e-04, 3.43765774e-04, 4.60957275e-05, 6.26801311e-05,
        2.83467783e-04]),
 'mean_score_time': array([0.0031558 , 0.00250463, 0.00250649, 0.00215402, 0.00216942,
        0.00229611, 0.00228481, 0.00209074, 0.00282593, 0.00295038,
        0.00259624, 0.00252881, 0.00267229, 0.00262327, 0.00244746,
        0.00306764, 0.00275512, 0.00263891, 0.00269318, 0.00277519,
        0.00260739, 0.00271072, 0.00273805, 0.0025115 , 0.00240846,
        0.00260949, 0.00252876, 0.00271316, 0.00271668, 0.00248637,
        0.00258036, 0.00288286, 0.0027564 , 0.00256944, 0.00278878,
        0.00260425, 0.0026751 , 0.00264025, 0.00280237, 0.0027246 ,
        0.00317144, 0.00274553, 0.00256987, 0.00226922, 0.00251756,
        0.00245085, 0.0023438 , 0.00224037, 0.00238848]),
 'std_score_time': array([1.11923762e-03, 5.02003516e-04, 6.28364240e-04, 2.53515694e-04,
        1.18820261e-04, 2.69056802e-04, 2.76645180e-04, 7.58113199e-05,
        6.51008842e-04, 6.12322828e-04, 5.13428608e-04, 4.03309776e-04,
        4.90662646e-04, 4.82464803e-04, 2.78491622e-04, 2.98720787e-04,
        3.83199448e-04, 4.57812113e-04, 3.96594125e-04, 4.14270047e-04,
        4.59691211e-04, 4.23115014e-04, 3.26109745e-04, 2.89609198e-04,
        2.27550939e-04, 4.39541240e-04, 3.07336423e-04, 6.50968867e-04,
        3.49868736e-04, 1.77414479e-04, 2.67554022e-04, 4.83264888e-04,
        4.17323549e-04, 2.95681527e-04, 2.24535524e-04, 3.38101453e-04,
        2.95083632e-04, 3.19394449e-04, 4.91383988e-04, 3.39971364e-04,
        5.14225423e-04, 4.11409220e-04, 9.38881830e-05, 3.99615144e-05,
        2.96201578e-04, 2.11339188e-04, 8.54063938e-05, 6.91232000e-05,
        3.91190910e-04]),
 'param_knn__n_neighbors': masked_array(data=[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16,
                    17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30,
                    31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44,
                    45, 46, 47, 48, 49],
              mask=[False, False, False, False, False, False, False, False,
                    False, False, False, False, False, False, False, False,
                    False, False, False, False, False, False, False, False,
                    False, False, False, False, False, False, False, False,
                    False, False, False, False, False, False, False, False,
                    False, False, False, False, False, False, False, False,
                    False],
        fill_value='?',
             dtype=object),
 'params': [{'knn__n_neighbors': 1},
  {'knn__n_neighbors': 2},
  {'knn__n_neighbors': 3},
  {'knn__n_neighbors': 4},
  {'knn__n_neighbors': 5},
  {'knn__n_neighbors': 6},
  {'knn__n_neighbors': 7},
  {'knn__n_neighbors': 8},
  {'knn__n_neighbors': 9},
  {'knn__n_neighbors': 10},
  {'knn__n_neighbors': 11},
  {'knn__n_neighbors': 12},
  {'knn__n_neighbors': 13},
  {'knn__n_neighbors': 14},
  {'knn__n_neighbors': 15},
  {'knn__n_neighbors': 16},
  {'knn__n_neighbors': 17},
  {'knn__n_neighbors': 18},
  {'knn__n_neighbors': 19},
  {'knn__n_neighbors': 20},
  {'knn__n_neighbors': 21},
  {'knn__n_neighbors': 22},
  {'knn__n_neighbors': 23},
  {'knn__n_neighbors': 24},
  {'knn__n_neighbors': 25},
  {'knn__n_neighbors': 26},
  {'knn__n_neighbors': 27},
  {'knn__n_neighbors': 28},
  {'knn__n_neighbors': 29},
  {'knn__n_neighbors': 30},
  {'knn__n_neighbors': 31},
  {'knn__n_neighbors': 32},
  {'knn__n_neighbors': 33},
  {'knn__n_neighbors': 34},
  {'knn__n_neighbors': 35},
  {'knn__n_neighbors': 36},
  {'knn__n_neighbors': 37},
  {'knn__n_neighbors': 38},
  {'knn__n_neighbors': 39},
  {'knn__n_neighbors': 40},
  {'knn__n_neighbors': 41},
  {'knn__n_neighbors': 42},
  {'knn__n_neighbors': 43},
  {'knn__n_neighbors': 44},
  {'knn__n_neighbors': 45},
  {'knn__n_neighbors': 46},
  {'knn__n_neighbors': 47},
  {'knn__n_neighbors': 48},
  {'knn__n_neighbors': 49}],
 'split0_test_score': array([0.87222222, 0.91851852, 0.97222222, 0.96666667, 0.97407407,
        0.97037037, 0.96481481, 0.97222222, 0.98148148, 0.98518519,
        0.98148148, 0.97962963, 0.97777778, 0.97592593, 0.97222222,
        0.96851852, 0.97407407, 0.97037037, 0.96666667, 0.97037037,
        0.97777778, 0.97777778, 0.97777778, 0.97777778, 0.97777778,
        0.98148148, 0.97777778, 0.97777778, 0.97777778, 0.97777778,
        0.97777778, 0.97777778, 0.97592593, 0.97037037, 0.96851852,
        0.96851852, 0.97592593, 0.97592593, 0.97592593, 0.97592593,
        0.97407407, 0.97407407, 0.97407407, 0.97407407, 0.97407407,
        0.97407407, 0.97407407, 0.97407407, 0.97777778]),
 'split1_test_score': array([0.97222222, 0.97222222, 0.97222222, 0.97222222, 0.97222222,
        1.        , 1.        , 0.99814815, 1.        , 1.        ,
        1.        , 1.        , 1.        , 1.        , 1.        ,
        1.        , 1.        , 1.        , 1.        , 1.        ,
        0.99814815, 0.99814815, 0.99814815, 0.99814815, 0.99814815,
        0.99814815, 0.99814815, 0.9962963 , 1.        , 1.        ,
        1.        , 1.        , 1.        , 1.        , 1.        ,
        1.        , 1.        , 1.        , 1.        , 1.        ,
        0.99814815, 0.99814815, 1.        , 1.        , 1.        ,
        1.        , 1.        , 1.        , 1.        ]),
 'split2_test_score': array([0.93333333, 0.95882353, 0.95686275, 0.95686275, 0.95686275,
        0.95294118, 0.95098039, 0.95098039, 0.94901961, 0.94705882,
        0.95294118, 0.95294118, 0.98039216, 0.99019608, 0.99215686,
        0.99019608, 0.98823529, 0.98823529, 0.98823529, 0.98627451,
        0.98627451, 0.98431373, 0.98039216, 0.97647059, 0.9745098 ,
        0.9745098 , 0.97254902, 0.97647059, 0.97647059, 0.98235294,
        0.97843137, 0.97843137, 0.97843137, 0.97647059, 0.97647059,
        0.9745098 , 0.9745098 , 0.97058824, 0.97058824, 0.96862745,
        0.96666667, 0.96666667, 0.9627451 , 0.96078431, 0.96078431,
        0.9627451 , 0.9627451 , 0.9627451 , 0.96078431]),
 'split3_test_score': array([0.90784314, 0.9372549 , 0.9372549 , 0.92941176, 0.92941176,
        0.95882353, 0.95882353, 0.96470588, 0.9627451 , 0.9627451 ,
        0.98627451, 0.98627451, 0.99215686, 0.99411765, 0.99019608,
        0.98823529, 0.98823529, 0.98627451, 0.98431373, 0.98235294,
        0.98039216, 0.97647059, 0.97058824, 0.96862745, 0.96862745,
        0.97254902, 0.97647059, 0.9745098 , 0.9745098 , 0.97058824,
        0.97647059, 0.9745098 , 0.9745098 , 0.97254902, 0.97254902,
        0.97058824, 0.96862745, 0.96862745, 0.96862745, 0.96862745,
        0.96862745, 0.9745098 , 0.9745098 , 0.97254902, 0.9745098 ,
        0.97647059, 0.97647059, 0.97647059, 0.97843137]),
 'split4_test_score': array([0.9372549 , 0.9372549 , 0.93333333, 0.92941176, 0.9254902 ,
        0.92352941, 0.94901961, 0.94705882, 0.94705882, 0.94705882,
        0.94705882, 0.94705882, 0.94705882, 0.94705882, 0.94705882,
        0.94705882, 0.97058824, 0.97058824, 0.97647059, 0.97647059,
        0.9745098 , 0.9745098 , 0.9745098 , 0.9745098 , 0.9745098 ,
        0.9745098 , 0.97254902, 0.9745098 , 0.97254902, 0.97254902,
        0.97254902, 0.97254902, 0.97254902, 0.97254902, 0.96862745,
        0.96862745, 0.96470588, 0.96470588, 0.96666667, 0.96470588,
        0.96666667, 0.96666667, 0.97058824, 0.97058824, 0.96862745,
        0.96666667, 0.96666667, 0.96470588, 0.96470588]),
 'mean_test_score': array([0.92457516, 0.94481481, 0.95437908, 0.95091503, 0.9516122 ,
        0.9611329 , 0.96472767, 0.96662309, 0.968061  , 0.96840959,
        0.9735512 , 0.97318083, 0.97947712, 0.98145969, 0.9803268 ,
        0.97880174, 0.98422658, 0.98309368, 0.98313725, 0.98309368,
        0.98342048, 0.98224401, 0.98028322, 0.97910675, 0.9787146 ,
        0.98023965, 0.97949891, 0.97991285, 0.98026144, 0.98065359,
        0.98104575, 0.98065359, 0.98028322, 0.9783878 , 0.97723312,
        0.9764488 , 0.97675381, 0.9759695 , 0.97636166, 0.97557734,
        0.9748366 , 0.97601307, 0.97638344, 0.97559913, 0.97559913,
        0.97599129, 0.97599129, 0.97559913, 0.97633987]),
 'std_test_score': array([0.03325211, 0.01872544, 0.01660774, 0.0182333 , 0.02064992,
        0.02482919, 0.01851883, 0.0182006 , 0.02016003, 0.02108635,
        0.02025305, 0.02012114, 0.01809218, 0.01894378, 0.01895695,
        0.01887616, 0.01067725, 0.01131942, 0.01120224, 0.01002235,
        0.00831217, 0.00860608, 0.00951598, 0.01002235, 0.01015531,
        0.00945584, 0.00955541, 0.00828506, 0.01002709, 0.0105113 ,
        0.00969438, 0.00990881, 0.01004308, 0.01098385, 0.01175462,
        0.01197354, 0.01230688, 0.01254751, 0.0122156 , 0.01273875,
        0.01196711, 0.01158181, 0.01253933, 0.01305082, 0.01316813,
        0.01298256, 0.01298256, 0.0132844 , 0.01373731]),
 'rank_test_score': array([49, 48, 45, 47, 46, 44, 43, 42, 41, 40, 38, 39, 18,  7, 11, 20,  1,
         4,  3,  4,  2,  6, 12, 19, 21, 15, 17, 16, 14,  9,  8,  9, 12, 22,
        23, 25, 24, 32, 27, 36, 37, 29, 26, 33, 33, 30, 30, 33, 28],
       dtype=int32)}
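
  • cv_results_ is easiest to digest as a DataFrame, e.g. to plot the mean CV score against k (a quick sketch):

cv_df = pd.DataFrame(grid_cv.cv_results_)
plt.plot(cv_df["param_knn__n_neighbors"].astype(int), cv_df["mean_test_score"])
plt.xlabel("n_neighbors")
plt.ylabel("mean CV roc_auc")
plt.show()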
18.1.1.3.4. Fit on the whole training set
  • After the search, grid_cv has already refit the best model on the whole training set for you (refit=True is GridSearchCV's default); the refit model is shown below.
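
  • The refit model and its CV score are exposed directly on grid_cv:

print(grid_cv.best_estimator_)  # the refit Pipeline (scaler + KNN with n_neighbors = 17)
print(grid_cv.best_score_)      # mean CV roc_auc of the best candidate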

18.1.1.3.5. Predict on the testing set
pred_label_test = grid_cv.predict(X_test)
pred_prob_test = grid_cv.predict_proba(X_test)
18.1.1.3.6. Evaluate performance
# evaluate
print(grid_cv.score(X_test, y_test)) # score uses the scoring passed to GridSearchCV, i.e. roc_auc here
print(confusion_matrix(y_test, pred_label_test))
print(classification_report(y_test, pred_label_test))
0.9426699426699426
[[31  6]
 [ 3 30]]
              precision    recall  f1-score   support

    democrat       0.91      0.84      0.87        37
  republican       0.83      0.91      0.87        33

    accuracy                           0.87        70
   macro avg       0.87      0.87      0.87        70
weighted avg       0.87      0.87      0.87        70

18.1.2. KNN (multi_class)

18.1.2.1. Load the dataset

  • Load the handwritten-digit dataset:

from sklearn import datasets
digits = datasets.load_digits()
  • The features x come in two forms:

    • images: a 3d-array of 1797 images, each 8 x 8.

    • data: each image flattened into 64 columns, giving a 1797 x 64 2d-array.

  • Check that the shapes really are as described:

print(digits.images.shape)
print(digits.data.shape)
(1797, 8, 8)
(1797, 64)
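
  • And data is indeed just images flattened row by row; a one-line sanity check (a sketch):

print(np.array_equal(digits.images.reshape(1797, 64), digits.data))  # True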
  • We can display one of the images:

plt.imshow(digits.images[1010], cmap=plt.cm.gray_r, interpolation='nearest');
../_images/datacamp_su_66_0.png
  • The task is a familiar one: given an image, recognize which digit (0–9) it shows.

18.1.2.2. fit model & predict

# split the data
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(
    digits.data, 
    digits.target, 
    test_size = 0.2, 
    random_state = 42, 
    stratify = digits.target
)

# fit model
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors = 6)
knn.fit(X_train, y_train)

# predict on the training/testing set
pred_label_train = knn.predict(X_train)
pred_prob_train = knn.predict_proba(X_train)

pred_label_test = knn.predict(X_test)
pred_prob_test = knn.predict_proba(X_test)

# evaluate
print(knn.score(X_test, y_test))
print(confusion_matrix(y_test, pred_label_test))
print(classification_report(y_test, pred_label_test))
0.9805555555555555
[[36  0  0  0  0  0  0  0  0  0]
 [ 0 36  0  0  0  0  0  0  0  0]
 [ 0  0 35  0  0  0  0  0  0  0]
 [ 0  0  0 37  0  0  0  0  0  0]
 [ 0  0  0  0 36  0  0  0  0  0]
 [ 0  0  0  0  0 37  0  0  0  0]
 [ 0  0  0  0  0  0 35  0  1  0]
 [ 0  0  0  0  0  0  0 36  0  0]
 [ 0  3  0  0  0  0  0  1 31  0]
 [ 0  0  0  0  1  0  0  0  1 34]]
              precision    recall  f1-score   support

           0       1.00      1.00      1.00        36
           1       0.92      1.00      0.96        36
           2       1.00      1.00      1.00        35
           3       1.00      1.00      1.00        37
           4       0.97      1.00      0.99        36
           5       1.00      1.00      1.00        37
           6       1.00      0.97      0.99        36
           7       0.97      1.00      0.99        36
           8       0.94      0.89      0.91        35
           9       1.00      0.94      0.97        36

    accuracy                           0.98       360
   macro avg       0.98      0.98      0.98       360
weighted avg       0.98      0.98      0.98       360

18.1.2.3. overfitting exploration

  • Finally, something new: let's look at overfitting.

neighbors = np.arange(1, 9) # k for KNN, from complex (k = 1) to simple (k = 8)
train_accuracy = np.empty(len(neighbors)) # placeholders for the scores
test_accuracy = np.empty(len(neighbors))

for i, k in enumerate(neighbors):
    
    knn = KNeighborsClassifier(n_neighbors = k)
    knn.fit(X_train, y_train)
    
    train_accuracy[i] = knn.score(X_train, y_train)
    test_accuracy[i] = knn.score(X_test, y_test)
  • Plot it:

fig, ax = plt.subplots()
ax.plot(neighbors, train_accuracy, label = "train");
ax.plot(neighbors, test_accuracy, label = "test");
ax.legend();
ax.set(
    xlabel='Number of Neighbors', 
    ylabel='Accuracy',
    title='k-NN: Varying Number of Neighbors'
);
../_images/datacamp_su_74_0.png
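
  • Note that choosing k by test-set accuracy leaks the test set into model selection; a sketch of the same comparison using cross-validation on the training set only:

from sklearn.model_selection import cross_val_score

cv_scores = [
    cross_val_score(KNeighborsClassifier(n_neighbors = k), X_train, y_train, cv = 5).mean()
    for k in neighbors
]
print(neighbors[np.argmax(cv_scores)])  # k with the best mean CV accuracy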

18.1.3. Logistic (binary)

18.1.3.1. Load the dataset

diabetes = pd.read_csv("data/diabetes.csv")
diabetes
pregnancies glucose diastolic triceps insulin bmi dpf age diabetes
0 6 148 72 35 0 33.6 0.627 50 1
1 1 85 66 29 0 26.6 0.351 31 0
2 8 183 64 0 0 23.3 0.672 32 1
3 1 89 66 23 94 28.1 0.167 21 0
4 0 137 40 35 168 43.1 2.288 33 1
... ... ... ... ... ... ... ... ... ...
763 10 101 76 48 180 32.9 0.171 63 0
764 2 122 70 27 0 36.8 0.340 27 0
765 5 121 72 23 112 26.2 0.245 30 0
766 1 126 60 0 0 30.1 0.349 47 1
767 1 93 70 31 0 30.4 0.315 23 0

768 rows × 9 columns

diabetes.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 768 entries, 0 to 767
Data columns (total 9 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   pregnancies  768 non-null    int64  
 1   glucose      768 non-null    int64  
 2   diastolic    768 non-null    int64  
 3   triceps      768 non-null    int64  
 4   insulin      768 non-null    int64  
 5   bmi          768 non-null    float64
 6   dpf          768 non-null    float64
 7   age          768 non-null    int64  
 8   diabetes     768 non-null    int64  
dtypes: float64(2), int64(7)
memory usage: 54.1 KB
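
  • info() reports no missing values, but several columns (glucose, diastolic, triceps, insulin, bmi) contain zeros that are physiologically implausible and most likely encode missing data; handling them is not part of the original run, but a sketch with the SimpleImputer that the next cell imports and leaves unused could look like this:

from sklearn.impute import SimpleImputer

cols = ["glucose", "diastolic", "triceps", "insulin", "bmi"]
print((diabetes[cols] == 0).sum())   # count the suspicious zeros per column

imputer = SimpleImputer(missing_values = 0, strategy = "median")
diabetes_imputed = diabetes.copy()
diabetes_imputed[cols] = imputer.fit_transform(diabetes[cols])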

18.1.3.2. fit, predict, and evaluate

# split the data
from sklearn.model_selection import train_test_split

# preprocessing (not used here)
# from sklearn.impute import SimpleImputer
# from sklearn.pipeline import Pipeline

# modeling
from sklearn.linear_model import LogisticRegression

# hyper-parameter tuning
from sklearn.model_selection import GridSearchCV

# evaluation
from sklearn.metrics import confusion_matrix, classification_report, roc_curve, roc_auc_score


# split the data
X = diabetes.drop("diabetes", axis = 1)
y = diabetes["diabetes"]
X_train, X_test, y_train, y_test = train_test_split(
    X, 
    y, 
    test_size = 0.4, 
    random_state = 42, 
    stratify = y
)

# model
logreg = LogisticRegression()

# hyper-parameter tuning
# the regularization parameter of logistic regression is C, the inverse of the regularization strength,
# so a larger C means weaker regularization and a model that is easier to overfit;
# a smaller C means stronger regularization and a model that is easier to underfit
c_space = np.logspace(-5, 8, 15)
param_grid = {"C": c_space, "penalty": ['l1', 'l2']}
logreg_cv = GridSearchCV(logreg, param_grid, cv=5)
logreg_cv.fit(X_train, y_train)
print("Tuned Logistic Regression Parameters: {}".format(logreg_cv.best_params_)) 
print("Best score is {}".format(logreg_cv.best_score_))

# predict on the testing set
y_pred = logreg_cv.predict(X_test)
y_pred_prob = logreg_cv.predict_proba(X_test)[:,1]


# evaluate
print("AUC: {}".format(roc_auc_score(y_test, y_pred_prob)))
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))

fpr, tpr, thresholds = roc_curve(y_test, y_pred_prob)
plt.plot([0,1],[0, 1], "k--")
plt.plot(fpr, tpr, label = "Logistic Regression")
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("Logistic Regression ROC Curve")
plt.show()

print(roc_auc_score(y_test, y_pred)) # note: AUC from the hard 0/1 predictions, not the probabilities
/Volumes/GoogleDrive/我的雲端硬碟/0. codepool_python/python_ds/python_ds_env/lib/python3.8/site-packages/sklearn/linear_model/_logistic.py:814: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
(the same ConvergenceWarning repeats for each CV fit; the repetitions are omitted here)
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
/Volumes/GoogleDrive/我的雲端硬碟/0. codepool_python/python_ds/python_ds_env/lib/python3.8/site-packages/sklearn/linear_model/_logistic.py:814: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
/Volumes/GoogleDrive/我的雲端硬碟/0. codepool_python/python_ds/python_ds_env/lib/python3.8/site-packages/sklearn/linear_model/_logistic.py:814: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
/Volumes/GoogleDrive/我的雲端硬碟/0. codepool_python/python_ds/python_ds_env/lib/python3.8/site-packages/sklearn/linear_model/_logistic.py:814: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
/Volumes/GoogleDrive/我的雲端硬碟/0. codepool_python/python_ds/python_ds_env/lib/python3.8/site-packages/sklearn/linear_model/_logistic.py:814: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
/Volumes/GoogleDrive/我的雲端硬碟/0. codepool_python/python_ds/python_ds_env/lib/python3.8/site-packages/sklearn/linear_model/_logistic.py:814: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
/Volumes/GoogleDrive/我的雲端硬碟/0. codepool_python/python_ds/python_ds_env/lib/python3.8/site-packages/sklearn/linear_model/_logistic.py:814: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
/Volumes/GoogleDrive/我的雲端硬碟/0. codepool_python/python_ds/python_ds_env/lib/python3.8/site-packages/sklearn/linear_model/_logistic.py:814: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
/Volumes/GoogleDrive/我的雲端硬碟/0. codepool_python/python_ds/python_ds_env/lib/python3.8/site-packages/sklearn/linear_model/_logistic.py:814: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
/Volumes/GoogleDrive/我的雲端硬碟/0. codepool_python/python_ds/python_ds_env/lib/python3.8/site-packages/sklearn/linear_model/_logistic.py:814: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
/Volumes/GoogleDrive/我的雲端硬碟/0. codepool_python/python_ds/python_ds_env/lib/python3.8/site-packages/sklearn/linear_model/_logistic.py:814: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
/Volumes/GoogleDrive/我的雲端硬碟/0. codepool_python/python_ds/python_ds_env/lib/python3.8/site-packages/sklearn/linear_model/_logistic.py:814: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
/Volumes/GoogleDrive/我的雲端硬碟/0. codepool_python/python_ds/python_ds_env/lib/python3.8/site-packages/sklearn/linear_model/_logistic.py:814: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
/Volumes/GoogleDrive/我的雲端硬碟/0. codepool_python/python_ds/python_ds_env/lib/python3.8/site-packages/sklearn/linear_model/_logistic.py:814: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
/Volumes/GoogleDrive/我的雲端硬碟/0. codepool_python/python_ds/python_ds_env/lib/python3.8/site-packages/sklearn/linear_model/_logistic.py:814: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
/Volumes/GoogleDrive/我的雲端硬碟/0. codepool_python/python_ds/python_ds_env/lib/python3.8/site-packages/sklearn/linear_model/_logistic.py:814: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
/Volumes/GoogleDrive/我的雲端硬碟/0. codepool_python/python_ds/python_ds_env/lib/python3.8/site-packages/sklearn/linear_model/_logistic.py:814: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
/Volumes/GoogleDrive/我的雲端硬碟/0. codepool_python/python_ds/python_ds_env/lib/python3.8/site-packages/sklearn/linear_model/_logistic.py:814: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
/Volumes/GoogleDrive/我的雲端硬碟/0. codepool_python/python_ds/python_ds_env/lib/python3.8/site-packages/sklearn/linear_model/_logistic.py:814: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
/Volumes/GoogleDrive/我的雲端硬碟/0. codepool_python/python_ds/python_ds_env/lib/python3.8/site-packages/sklearn/linear_model/_logistic.py:814: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
/Volumes/GoogleDrive/我的雲端硬碟/0. codepool_python/python_ds/python_ds_env/lib/python3.8/site-packages/sklearn/linear_model/_logistic.py:814: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
/Volumes/GoogleDrive/我的雲端硬碟/0. codepool_python/python_ds/python_ds_env/lib/python3.8/site-packages/sklearn/linear_model/_logistic.py:814: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
/Volumes/GoogleDrive/我的雲端硬碟/0. codepool_python/python_ds/python_ds_env/lib/python3.8/site-packages/sklearn/linear_model/_logistic.py:814: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
/Volumes/GoogleDrive/我的雲端硬碟/0. codepool_python/python_ds/python_ds_env/lib/python3.8/site-packages/sklearn/linear_model/_logistic.py:814: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
/Volumes/GoogleDrive/我的雲端硬碟/0. codepool_python/python_ds/python_ds_env/lib/python3.8/site-packages/sklearn/linear_model/_logistic.py:814: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
/Volumes/GoogleDrive/我的雲端硬碟/0. codepool_python/python_ds/python_ds_env/lib/python3.8/site-packages/sklearn/linear_model/_logistic.py:814: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
/Volumes/GoogleDrive/我的雲端硬碟/0. codepool_python/python_ds/python_ds_env/lib/python3.8/site-packages/sklearn/linear_model/_logistic.py:814: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
/Volumes/GoogleDrive/我的雲端硬碟/0. codepool_python/python_ds/python_ds_env/lib/python3.8/site-packages/sklearn/linear_model/_logistic.py:814: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
/Volumes/GoogleDrive/我的雲端硬碟/0. codepool_python/python_ds/python_ds_env/lib/python3.8/site-packages/sklearn/linear_model/_logistic.py:814: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
/Volumes/GoogleDrive/我的雲端硬碟/0. codepool_python/python_ds/python_ds_env/lib/python3.8/site-packages/sklearn/linear_model/_logistic.py:814: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
/Volumes/GoogleDrive/我的雲端硬碟/0. codepool_python/python_ds/python_ds_env/lib/python3.8/site-packages/sklearn/linear_model/_logistic.py:814: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
/Volumes/GoogleDrive/我的雲端硬碟/0. codepool_python/python_ds/python_ds_env/lib/python3.8/site-packages/sklearn/linear_model/_logistic.py:814: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
/Volumes/GoogleDrive/我的雲端硬碟/0. codepool_python/python_ds/python_ds_env/lib/python3.8/site-packages/sklearn/linear_model/_logistic.py:814: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
/Volumes/GoogleDrive/我的雲端硬碟/0. codepool_python/python_ds/python_ds_env/lib/python3.8/site-packages/sklearn/linear_model/_logistic.py:814: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
/Volumes/GoogleDrive/我的雲端硬碟/0. codepool_python/python_ds/python_ds_env/lib/python3.8/site-packages/sklearn/linear_model/_logistic.py:814: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
/Volumes/GoogleDrive/我的雲端硬碟/0. codepool_python/python_ds/python_ds_env/lib/python3.8/site-packages/sklearn/linear_model/_logistic.py:814: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
/Volumes/GoogleDrive/我的雲端硬碟/0. codepool_python/python_ds/python_ds_env/lib/python3.8/site-packages/sklearn/model_selection/_validation.py:372: FitFailedWarning: 
75 fits failed out of a total of 150.
The score on these train-test partitions for these parameters will be set to nan.
If these failures are not expected, you can try to debug them by setting error_score='raise'.

Below are more details about the failures:
--------------------------------------------------------------------------------
75 fits failed with the following error:
Traceback (most recent call last):
  File "/Volumes/GoogleDrive/我的雲端硬碟/0. codepool_python/python_ds/python_ds_env/lib/python3.8/site-packages/sklearn/model_selection/_validation.py", line 680, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "/Volumes/GoogleDrive/我的雲端硬碟/0. codepool_python/python_ds/python_ds_env/lib/python3.8/site-packages/sklearn/linear_model/_logistic.py", line 1461, in fit
    solver = _check_solver(self.solver, self.penalty, self.dual)
  File "/Volumes/GoogleDrive/我的雲端硬碟/0. codepool_python/python_ds/python_ds_env/lib/python3.8/site-packages/sklearn/linear_model/_logistic.py", line 447, in _check_solver
    raise ValueError(
ValueError: Solver lbfgs supports only 'l2' or 'none' penalties, got l1 penalty.

  warnings.warn(some_fits_failed_message, FitFailedWarning)
/Volumes/GoogleDrive/我的雲端硬碟/0. codepool_python/python_ds/python_ds_env/lib/python3.8/site-packages/sklearn/model_selection/_search.py:969: UserWarning: One or more of the test scores are non-finite: [       nan 0.72391304        nan 0.7673913         nan 0.77608696
        nan 0.7826087         nan 0.77391304        nan 0.78913043
        nan 0.78043478        nan 0.78043478        nan 0.77391304
        nan 0.77391304        nan 0.76956522        nan 0.77391304
        nan 0.7673913         nan 0.77608696        nan 0.77391304]
  warnings.warn(
Tuned Logistic Regression Parameters: {'C': 0.4393970560760795, 'penalty': 'l2'}
Best score is 0.7891304347826086
[[169  32]
 [ 47  60]]
              precision    recall  f1-score   support

           0       0.78      0.84      0.81       201
           1       0.65      0.56      0.60       107

    accuracy                           0.74       308
   macro avg       0.72      0.70      0.71       308
weighted avg       0.74      0.74      0.74       308
../_images/datacamp_su_80_2.png
0.7007718417259498
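  • Note that half of the grid's test scores are nan: the lbfgs solver supports only the 'l2' (or no) penalty, so every 'l1' candidate failed to fit, exactly as the FitFailedWarning above reports. Switching to a solver that supports l1 (e.g. liblinear or saga) would make those candidates viable.
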
  • precision & recall

    • precision = of the cases you call positive, how many are truly positive

    • recall = true positive rate = of the truly positive cases, how many did you catch

    • As the cutpoint → 1, precision → 1: the model only calls a case positive when it is almost certain, so the few positive calls it makes are almost always right.

    • As the cutpoint → 0, recall → 1: with a cutpoint near 0 almost everyone is called positive, so nearly every true positive lands in the numerator of recall. (See the sketch below.)
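
  • To see this trade-off numerically, here is a minimal sketch using precision_recall_curve, assuming the y_test and y_pred_prob from the logistic regression cell above (labels encoded as 0/1):

from sklearn.metrics import precision_recall_curve

# precision/recall at every candidate cutpoint
precision, recall, thresholds = precision_recall_curve(y_test, y_pred_prob)

# precision and recall are one element longer than thresholds, so drop the last point
plt.plot(thresholds, precision[:-1], label="precision")  # cutpoint -> 1: precision -> 1
plt.plot(thresholds, recall[:-1], label="recall")        # cutpoint -> 0: recall -> 1
plt.xlabel("cutpoint (threshold)")
plt.legend()
plt.show()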

18.1.4. DecisionTree (binary)

# train/test split
from sklearn.model_selection import train_test_split

# preprocessing (not used here)
# from sklearn.impute import SimpleImputer
# from sklearn.pipeline import Pipeline

# modeling
from sklearn.tree import DecisionTreeClassifier

# hyper-parameter tuning
from sklearn.model_selection import RandomizedSearchCV

# evaluation
from sklearn.metrics import confusion_matrix, classification_report, roc_curve, roc_auc_score

# utils
from scipy.stats import randint



# read the data & split
diabetes = pd.read_csv("data/diabetes.csv")
X = diabetes.drop("diabetes", axis = 1)
y = diabetes["diabetes"]
X_train, X_test, y_train, y_test = train_test_split(
    X, 
    y, 
    test_size = 0.4, 
    random_state = 42, 
    stratify = y
)

# model
tree = DecisionTreeClassifier()

# hyper-parameter tuning
param_dist = {"max_depth": [3, None],
              "max_features": randint(1, 9),
              "min_samples_leaf": randint(1, 9),
              "criterion": ["gini", "entropy"]}
tree_randomcv = RandomizedSearchCV(tree, param_dist, cv=5)
tree_randomcv.fit(X_train, y_train)
print("Tuned Decision Tree Parameters: {}".format(tree_randomcv.best_params_))
print("Best score is {}".format(tree_randomcv.best_score_))

# predict on the testing set
y_pred = tree_randomcv.predict(X_test)
y_pred_prob = tree_randomcv.predict_proba(X_test)[:,1]


# evaluate the results
print("AUC: {}".format(roc_auc_score(y_test, y_pred_prob)))
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))

fpr, tpr, thresholds = roc_curve(y_test, y_pred_prob)
plt.plot([0,1],[0, 1], "k--")
plt.plot(fpr, tpr, label = "Decision Tree")
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("Logistic Regression ROC Curve")
plt.show()
Tuned Decision Tree Parameters: {'criterion': 'entropy', 'max_depth': 3, 'max_features': 5, 'min_samples_leaf': 5}
Best score is 0.7608695652173914
AUC: 0.7851629701957501
[[174  27]
 [ 57  50]]
              precision    recall  f1-score   support

           0       0.75      0.87      0.81       201
           1       0.65      0.47      0.54       107

    accuracy                           0.73       308
   macro avg       0.70      0.67      0.67       308
weighted avg       0.72      0.73      0.71       308
../_images/datacamp_su_83_1.png
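
  • After tuning, RandomizedSearchCV refits the best tree on the whole training set and exposes it as best_estimator_; a quick sketch (an extra step, not in the original run) for inspecting it with plot_tree (the class_names labels below are just illustrative):

from sklearn.tree import plot_tree

best_tree = tree_randomcv.best_estimator_   # the refit best tree

# draw the fitted tree with the diabetes feature names
plt.figure(figsize = (12, 6))
plot_tree(best_tree,
          feature_names = list(X.columns),
          class_names = ["no diabetes", "diabetes"],  # assuming 0 = no diabetes, 1 = diabetes
          filled = True)
plt.show()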

18.1.5. SVC (binary)

# Import necessary modules
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# read the data
vote_raw = pd.read_csv("data/house-votes-84.csv")
vote = vote_raw.copy()
col_names = ['party', 'infants', 'water', 'budget', 'physician', 'salvador',
       'religious', 'satellite', 'aid', 'missile', 'immigration', 'synfuels',
       'education', 'superfund', 'crime', 'duty_free_exports', 'eaa_rsa']
vote.columns = col_names
vote[vote == "?"] = np.nan # 把 ? 改成 na
for i in col_names[1:]:
    vote[i] = vote[i].replace({"y": 1, "n": 0})
X = vote.drop("party", axis = 1)
y = vote["party"]

# split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 42)

# pipeline: preprocessing & model
steps = [('imputation', SimpleImputer(missing_values=np.nan, strategy='most_frequent')),
         ("scaler", StandardScaler()),
        ('SVM', SVC())]

# Create the pipeline: pipeline
pipeline = Pipeline(steps)


# hyper-parameter tuning
parameters = {'SVM__C':[1, 10, 100],
              'SVM__gamma':[0.1, 0.01]}
cv = GridSearchCV(pipeline, parameters, cv = 3)
cv.fit(X_train, y_train)
print("Tuned Model Parameters: {}".format(cv.best_params_))

# predict
y_pred = cv.predict(X_test)

# performance
print("Accuracy: {}".format(cv.score(X_test, y_test)))
print(classification_report(y_test, y_pred))
Tuned Model Parameters: {'SVM__C': 10, 'SVM__gamma': 0.01}
Accuracy: 0.9541984732824428
              precision    recall  f1-score   support

    democrat       0.99      0.94      0.96        83
  republican       0.90      0.98      0.94        48

    accuracy                           0.95       131
   macro avg       0.95      0.96      0.95       131
weighted avg       0.96      0.95      0.95       131
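
  • A side note on the parameter names: inside a Pipeline, tunable parameters are addressed as <step name>__<parameter>, which is where SVM__C and SVM__gamma above come from. To list every name a pipeline exposes, something like:

# every tunable name, e.g. 'SVM__C', 'SVM__gamma', 'imputation__strategy'
print(sorted(pipeline.get_params().keys()))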

18.2. Regression

18.2.1. Gapminder Data

import pandas as pd
gapminder = pd.read_csv("data/gm_2008_region.csv")
gapminder.head()
population fertility HIV CO2 BMI_male GDP BMI_female life child_mortality Region
0 34811059.0 2.73 0.1 3.328945 24.59620 12314.0 129.9049 75.3 29.5 Middle East & North Africa
1 19842251.0 6.43 2.0 1.474353 22.25083 7103.0 130.1247 58.3 192.0 Sub-Saharan Africa
2 40381860.0 2.24 0.5 4.785170 27.50170 14646.0 118.8915 75.5 15.4 America
3 2975029.0 1.40 0.1 1.804106 25.35542 7383.0 132.8108 72.5 20.0 Europe & Central Asia
4 21370348.0 1.96 0.1 18.016313 27.56373 41312.0 117.3755 81.5 5.2 East Asia & Pacific
gapminder.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 139 entries, 0 to 138
Data columns (total 10 columns):
 #   Column           Non-Null Count  Dtype  
---  ------           --------------  -----  
 0   population       139 non-null    float64
 1   fertility        139 non-null    float64
 2   HIV              139 non-null    float64
 3   CO2              139 non-null    float64
 4   BMI_male         139 non-null    float64
 5   GDP              139 non-null    float64
 6   BMI_female       139 non-null    float64
 7   life             139 non-null    float64
 8   child_mortality  139 non-null    float64
 9   Region           139 non-null    object 
dtypes: float64(9), object(1)
memory usage: 11.0+ KB

18.2.2. Linear Regression

18.2.2.1. train/test

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# preprocess
# gapminder_onehot = pd.get_dummies(gapminder)
gapminder_dummy = pd.get_dummies(gapminder, drop_first = True)

# split the data
X = gapminder_dummy.drop("life", axis = 1)
y = gapminder_dummy["life"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state=42)

# fit model
reg_all = LinearRegression()
reg_all.fit(X_train, y_train)

# Predict on the test data: y_pred
y_pred = reg_all.predict(X_test)

# Compute and print R^2 and RMSE
print("R^2: {}".format(reg_all.score(X_test, y_test)))
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
print("Root Mean Squared Error: {}".format(rmse))
R^2: 0.8219419939587727
Root Mean Squared Error: 3.405248115733344
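
  • A note on drop_first: pd.get_dummies(gapminder) would create one indicator column per Region level, which is collinear with the model's intercept; drop_first = True drops one level as the baseline. A quick check of the shapes (the exact counts depend on the number of Region levels):

print(pd.get_dummies(gapminder).shape)                    # all Region dummies
print(pd.get_dummies(gapminder, drop_first = True).shape) # one column fewer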

18.2.2.2. CV

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# split the data
# (not this time: we pass X and y straight to cross-validation below)

# fit model
reg = LinearRegression()
cv_scores = cross_val_score(reg, X, y, cv = 5)

print(cv_scores) # the 5-fold CV results

print(f"Average 5-Fold CV Score: {str(np.mean(cv_scores))}")
[0.8196741  0.80301541 0.89758712 0.80425614 0.94015848]
Average 5-Fold CV Score: 0.8529382494240787
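
  • cross_val_score uses the regressor's default scorer (R²) here; to cross-validate RMSE instead, a sketch with the built-in neg_mean_squared_error scorer:

mse_scores = cross_val_score(reg, X, y, cv = 5, scoring = "neg_mean_squared_error")
rmse_scores = np.sqrt(-mse_scores)  # scores are negated MSEs, so flip the sign first
print(rmse_scores)
print(f"Average 5-Fold CV RMSE: {np.mean(rmse_scores)}")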

18.2.3. Lasso Regression

# Import Lasso
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline

# pipeline
lasso_pipe = make_pipeline(
    StandardScaler(),
    Lasso(alpha = 0.4)
)

# Fit the regressor to the data
lasso_pipe.fit(X_train, y_train)

y_pred = lasso_pipe.predict(X_test)

# Compute and print R^2 and RMSE
print("R^2: {}".format(lasso_pipe.score(X_test, y_test)))
rmse = np.sqrt(mean_squared_error(y_test, y_pred))
print("Root Mean Squared Error: {}".format(rmse))
R^2: 0.8401265152705228
Root Mean Squared Error: 3.2266824659707334
  • Check which variables were shrunk to 0 and which matter most (a name-aligned view follows the plot below):

# Compute and print the coefficients
lasso_coef = lasso_pipe.named_steps["lasso"].coef_
print(lasso_coef)

# Plot the coefficients

plt.plot(range(len(X_train.columns)), lasso_coef)
plt.xticks(range(len(X_train.columns)), X_train.columns.values, rotation=60)
plt.margins(0.02)
plt.show()
[-0.         -0.08591553 -2.91968634 -0.          0.58693244  1.6922106
 -1.11083667 -4.3362549  -0.48746711  0.          0.          0.
 -0.1705074 ]
../_images/datacamp_su_98_1.png
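
  • The raw coefficient array is hard to read on its own; one extra step (not in the original) lines the coefficients up with their column names:

coef_by_name = pd.Series(lasso_coef, index = X_train.columns)
print(coef_by_name[coef_by_name != 0].sort_values())  # the features that survived the L1 shrinkage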

18.2.4. Ridge Regression

def display_plot(cv_scores, cv_scores_std):
    fig = plt.figure()
    ax = fig.add_subplot(1,1,1)
    ax.plot(alpha_space, cv_scores)

    std_error = cv_scores_std / np.sqrt(10)

    ax.fill_between(alpha_space, cv_scores + std_error, cv_scores - std_error, alpha=0.2)
    ax.set_ylabel('CV Score +/- Std Error')
    ax.set_xlabel('Alpha')
    ax.axhline(np.max(cv_scores), linestyle='--', color='.5')
    ax.set_xlim([alpha_space[0], alpha_space[-1]])
    ax.set_xscale('log')
    plt.show()
# Import necessary modules
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

# Setup the array of alphas and lists to store scores
alpha_space = np.logspace(-4, 0, 50)
ridge_scores = []
ridge_scores_std = []

# Compute scores over the range of alphas
for alpha in alpha_space:

    # Build a fresh scaled Ridge pipeline for this alpha
    # (Ridge(normalize=True) is deprecated, so scale with StandardScaler instead)
    ridge = make_pipeline(
        StandardScaler(),
        Ridge(alpha = alpha)
    )

    # Perform 10-fold CV: ridge_cv_scores
    ridge_cv_scores = cross_val_score(ridge, X, y, cv=10)
    
    # Append the mean of ridge_cv_scores to ridge_scores
    ridge_scores.append(np.mean(ridge_cv_scores))
    
    # Append the std of ridge_cv_scores to ridge_scores_std
    ridge_scores_std.append(np.std(ridge_cv_scores))

# Display the plot
display_plot(ridge_scores, ridge_scores_std)
../_images/datacamp_su_101_0.png
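
  • To read the best alpha off the stored scores rather than the plot:

best_alpha = alpha_space[np.argmax(ridge_scores)]
print(f"Best alpha: {best_alpha}, mean 10-fold CV score: {np.max(ridge_scores)}")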

18.2.5. Elastic net

# Import necessary modules
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.linear_model import ElasticNet
from sklearn.metrics import mean_squared_error

# read & split the data
gapminder = pd.read_csv("data/gm_2008_region.csv")
gapminder_dummy = pd.get_dummies(gapminder, drop_first = True)  # re-create the dummies so the cell runs on its own
X = gapminder_dummy.drop("life", axis = 1)
y = gapminder_dummy["life"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.4, random_state = 42)

# pipeline: preprocessing & model
steps = [('imputation', SimpleImputer(missing_values=np.nan, strategy='mean')),
         ("scaler", StandardScaler()),
         ("elasticnet", ElasticNet())]
pipeline = Pipeline(steps)

# hyper-parameter tuning
parameters = {"elasticnet__l1_ratio":np.linspace(0,1,30)} # penalty = alpha * (l1_ratio * L1 + 0.5 * (1 - l1_ratio) * L2^2)
gm_cv = GridSearchCV(pipeline, parameters, cv = 5)
gm_cv.fit(X_train, y_train)
print("Tuned ElasticNet l1 ratio: {}".format(gm_cv.best_params_))

# Predict
y_pred = gm_cv.predict(X_test)

# performance
r2 = gm_cv.score(X_test, y_test)
mse = mean_squared_error(y_test, y_pred)
print("Tuned ElasticNet R squared: {}".format(r2))
print("Tuned ElasticNet MSE: {}".format(mse))
/Volumes/GoogleDrive/我的雲端硬碟/0. codepool_python/python_ds/python_ds_env/lib/python3.8/site-packages/sklearn/linear_model/_coordinate_descent.py:647: ConvergenceWarning: Objective did not converge. You might want to increase the number of iterations, check the scale of the features or consider increasing regularisation. Duality gap: 8.126e+02, tolerance: 5.589e-01 Linear regression models with null weight for the l1 regularization term are more efficiently fitted using one of the solvers implemented in sklearn.linear_model.Ridge/RidgeCV instead.
  model = cd_fast.enet_coordinate_descent(
(similar ConvergenceWarnings for the remaining CV folds omitted)
Tuned ElasticNet l1 ratio: {'elasticnet__l1_ratio': 1.0}
Tuned ElasticNet R squared: 0.8862016549771035
Tuned ElasticNet MSE: 8.594868215979249
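
  • The tuned l1_ratio is 1.0, i.e. the penalty is pure L1, so the selected model is effectively the Lasso from the previous section.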