
Study Notes (Precision and Recall)

KloudHyun 2023. 9. 30. 20:10

📌 Precision, Recall

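🔻Precision and recall can be tuned by force

How?

-> Change the decision threshold by hand (though raising one metric generally lowers the other, so it is doubtful that overall performance improves).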
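To make the idea concrete, here is a minimal sketch of what changing the threshold does. The probabilities and labels are made-up values for illustration only, and `precision_recall_at` is a hypothetical helper, not part of scikit-learn.

```python
# Hypothetical positive-class probabilities and true labels (illustrative values only)
proba = [0.95, 0.80, 0.65, 0.55, 0.45, 0.30]
y_true = [1, 1, 0, 1, 0, 0]

def precision_recall_at(threshold):
    """Predict 1 when the probability exceeds the threshold, then score the result."""
    y_hat = [1 if p > threshold else 0 for p in proba]
    tp = sum(1 for h, t in zip(y_hat, y_true) if h == 1 and t == 1)
    fp = sum(1 for h, t in zip(y_hat, y_true) if h == 1 and t == 0)
    fn = sum(1 for h, t in zip(y_hat, y_true) if h == 0 and t == 1)
    return tp / (tp + fp), tp / (tp + fn)

p_default, r_default = precision_recall_at(0.5)
p_strict, r_strict = precision_recall_at(0.7)
print(p_default, r_default)  # 0.75 1.0
print(p_strict, r_strict)    # 1.0 0.6666666666666666
```

On this toy data, raising the threshold from 0.5 to 0.7 removes the one false positive (precision goes up) but also drops one true positive (recall goes down).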

 

📌 Splitting the data and predicting

import pandas as pd
from sklearn.model_selection import train_test_split

red_wine = pd.read_csv('../data/winequality-red.csv', sep=';') 
white_wine = pd.read_csv('../data/winequality-white.csv', sep=';') 

# Label each wine by color (1 = red, 0 = white)
red_wine['color'] = 1.
white_wine['color'] = 0.

# Combine red_wine and white_wine
wine = pd.concat([red_wine, white_wine])

# Turn wine quality into a binary label: 1 if quality > 5, else 0
wine['taste'] = [1. if grade > 5 else 0. for grade in wine['quality']]

# Separate features and target
X = wine.drop(['taste', 'quality'], axis=1)
y = wine['taste']

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=13)

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# Fit a logistic regression model
lr = LogisticRegression(solver='liblinear', random_state=13)
lr.fit(X_train, y_train)

# Predict on the training and test sets
y_pred_tr = lr.predict(X_train)
y_pred_test = lr.predict(X_test)

print('Train Acc : ', accuracy_score(y_train, y_pred_tr))
print('test Acc : ', accuracy_score(y_test, y_pred_test))
>>>>
Train Acc :  0.7427361939580527
test Acc :  0.7438461538461538

📌 Inspecting the prediction results

from sklearn.metrics import classification_report

print(classification_report(y_test, lr.predict(X_test)))
>>>>
              precision    recall  f1-score   support

         0.0       0.68      0.58      0.62       477
         1.0       0.77      0.84      0.81       823

    accuracy                           0.74      1300
   macro avg       0.73      0.71      0.71      1300
weighted avg       0.74      0.74      0.74      1300

from sklearn.metrics import confusion_matrix

confusion_matrix(y_test, lr.predict(X_test))
>>>>
array([[275, 202],
       [131, 692]], dtype=int64)
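
As a sanity check, the numbers in the classification report can be recomputed by hand from this confusion matrix. The sketch below assumes scikit-learn's layout (rows are actual classes, columns are predicted classes) and uses the values copied from the output above.

```python
# Confusion matrix copied from the output above (rows: actual, columns: predicted)
cm = [[275, 202],
      [131, 692]]
(tn, fp), (fn, tp) = cm

# Class 1 (taste = 1.0)
precision_1 = tp / (tp + fp)   # 692 / 894
recall_1 = tp / (tp + fn)      # 692 / 823

# Class 0 (taste = 0.0)
precision_0 = tn / (tn + fn)   # 275 / 406
recall_0 = tn / (tn + fp)      # 275 / 477

accuracy = (tp + tn) / (tn + fp + fn + tp)

print(round(precision_1, 2), round(recall_1, 2))  # 0.77 0.84
print(round(precision_0, 2), round(recall_0, 2))  # 0.68 0.58
print(round(accuracy, 2))                         # 0.74
```

These match the precision, recall, and accuracy rows printed by classification_report.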

📌 Plotting the curves

import matplotlib.pyplot as plt
from sklearn.metrics import precision_recall_curve
%matplotlib inline

plt.figure(figsize=(10, 8))
# predict_proba returns one probability per class, so slice column 1 to get the probability of class 1
pred = lr.predict_proba(X_test)[:, 1]

precisions, recalls, thresholds = precision_recall_curve(y_test, pred)
plt.plot(thresholds, precisions[:len(thresholds)], label='precision')
plt.plot(thresholds, recalls[:len(thresholds)], label='recall')
plt.grid(); plt.legend(); plt.show()
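
The `[:len(thresholds)]` slice is there because precision_recall_curve returns one more precision/recall value than thresholds: the final pair (precision 1, recall 0) has no matching threshold. A small sketch on made-up scores (the arrays below are illustrative, not the wine data) shows the length mismatch:

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

# Hypothetical labels and scores, for illustration only
y_true = np.array([0, 0, 1, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8])

precisions, recalls, thresholds = precision_recall_curve(y_true, scores)

# precisions and recalls are one element longer than thresholds
print(len(precisions), len(recalls), len(thresholds))

# The extra final point is fixed at precision = 1, recall = 0
print(precisions[-1], recalls[-1])
```

Dropping that last point, as the plotting code does, lets both curves be drawn against the same threshold axis.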

📌 Changing the threshold

pred_proba = lr.predict_proba(X_test)
pred_proba[:3]
>>>>
array([[0.40526731, 0.59473269],
       [0.50957556, 0.49042444],
       [0.10215001, 0.89784999]])

import numpy as np

np.concatenate([pred_proba, y_pred_test.reshape(-1, 1)], axis=1)
>>>>
array([[0.40526731, 0.59473269, 1.        ],
       [0.50957556, 0.49042444, 0.        ],
       [0.10215001, 0.89784999, 1.        ],
       ...,
       [0.22540242, 0.77459758, 1.        ],
       [0.67366935, 0.32633065, 0.        ],
       [0.31452992, 0.68547008, 1.        ]])

🔻Changing the threshold (Binarizer)

from sklearn.preprocessing import Binarizer

binarizer = Binarizer(threshold=0.6).fit(pred_proba)
pred_bin = binarizer.transform(pred_proba)[:, 1]
pred_bin
>>>>
array([0., 0., 1., ..., 1., 0., 1.])

print(classification_report(y_test, pred_bin))
>>>>
              precision    recall  f1-score   support

         0.0       0.62      0.73      0.67       477
         1.0       0.82      0.74      0.78       823

    accuracy                           0.73      1300
   macro avg       0.72      0.73      0.72      1300
weighted avg       0.75      0.73      0.74      1300
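
For reference, Binarizer maps values strictly greater than the threshold to 1 and everything else to 0, so the same result can be had with a plain comparison on the class 1 column. A minimal sketch, using a made-up probability array in place of `pred_proba`:

```python
import numpy as np
from sklearn.preprocessing import Binarizer

# Hypothetical predict_proba-style output: columns are P(class 0), P(class 1)
proba = np.array([[0.41, 0.59],
                  [0.51, 0.49],
                  [0.10, 0.90],
                  [0.40, 0.60]])

# Approach used above: binarize every column, then keep the class 1 column
via_binarizer = Binarizer(threshold=0.6).fit_transform(proba)[:, 1]

# Equivalent: compare the class 1 column against the threshold directly
via_comparison = (proba[:, 1] > 0.6).astype(float)

print(via_binarizer)
print(np.array_equal(via_binarizer, via_comparison))
```

Note that a probability exactly equal to the threshold (the last row here) is labeled 0 in both versions, because the comparison is strict.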