
Study Note (credit card data 1)

📌 Detecting fraudulent credit card users

🔻Data import

→ The proportion of fraudulent card transactions is extremely low (0.17%). Such a low ratio means the data is highly imbalanced.

import pandas as pd

raw_data = pd.read_csv('../data/creditcard.csv')
raw_data.head()

raw_data['Class'].value_counts()
>>>>
Class
0    284315
1       492
Name: count, dtype: int64
frauds_rate = round(raw_data['Class'].value_counts()[1] / len(raw_data) * 100, 2)
frauds_rate
>>>>
0.17
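
→ That is 492 fraud cases out of 284,807 transactions in total, i.e. roughly 1 fraud for every 580 transactions.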

🔻Plotting the class distribution

import seaborn as sns
import matplotlib.pyplot as plt

sns.countplot(x='Class', data = raw_data)
plt.title('Class Distribution')
plt.show()

It's hard to see at this scale, but the tiny bar at 1 is the fraud class.
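
One way to make the minority class visible is a log-scaled y-axis (a minimal sketch, reusing the seaborn/matplotlib imports above):

# same count plot, but with a log-scaled y-axis so the class 1 bar shows up
ax = sns.countplot(x='Class', data=raw_data)
ax.set_yscale('log')
plt.title('Class Distribution (log scale)')
plt.show()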

🔻Checking the class ratio in the training data

from sklearn.model_selection import train_test_split

X = raw_data.iloc[:, 1:-1]  # feature columns V1 through Amount (drops Time and Class)
y = raw_data.iloc[:, -1]    # last column only (Class)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=13, stratify=y)
import numpy as np
np.unique(y_train, return_counts=True)
>>>>
(array([0, 1], dtype=int64), array([227451,    394], dtype=int64))

# ๋ฐ์ดํ„ฐ์˜ ๋ถˆ๊ท ํ˜• ์ •๋„๊ฐ€ ์–ด๋–ค์ง€ ํ™•์ธ
tmp = np.unique(y_train, return_counts=True)[1]
tmp[1] / len(y_train) * 100
>>>>
0.17292457591783889
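
Since stratify=y was passed to train_test_split, the test split should keep roughly the same fraud ratio. A quick sanity check (a minimal sketch, reusing y_test from above):

# fraud ratio in the test split - should also be about 0.17% thanks to stratify=y
tmp_test = np.unique(y_test, return_counts=True)[1]
tmp_test[1] / len(y_test) * 100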

📌 Trying out different models

🔻Defining evaluation functions

# ๋ถ„๋ฅ˜๊ธฐ์˜ ์„ฑ๋Šฅ์„ returnํ•˜๋Š” ํ•จ์ˆ˜ 
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score

def get_clf_eval(y_test, pred):
    acc = accuracy_score(y_test, pred)
    pre = precision_score(y_test, pred)
    re = recall_score(y_test, pred)
    f1 = f1_score(y_test, pred)
    auc = roc_auc_score(y_test, pred)

    return acc, pre, re, f1, auc
# function that prints the performance
from sklearn.metrics import confusion_matrix

def print_clf_eval(y_test, pred):
    confusion = confusion_matrix(y_test, pred)
    acc, pre, re, f1, auc = get_clf_eval(y_test, pred)

    print('confusion matrix')
    print(confusion)
    print('----------------')

    print('Accuracy : {0:.4f}, precision : {1:.4f}'.format(acc, pre))
    print('recall : {0:.4f}, f1_score : {1:.4f}, auc : {2:.4f}'.format(re, f1, auc))
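
For reference, confusion_matrix(y_test, pred) puts the true label on the rows and the predicted label on the columns, so for this binary problem the printout reads [[TN, FP], [FN, TP]]: the top-right cell is normal transactions flagged as fraud, and the bottom-left cell is fraud missed as normal.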

🔻Running logistic regression

→ Of the 56,864 normal (class 0) samples, 8 were misclassified as fraud

→ Of the 98 fraud (class 1) samples, 40 were misclassified as normal

→ Accuracy is 99.92%, but recall is only about 59%, which is poor (only 59% of the actual fraud cases are detected)

from sklearn.linear_model import LogisticRegression

lr_clf = LogisticRegression(random_state=13, solver='liblinear')
lr_clf.fit(X_train, y_train)
lr_pred = lr_clf.predict(X_test)

print_clf_eval(y_test, lr_pred)
>>>>
confusion matrix
[[56856     8]
 [   40    58]]
----------------
Accuracy : 0.9992, precision : 0.8788
recall : 0.5918, f1_score : 0.7073, auc : 0.7958
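
Reading the scores straight off the confusion matrix: precision = 58 / (58 + 8) ≈ 0.8788 and recall = 58 / (58 + 40) ≈ 0.5918, which matches the printed values.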

🔻Running a decision tree

→ Of the 56,864 normal (class 0) samples, 8 were misclassified as fraud

→ Of the 98 fraud (class 1) samples, 33 were misclassified as normal (recall 66%)

from sklearn.tree import DecisionTreeClassifier

dt_clf = DecisionTreeClassifier(random_state=13, max_depth=4)
dt_clf.fit(X_train, y_train)
dt_pred = dt_clf.predict(X_test)

print_clf_eval(y_test, dt_pred)
>>>>
confusion matrix
[[56856     8]
 [   33    65]]
----------------
Accuracy : 0.9993, precision : 0.8904
recall : 0.6633, f1_score : 0.7602, auc : 0.8316

๐Ÿ”ป๋žœ๋คํฌ๋ ˆ์ŠคํŠธ๋กœ ์‹คํ–‰ ํ•ด๋ณด๊ธฐ

→ Of the 56,864 normal (class 0) samples, 7 were misclassified as fraud

→ Of the 98 fraud (class 1) samples, 25 were misclassified as normal (recall 74%)

→ Performance seems to be improving step by step

from sklearn.ensemble import RandomForestClassifier

rf_clf = RandomForestClassifier(random_state=13, n_jobs=-1, n_estimators=100)
rf_clf.fit(X_train, y_train)
rf_pred = rf_clf.predict(X_test)

print_clf_eval(y_test, rf_pred)
>>>>
confusion matrix
[[56857     7]
 [   25    73]]
----------------
Accuracy : 0.9994, precision : 0.9125
recall : 0.7449, f1_score : 0.8202, auc : 0.8724

🔻Running LightGBM

→ Of the 56,864 normal (class 0) samples, 6 were misclassified as fraud

→ Of the 98 fraud (class 1) samples, 24 were misclassified as normal (recall 75%)

→ Not much of a difference from the random forest

from lightgbm import LGBMClassifier

lgbm_clf = LGBMClassifier(random_state=13, n_jobs=-1, n_estimators=1000, num_leaves=64, boost_from_average=False)
lgbm_clf.fit(X_train, y_train)
lgbm_pred = lgbm_clf.predict(X_test)

print_clf_eval(y_test, lgbm_pred)
>>>>
confusion matrix
[[56858     6]
 [   24    74]]
----------------
Accuracy : 0.9995, precision : 0.9250
recall : 0.7551, f1_score : 0.8315, auc : 0.8775

🔻Functions to collect the results into a DataFrame

def get_result(model, X_train, y_train, X_test, y_test):
    model.fit(X_train, y_train)
    pred = model.predict(X_test)

    return get_clf_eval(y_test, pred)
import pandas as pd

def get_result_pd(models, model_names, X_train, y_train, X_test, y_test):
    col_names = ['accuracy', 'precision', 'recall', 'f1', 'roc_auc']
    tmp = []

    for model in models:
        tmp.append(get_result(model, X_train, y_train, X_test, y_test))
    
    return pd.DataFrame(tmp, columns=col_names, index=model_names)

๐Ÿ”ป๊ฒฐ๊ณผ๋ฅผ ๋ฐ์ดํ„ฐ ํ”„๋ ˆ์ž„์œผ๋กœ!

→ Accuracy is very high for every model

→ In terms of recall, RandomForest and LightGBM perform best

models = [lr_clf, dt_clf, rf_clf, lgbm_clf]
model_names = ['Logistic Regression', 'DecisionTree', 'RandomForest', 'LightGBM']
result = get_result_pd(models, model_names, X_train, y_train, X_test, y_test)
result
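
To compare the models at a glance, the result table can also be sorted by recall (a small sketch using the result DataFrame from above):

# sort models by recall, best first
result.sort_values(by='recall', ascending=False)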