๋ณธ๋ฌธ ๋ฐ”๋กœ๊ฐ€๊ธฐ


์Šคํ„ฐ๋”” ๋…ธํŠธ (ML2)

๐Ÿ“Œ Scikit Learn Tree

๐Ÿ”ป Tree model visualization 

import matplotlib.pyplot as plt
from sklearn.tree import plot_tree

# iris_clf: a DecisionTreeClassifier assumed to be fit earlier on the iris data
plt.figure(figsize=(12, 8))
plot_tree(iris_clf);

๐Ÿ”ป mlxtend.plotting

- ์ œ๊ณตํ•œ ๋ฐ์ดํ„ฐ์— ๋”ฐ๋ผ ๊ฒฝ๊ณ„์„ ์„ ๊ทธ๋ ค์ฃผ๋Š” ํ•จ์ˆ˜

- feature๊ฐ€ ์ ์–ด์„œ ์‚ฌ์šฉํ•จ, feature๊ฐ€ ๋งŽ๋‹ค๋ฉด ์‚ฌ์šฉํ•˜๊ธฐ ์–ด๋ ค์›€

- ๊ฒฝ๊ณ„๋ฉด์ด ๊ณผ์—ฐ ์˜ฌ๋ฐ”๋ฅธ ๊ฑธ๊นŒ ์˜์‹ฌ์„ ํ•ด๋ด์•ผ ํ•œ๋‹ค. (๋ณต์žกํ•œ ๊ฒฝ๊ณ„๋ฉด์€ ๋ชจ๋ธ์˜ ์„ฑ๋Šฅ์„ ๊ฒฐ๊ตญ ๋‚˜์˜๊ฒŒ ๋งŒ๋“ ๋‹ค.)

from mlxtend.plotting import plot_decision_regions

plt.figure(figsize=(14, 8))
plot_decision_regions(X=iris.data[:, 2:], y=iris.target, clf=iris_clf, legend=2)
plt.show()

๐Ÿ“Œ ๊ณผ์ ํ•ฉ (Over fitting)

- ๋‚ด๊ฐ€ ๊ฐ€์ง€๊ณ  ์žˆ๋Š” ๋ฐ์ดํ„ฐ์— ๋„ˆ๋ฌด fitํ•˜์—ฌ ๋ฐ์ดํ„ฐ ์ด์™ธ์˜ ์ผ๋ฐ˜์  ๋ฐ์ดํ„ฐ์—์„œ ์ œ ์„ฑ๋Šฅ์ด ์ž˜ ๋‚˜์˜ค์ง€ ๋ชปํ•˜๋Š” ๊ฒƒ

๐Ÿ”ป ๋ฐ์ดํ„ฐ์˜ ๋ถ„๋ฆฌ (๊ณผ์ ํ•ฉ์ธ์ง€ ์•„๋‹Œ์ง€ ํŒ์ •ํ•˜๊ธฐ ์œ„ํ•ด)

- model_selection -> train_test_split ํ•จ์ˆ˜ ์‚ฌ์šฉ 

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
import pandas as pd

iris = load_iris()
features = iris.data[:, 2:]
labels = iris.target

์ถœ์ฒ˜ : ์ œ๋กœ๋ฒ ์ด์Šค ๋ฐ์ดํ„ฐ ์Šค์ฟจ

๐Ÿ”ป ํ›ˆ๋ จ ๋ฐ์ดํ„ฐ์™€ ํ…Œ์ŠคํŠธ ๋ฐ์ดํ„ฐ๋ฅผ ๋‚˜๋ˆ„๊ธฐ

- ํ…Œ์ŠคํŠธ ๋ฐ์ดํ„ฐ๋ฅผ ๋‚˜๋ˆ ๋ณด๋‹ˆ 9, 8, 13๊ฐœ๋กœ ๋‚˜๋ˆ ์ ธ ์žˆ๋‹ค- ์ด๋ฅผ stratify ์˜ต์…˜์œผ๋กœ ๊ฐ๊ฐ ๋™์ผํ•˜๊ฒŒ ๋‚˜๋ˆ ์„œ ๋ฐ์ดํ„ฐ๋ฅผ ๋‚˜๋ˆŒ ์ˆ˜ ์žˆ๋‹ค.

# test_size=0.2 --> use 20% of the data for testing
X_train, X_test, y_train, y_test = train_test_split(features, labels, test_size=0.2, random_state=13)
X_train.shape, X_test.shape
>>>>
((120, 2), (30, 2))
# check how many of each species ended up in the test data
import numpy as np
np.unique(y_test, return_counts=True)
>>>>
(array([0, 1, 2]), array([ 9,  8, 13], dtype=int64))
# stratify --> option that keeps the class ratio the same in each split
X_train, X_test, y_train, y_test = train_test_split(features, labels, test_size=0.2, stratify=labels, random_state=13)
np.unique(y_test, return_counts=True)
>>>>
(array([0, 1, 2]), array([10, 10, 10], dtype=int64))
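As a quick sanity check (a sketch reproducing the same stratified split as above), the training split should keep the 1:1:1 class ratio as well:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()
features = iris.data[:, 2:]
labels = iris.target

# same split as above; stratify keeps the class ratios in both splits
X_train, X_test, y_train, y_test = train_test_split(
    features, labels, test_size=0.2, stratify=labels, random_state=13)

# the 120 training samples should also be balanced: 40 per class
classes, counts = np.unique(y_train, return_counts=True)
print(classes, counts)
```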

๐Ÿ”ป ํ›ˆ๋ จ ๋ฐ์ดํ„ฐ์™€ ํ…Œ์ŠคํŠธ ๋ฐ์ดํ„ฐ๋ฅผ ๋‚˜๋ˆ„๊ธฐ

- max_depth ๊ฐ’์„ ์ง€์ •ํ•˜์—ฌ Tree๊ฐ€ ๋ป—์–ด๋‚˜๊ฐ€๋Š” ๊ฐœ์ˆ˜๋ฅผ ๊ทœ์ œํ•  ์ˆ˜ ์žˆ๋‹ค

--> ๊ณผ์ ํ•ฉ ๋ฐฉ์ง€

# setting max_depth limits how deep the tree can grow (constrains the model --> prevents overfitting)
from sklearn.tree import DecisionTreeClassifier
iris_tree = DecisionTreeClassifier(max_depth=2, random_state=13)
iris_tree.fit(X_train, y_train)
import matplotlib.pyplot as plt
from sklearn.tree import plot_tree
plt.figure(figsize=(12, 8))
plot_tree(iris_tree);

๐Ÿ”ป ์ •ํ™•๋„๋ฅผ ํ™•์ธ

- ์ด์ „ ๋ณด๋‹ค ๋‚ฎ์•„์ง„ ๊ฒƒ์„ ํ™•์ธํ•  ์ˆ˜ ์žˆ๋‹ค.

from sklearn.metrics import accuracy_score
y_pred_tr = iris_tree.predict(iris.data[:, 2:])
accuracy_score(iris.target, y_pred_tr)
>>>>
0.9533333333333334
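A common way to check for overfitting (a minimal sketch using the same split and model settings as above) is to compare training and test accuracy; a large gap suggests the model has memorized the training data:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data[:, 2:], iris.target, test_size=0.2,
    stratify=iris.target, random_state=13)

clf = DecisionTreeClassifier(max_depth=2, random_state=13)
clf.fit(X_train, y_train)

# score() returns accuracy; a big train/test gap hints at overfitting
train_acc = clf.score(X_train, y_train)
test_acc = clf.score(X_test, y_test)
print(train_acc, test_acc)
```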

๐Ÿ”ป ํ›ˆ๋ จ๋œ ๋ฐ์ดํ„ฐ์˜ ๊ฒฐ์ • ๊ฒฝ๊ณ„๋ฅผ ํ™•์ธํ•ด๋ณด์ž

from mlxtend.plotting import plot_decision_regions
plt.figure(figsize=(14, 8))
plot_decision_regions(X=X_train, y=y_train, clf=iris_tree, legend=2)
plt.show()

๐Ÿ”ป Test ๋ฐ์ดํ„ฐ์˜ ์ •ํ™•์„ฑ์„ ์•Œ์•„๋ณด์ž

# Test ๋ฐ์ดํ„ฐ ์ •ํ™•์„ฑ ํ™•์ธ
y_pred_test = iris_tree.predict(X=X_test)
accuracy_score(y_test, y_pred_test)
>>>>
0.9666666666666667
# compare the test data against the full dataset
scatter_highlight_kwargs = {'s':150, 'label':'Test data', 'alpha':0.9}
scatter_kwargs = {'s':120, 'edgecolor':None, 'alpha':0.9}

plt.figure(figsize=(12, 8))
plot_decision_regions(X=features, y=labels, X_highlight=X_test, clf=iris_tree, legend=2, 
                      scatter_highlight_kwargs=scatter_highlight_kwargs,
                      scatter_kwargs=scatter_kwargs,
                      contourf_kwargs={'alpha':0.2})

๐Ÿ“Œ Feature data๋ฅผ 4๊ฐœ๋กœ ํ•ด๋ณด๊ธฐ 

- ๊ธฐ์กด์—๋Š” petal width, petal length๋กœ ์ง„ํ–‰ํ–ˆ๋‹ค๋ฉด, sepal length, sepal width๋ฅผ ์ถ”๊ฐ€ํ•ด์„œ ์ง„ํ–‰ํ•ด๋ณด์ž

from sklearn.model_selection import train_test_split
features = iris.data
labels = iris.target

X_train, X_test, y_train, y_test = train_test_split(features, labels, test_size=0.2, stratify=labels, random_state=13)
iris_tree = DecisionTreeClassifier(max_depth=2, random_state=13)
iris_tree.fit(X_train, y_train)
import matplotlib.pyplot as plt
from sklearn.tree import plot_tree

plt.figure(figsize=(12, 8))
plot_tree(iris_tree);

๐Ÿ“Œ ํ•ด๋‹น ๋ชจ๋ธ์— ๋ฐ์ดํ„ฐ๋ฅผ ๋„ฃ์–ด๋ณด์ž

- feature 4๊ฐœ์˜ ๋ชจ๋ธ์— ๋ฐ์ดํ„ฐ๋ฅผ ๋„ฃ์–ด๋ณด๋‹ˆ, 1 ์ฆ‰ 'versicolor' ๊ฐ€ ๋‚˜์˜ค๋Š” ๊ฒƒ์„ ํ™•์ธํ•  ์ˆ˜ ์žˆ๋‹ค.

## ๋ชจ๋ธ์„ ๋งŒ๋“ค์—ˆ์œผ๋‹ˆ ๋ฐ์ดํ„ฐ๋ฅผ ๋„ฃ์–ด๋ณด์ž
test_data = np.array([[4.3, 2.0, 1.2, 1.0]])
iris_tree.predict(test_data)
>>>>
array([1])

๐Ÿ”ป Data์˜ ํ™•๋ฅ ์„ ์ฒดํฌํ•˜๊ณ  ์ด๋ฆ„ ํ™•์ธํ•˜๊ธฐ

# data์˜ ํ™•๋ฅ  (๋‚ด ๋ชจ๋ธ์— ๋ฐ์ดํ„ฐ๋ฅผ ๋„ฃ์œผ๋ฉด ์–ด๋–ค ๊ฐ’์ด ๋‚˜์˜ฌ์ง€์˜ ํ™•๋ฅ !)
iris_tree.predict_proba(test_data)
>>>>
array([[0.        , 0.97222222, 0.02777778]])
# ๊ฐ’์ด 1์ด ๋‚˜์™”๊ธฐ ๋•Œ๋ฌธ์—, target_names์— ๋ณ€์ˆ˜๋ฅผ ๋„ฃ์œผ๋ฉด ํ•ด๋‹นํ•˜๋Š” ๊ฝƒ์˜ ์ด๋ฆ„์ด ๋ฐ˜ํ™˜๋œ๋‹ค.
iris.target_names[iris_tree.predict(test_data)]
>>>>
array(['versicolor'], dtype='<U10')
# ๋ชจ๋ธ์„ ๊ฒฐ์ •ํ•˜๋Š” ๋ฐ ์ค‘์š”ํ•œ feature --> max_depth ๊ฐ’์— ๋”ฐ๋ผ ๋ฐ”๋€๋‹ค.
iris_tree = DecisionTreeClassifier(max_depth=2, random_state=13)
iris_tree.fit(X_train, y_train)

iris_tree.feature_importances_
# zipping
dict(zip(iris.feature_names, iris_tree.feature_importances_))
>>>>
{'sepal length (cm)': 0.0,
 'sepal width (cm)': 0.0,
 'petal length (cm)': 0.421897810218978,
 'petal width (cm)': 0.578102189781022}
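The importance dict above can be sorted to rank the features (a plain-Python sketch using the values printed above):

```python
# feature importances from the max_depth=2 tree above
importances = {
    'sepal length (cm)': 0.0,
    'sepal width (cm)': 0.0,
    'petal length (cm)': 0.421897810218978,
    'petal width (cm)': 0.578102189781022,
}

# sort by importance, highest first
ranked = sorted(importances.items(), key=lambda kv: kv[1], reverse=True)
print(ranked[0][0])  # petal width (cm) matters most to this tree
```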

๐Ÿ“Œ zipping / ์–ธํŒจํ‚น์— ๋Œ€ํ•ด์„œ

๐Ÿ”ป๋ฆฌ์ŠคํŠธ๋ฅผ Tuple๋กœ zipping

list1 = ['a', 'b', 'c']
list2 = [1, 2, 3]
pairs = [pair for pair in zip(list1, list2)]
pairs
>>>>
[('a', 1), ('b', 2), ('c', 3)]
dict(pairs)
>>>>
{'a': 1, 'b': 2, 'c': 3}
# ํ•œ๋ฒˆ์— ํ•ด๊ฒฐํ•˜๊ธฐ
dict(zip(list1, list2))
>>>>
{'a': 1, 'b': 2, 'c': 3}

๐Ÿ”ปunpacking ์ธ์ž๋ฅผ ํ†ตํ•œ ์—ญ๋ณ€ํ™˜ํ•˜๊ธฐ

# unpacking ์ธ์ž๋ฅผ ํ†ตํ•œ ์—ญ๋ณ€ํ™˜
x, y = zip(*pairs)
x, y
>>>>
(('a', 'b', 'c'), (1, 2, 3))
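One caveat worth noting (not shown above): zip stops at the shortest input, so unequal-length lists are silently truncated:

```python
# zip truncates to the shortest iterable
short = list(zip(['a', 'b', 'c'], [1, 2]))
print(short)  # [('a', 1), ('b', 2)]
```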