Study Notes (ML)

📌 Machine Learning

- The field of study that gives computers the ability to learn without being explicitly programmed.

- In practice, it means finding rules from the given data.

📌 Importing the iris dataset

- Import the iris data from sklearn.datasets.

- Use the dataset to distinguish setosa, versicolor, and virginica.

from sklearn.datasets import load_iris
iris = load_iris()
iris.keys()
>>>>
dict_keys(['data', 'target', 'frame', 'target_names', 'DESCR', 'feature_names', 'filename', 'data_module'])
print(iris['target_names'])
>>>>
['setosa' 'versicolor' 'virginica']
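import pandas as pd
iris_pd = pd.DataFrame(iris.data, columns=iris['feature_names'])
# Add the species name of each row; the plots below group by this column
iris_pd['species'] = iris.target_names[iris.target]
iris_pd

The iris data: 150 samples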

📌 Plotting the iris data

- Inspecting the data suggests that petal length and petal width may be enough to tell the species apart.

import seaborn as sns
import matplotlib.pyplot as plt

plt.figure(figsize=(12, 6))
sns.boxplot(x='sepal length (cm)', y='species', data=iris_pd, orient='h');

 

plt.figure(figsize=(12, 6))
sns.boxplot(x='petal length (cm)', y='species', data=iris_pd, orient='h');

plt.figure(figsize=(12, 6))
sns.boxplot(x='petal width (cm)', y='species', data=iris_pd, orient='h');

sns.pairplot(data=iris_pd,
             vars=['petal length (cm)', 'petal width (cm)'],
             hue='species', height=4);

plt.figure(figsize=(12, 10))
sns.scatterplot(data=iris_pd,
                x='petal length (cm)', y='petal width (cm)', hue='species');

📌 Decision Tree split criteria

🔻 The concept of entropy

- How disordered and uncertain is the information? Entropy measures the degree of disorder.

- Splitting the data lowers the entropy.

Source: Zerobase Data School
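
The drop in entropy after a split can be checked by hand. The sketch below is only an illustration: it assumes a split that isolates all 50 setosa samples (which the petal-length box plot suggests is possible), and the helper names `entropy` and `weighted_entropy` are made up for this note. Entropy is computed as the negative sum of p * log2(p) over the class proportions p.

```python
import math

def entropy(counts):
    """Shannon entropy (in bits) of a node, given its per-class sample counts."""
    total = sum(counts)
    # Classes with zero samples contribute nothing (0 * log 0 is taken as 0).
    return -sum((c / total) * math.log2(c / total) for c in counts if c > 0)

def weighted_entropy(children):
    """Average entropy of the child nodes, weighted by their sample counts."""
    n = sum(sum(child) for child in children)
    return sum(sum(child) / n * entropy(child) for child in children)

parent = [50, 50, 50]        # setosa, versicolor, virginica
children = [[50, 0, 0],      # assumed split: setosa only
            [0, 50, 50]]     # versicolor and virginica still mixed

before = entropy(parent)
after = weighted_entropy(children)
print(f"before split: {before:.3f}")   # before split: 1.585
print(f"after split:  {after:.3f}")    # after split:  0.667
```

The pure setosa node has entropy 0, so only the still-mixed node contributes to the value after the split.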

🔻 Gini coefficient

- Also called the Gini index or Gini impurity.

- Entropy is relatively expensive to compute because it needs logarithms, so the cheaper Gini coefficient is often used instead.

- Splitting the data lowers the Gini coefficient as well.

Source: Zerobase Data School
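
The same check works for the Gini coefficient, computed as 1 minus the sum of squared class proportions. This is a minimal sketch under the same assumption of a split that isolates all 50 setosa samples; the helper names `gini` and `weighted_gini` are made up for this note.

```python
def gini(counts):
    """Gini impurity of a node, given its per-class sample counts."""
    total = sum(counts)
    return 1.0 - sum((c / total) ** 2 for c in counts)

def weighted_gini(children):
    """Average Gini impurity of the child nodes, weighted by their sample counts."""
    n = sum(sum(child) for child in children)
    return sum(sum(child) / n * gini(child) for child in children)

parent = [50, 50, 50]        # setosa, versicolor, virginica
children = [[50, 0, 0],      # assumed split: setosa only
            [0, 50, 50]]     # versicolor and virginica still mixed

gini_before = gini(parent)
gini_after = weighted_gini(children)
print(f"before split: {gini_before:.3f}")   # before split: 0.667
print(f"after split:  {gini_after:.3f}")    # after split:  0.333
```

No logarithm is needed here, only squares and sums, which is why this measure is cheaper to compute than entropy.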

--> For both entropy and the Gini coefficient, lower is better.


📌 Scikit Learn

- Using the petal length and petal width columns of the iris data, let's train a model to distinguish 'setosa', 'versicolor', and 'virginica'.

# Train on the data with fit (columns 2 and 3 are petal length and petal width)
from sklearn.tree import DecisionTreeClassifier
iris_clf = DecisionTreeClassifier()
iris_clf.fit(iris.data[:, 2:], iris.target)
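
To see which rules the fitted tree learned, something like the following sketch may help. It uses scikit-learn's `export_text` helper, which is not part of the original note. `random_state=0` is an assumption added only so repeated runs build the same tree, and the sample measurements are a hypothetical flower.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
X = iris.data[:, 2:]                     # petal length, petal width
feature_names = iris.feature_names[2:]

# random_state is an assumption, fixed only so repeated runs build the same tree
iris_clf = DecisionTreeClassifier(random_state=0)
iris_clf.fit(X, iris.target)

# Print the learned if/else rules as plain text
rules = export_text(iris_clf, feature_names=feature_names)
print(rules)

# Classify one hypothetical flower: petal length 1.4 cm, petal width 0.2 cm
sample = [[1.4, 0.2]]
pred = iris_clf.predict(sample)
print(iris.target_names[pred])           # ['setosa']
```

Each printed rule is a threshold on one of the two petal measurements, which is the kind of "rule found from data" described at the start of these notes.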

🔻 Compare the model's predictions with the ground-truth labels printed below them.

# Predict with the trained model
from sklearn.metrics import accuracy_score
y_pred_tr = iris_clf.predict(iris.data[:, 2:])
y_pred_tr
>>>>
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 2, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])
iris.target
>>>>
array([0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
       0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
       1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
       2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2])

🔻 Checking the accuracy

# How accurate is the model? (measured on the same data it was trained on)
accuracy_score(iris.target, y_pred_tr)
>>>>
0.9933333333333333
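
The score can be reproduced by hand from the two arrays printed above. The sketch below rebuilds them as plain lists (they differ only at index 70, where a versicolor sample is predicted as virginica) and counts the matches. Note that this is accuracy on the training data itself, so it says little about how the model handles unseen flowers.

```python
# Labels as printed above: 50 samples of each class, in order
y_true = [0] * 50 + [1] * 50 + [2] * 50

# Predictions as printed above: identical except sample 70, predicted as class 2
y_pred = list(y_true)
y_pred[70] = 2

# Indices where the prediction disagrees with the label
wrong = [i for i, (t, p) in enumerate(zip(y_true, y_pred)) if t != p]

# Accuracy = correct predictions / all predictions
accuracy = (len(y_true) - len(wrong)) / len(y_true)
print(wrong)       # [70]
print(accuracy)    # 149 / 150, about 0.9933
```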