๋ณธ๋ฌธ ๋ฐ”๋กœ๊ฐ€๊ธฐ


์Šคํ„ฐ๋”” ๋…ธํŠธ (ML2)

๐Ÿ“Œ Scikit Learn Tree

๐Ÿ”ป Tree model visualization 

import matplotlib.pyplot as plt
from sklearn.tree import plot_tree

# iris_clf: a DecisionTreeClassifier assumed to be fit earlier on the iris data
plt.figure(figsize=(12, 8))
plot_tree(iris_clf);

๐Ÿ”ป mlxtend.plotting

- ์ œ๊ณตํ•œ ๋ฐ์ดํ„ฐ์— ๋”ฐ๋ผ ๊ฒฝ๊ณ„์„ ์„ ๊ทธ๋ ค์ฃผ๋Š” ํ•จ์ˆ˜

- feature๊ฐ€ ์ ์–ด์„œ ์‚ฌ์šฉํ•จ, feature๊ฐ€ ๋งŽ๋‹ค๋ฉด ์‚ฌ์šฉํ•˜๊ธฐ ์–ด๋ ค์›€

- ๊ฒฝ๊ณ„๋ฉด์ด ๊ณผ์—ฐ ์˜ฌ๋ฐ”๋ฅธ ๊ฑธ๊นŒ ์˜์‹ฌ์„ ํ•ด๋ด์•ผ ํ•œ๋‹ค. (๋ณต์žกํ•œ ๊ฒฝ๊ณ„๋ฉด์€ ๋ชจ๋ธ์˜ ์„ฑ๋Šฅ์„ ๊ฒฐ๊ตญ ๋‚˜์˜๊ฒŒ ๋งŒ๋“ ๋‹ค.)

from mlxtend.plotting import plot_decision_regions

plt.figure(figsize=(14, 8))
plot_decision_regions(X=iris.data[:, 2:], y=iris.target, clf=iris_clf, legend=2)
plt.show()

๐Ÿ“Œ ๊ณผ์ ํ•ฉ (Over fitting)

- ๋‚ด๊ฐ€ ๊ฐ€์ง€๊ณ  ์žˆ๋Š” ๋ฐ์ดํ„ฐ์— ๋„ˆ๋ฌด fitํ•˜์—ฌ ๋ฐ์ดํ„ฐ ์ด์™ธ์˜ ์ผ๋ฐ˜์  ๋ฐ์ดํ„ฐ์—์„œ ์ œ ์„ฑ๋Šฅ์ด ์ž˜ ๋‚˜์˜ค์ง€ ๋ชปํ•˜๋Š” ๊ฒƒ

๐Ÿ”ป ๋ฐ์ดํ„ฐ์˜ ๋ถ„๋ฆฌ (๊ณผ์ ํ•ฉ์ธ์ง€ ์•„๋‹Œ์ง€ ํŒ์ •ํ•˜๊ธฐ ์œ„ํ•ด)

- model_selection -> train_test_split ํ•จ์ˆ˜ ์‚ฌ์šฉ 

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
import pandas as pd

iris = load_iris()
features = iris.data[:, 2:]
labels = iris.target

์ถœ์ฒ˜ : ์ œ๋กœ๋ฒ ์ด์Šค ๋ฐ์ดํ„ฐ ์Šค์ฟจ

๐Ÿ”ป ํ›ˆ๋ จ ๋ฐ์ดํ„ฐ์™€ ํ…Œ์ŠคํŠธ ๋ฐ์ดํ„ฐ๋ฅผ ๋‚˜๋ˆ„๊ธฐ

- ํ…Œ์ŠคํŠธ ๋ฐ์ดํ„ฐ๋ฅผ ๋‚˜๋ˆ ๋ณด๋‹ˆ 9, 8, 13๊ฐœ๋กœ ๋‚˜๋ˆ ์ ธ ์žˆ๋‹ค- ์ด๋ฅผ stratify ์˜ต์…˜์œผ๋กœ ๊ฐ๊ฐ ๋™์ผํ•˜๊ฒŒ ๋‚˜๋ˆ ์„œ ๋ฐ์ดํ„ฐ๋ฅผ ๋‚˜๋ˆŒ ์ˆ˜ ์žˆ๋‹ค.

# test_size=0.2 --> use 20% of the data for testing
X_train, X_test, y_train, y_test = train_test_split(features, labels, test_size=0.2, random_state=13)
X_train.shape, X_test.shape
>>>>
((120, 2), (30, 2))
# check how many of each species ended up in the test data
import numpy as np
np.unique(y_test, return_counts=True)
>>>>
(array([0, 1, 2]), array([ 9,  8, 13], dtype=int64))
# stratify --> option that keeps the class ratio the same in each split
X_train, X_test, y_train, y_test = train_test_split(features, labels, test_size=0.2, stratify=labels, random_state=13)
np.unique(y_test, return_counts=True)
>>>>
(array([0, 1, 2]), array([10, 10, 10], dtype=int64))
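As a quick sanity check (a sketch reproducing the same stratified split as above), the training split should keep the 1:1:1 class ratio as well:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()
features = iris.data[:, 2:]
labels = iris.target

# same split as above; stratify keeps the class ratios in both splits
X_train, X_test, y_train, y_test = train_test_split(
    features, labels, test_size=0.2, stratify=labels, random_state=13)

# the 120 training samples should also be balanced: 40 per class
classes, counts = np.unique(y_train, return_counts=True)
print(classes, counts)
```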

๐Ÿ”ป ํ›ˆ๋ จ ๋ฐ์ดํ„ฐ์™€ ํ…Œ์ŠคํŠธ ๋ฐ์ดํ„ฐ๋ฅผ ๋‚˜๋ˆ„๊ธฐ

- max_depth ๊ฐ’์„ ์ง€์ •ํ•˜์—ฌ Tree๊ฐ€ ๋ป—์–ด๋‚˜๊ฐ€๋Š” ๊ฐœ์ˆ˜๋ฅผ ๊ทœ์ œํ•  ์ˆ˜ ์žˆ๋‹ค

--> ๊ณผ์ ํ•ฉ ๋ฐฉ์ง€

# setting max_depth limits how deep the tree can grow (constrains the model --> prevents overfitting)
from sklearn.tree import DecisionTreeClassifier
iris_tree = DecisionTreeClassifier(max_depth=2, random_state=13)
iris_tree.fit(X_train, y_train)
import matplotlib.pyplot as plt
from sklearn.tree import plot_tree
plt.figure(figsize=(12, 8))
plot_tree(iris_tree);

๐Ÿ”ป ์ •ํ™•๋„๋ฅผ ํ™•์ธ

- ์ด์ „ ๋ณด๋‹ค ๋‚ฎ์•„์ง„ ๊ฒƒ์„ ํ™•์ธํ•  ์ˆ˜ ์žˆ๋‹ค.

from sklearn.metrics import accuracy_score
y_pred_tr = iris_tree.predict(iris.data[:, 2:])
accuracy_score(iris.target, y_pred_tr)
>>>>
0.9533333333333334
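A common way to check for overfitting (a minimal sketch using the same split and model settings as above) is to compare training and test accuracy; a large gap suggests the model has memorized the training data:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(
    iris.data[:, 2:], iris.target, test_size=0.2,
    stratify=iris.target, random_state=13)

clf = DecisionTreeClassifier(max_depth=2, random_state=13)
clf.fit(X_train, y_train)

# score() returns accuracy; a big train/test gap hints at overfitting
train_acc = clf.score(X_train, y_train)
test_acc = clf.score(X_test, y_test)
print(train_acc, test_acc)
```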

๐Ÿ”ป ํ›ˆ๋ จ๋œ ๋ฐ์ดํ„ฐ์˜ ๊ฒฐ์ • ๊ฒฝ๊ณ„๋ฅผ ํ™•์ธํ•ด๋ณด์ž

from mlxtend.plotting import plot_decision_regions
plt.figure(figsize=(14, 8))
plot_decision_regions(X=X_train, y=y_train, clf=iris_tree, legend=2)
plt.show()

๐Ÿ”ป Test ๋ฐ์ดํ„ฐ์˜ ์ •ํ™•์„ฑ์„ ์•Œ์•„๋ณด์ž

# Test ๋ฐ์ดํ„ฐ ์ •ํ™•์„ฑ ํ™•์ธ
y_pred_test = iris_tree.predict(X=X_test)
accuracy_score(y_test, y_pred_test)
>>>>
0.9666666666666667
# compare the test data against the full dataset
scatter_highlight_kwargs = {'s':150, 'label':'Test data', 'alpha':0.9}
scatter_kwargs = {'s':120, 'edgecolor':None, 'alpha':0.9}

plt.figure(figsize=(12, 8))
plot_decision_regions(X=features, y=labels, X_highlight=X_test, clf=iris_tree, legend=2, 
                      scatter_highlight_kwargs=scatter_highlight_kwargs,
                      scatter_kwargs=scatter_kwargs,
                      contourf_kwargs={'alpha':0.2})

๐Ÿ“Œ Feature data๋ฅผ 4๊ฐœ๋กœ ํ•ด๋ณด๊ธฐ 

- ๊ธฐ์กด์—๋Š” petal width, petal length๋กœ ์ง„ํ–‰ํ–ˆ๋‹ค๋ฉด, sepal length, sepal width๋ฅผ ์ถ”๊ฐ€ํ•ด์„œ ์ง„ํ–‰ํ•ด๋ณด์ž

from sklearn.model_selection import train_test_split
features = iris.data
labels = iris.target

X_train, X_test, y_train, y_test = train_test_split(features, labels, test_size=0.2, stratify=labels, random_state=13)
iris_tree = DecisionTreeClassifier(max_depth=2, random_state=13)
iris_tree.fit(X_train, y_train)
import matplotlib.pyplot as plt
from sklearn.tree import plot_tree

plt.figure(figsize=(12, 8))
plot_tree(iris_tree);

๐Ÿ“Œ ํ•ด๋‹น ๋ชจ๋ธ์— ๋ฐ์ดํ„ฐ๋ฅผ ๋„ฃ์–ด๋ณด์ž

- feature 4๊ฐœ์˜ ๋ชจ๋ธ์— ๋ฐ์ดํ„ฐ๋ฅผ ๋„ฃ์–ด๋ณด๋‹ˆ, 1 ์ฆ‰ 'versicolor' ๊ฐ€ ๋‚˜์˜ค๋Š” ๊ฒƒ์„ ํ™•์ธํ•  ์ˆ˜ ์žˆ๋‹ค.

## ๋ชจ๋ธ์„ ๋งŒ๋“ค์—ˆ์œผ๋‹ˆ ๋ฐ์ดํ„ฐ๋ฅผ ๋„ฃ์–ด๋ณด์ž
test_data = np.array([[4.3, 2.0, 1.2, 1.0]])
iris_tree.predict(test_data)
>>>>
array([1])

๐Ÿ”ป Data์˜ ํ™•๋ฅ ์„ ์ฒดํฌํ•˜๊ณ  ์ด๋ฆ„ ํ™•์ธํ•˜๊ธฐ

# data์˜ ํ™•๋ฅ  (๋‚ด ๋ชจ๋ธ์— ๋ฐ์ดํ„ฐ๋ฅผ ๋„ฃ์œผ๋ฉด ์–ด๋–ค ๊ฐ’์ด ๋‚˜์˜ฌ์ง€์˜ ํ™•๋ฅ !)
iris_tree.predict_proba(test_data)
>>>>
array([[0.        , 0.97222222, 0.02777778]])
# ๊ฐ’์ด 1์ด ๋‚˜์™”๊ธฐ ๋•Œ๋ฌธ์—, target_names์— ๋ณ€์ˆ˜๋ฅผ ๋„ฃ์œผ๋ฉด ํ•ด๋‹นํ•˜๋Š” ๊ฝƒ์˜ ์ด๋ฆ„์ด ๋ฐ˜ํ™˜๋œ๋‹ค.
iris.target_names[iris_tree.predict(test_data)]
>>>>
array(['versicolor'], dtype='<U10')
# ๋ชจ๋ธ์„ ๊ฒฐ์ •ํ•˜๋Š” ๋ฐ ์ค‘์š”ํ•œ feature --> max_depth ๊ฐ’์— ๋”ฐ๋ผ ๋ฐ”๋€๋‹ค.
iris_tree = DecisionTreeClassifier(max_depth=2, random_state=13)
iris_tree.fit(X_train, y_train)

iris_tree.feature_importances_
# zipping
dict(zip(iris.feature_names, iris_tree.feature_importances_))
>>>>
{'sepal length (cm)': 0.0,
 'sepal width (cm)': 0.0,
 'petal length (cm)': 0.421897810218978,
 'petal width (cm)': 0.578102189781022}
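The importance dict above can be sorted to rank the features (a plain-Python sketch using the values printed above):

```python
# feature importances from the max_depth=2 tree above
importances = {
    'sepal length (cm)': 0.0,
    'sepal width (cm)': 0.0,
    'petal length (cm)': 0.421897810218978,
    'petal width (cm)': 0.578102189781022,
}

# sort by importance, highest first
ranked = sorted(importances.items(), key=lambda kv: kv[1], reverse=True)
print(ranked[0][0])  # petal width (cm) matters most to this tree
```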

๐Ÿ“Œ zipping / ์–ธํŒจํ‚น์— ๋Œ€ํ•ด์„œ

๐Ÿ”ป๋ฆฌ์ŠคํŠธ๋ฅผ Tuple๋กœ zipping

list1 = ['a', 'b', 'c']
list2 = [1, 2, 3]
pairs = [pair for pair in zip(list1, list2)]
pairs
>>>>
[('a', 1), ('b', 2), ('c', 3)]
dict(pairs)
>>>>
{'a': 1, 'b': 2, 'c': 3}
# ํ•œ๋ฒˆ์— ํ•ด๊ฒฐํ•˜๊ธฐ
dict(zip(list1, list2))
>>>>
{'a': 1, 'b': 2, 'c': 3}

๐Ÿ”ปunpacking ์ธ์ž๋ฅผ ํ†ตํ•œ ์—ญ๋ณ€ํ™˜ํ•˜๊ธฐ

# unpacking ์ธ์ž๋ฅผ ํ†ตํ•œ ์—ญ๋ณ€ํ™˜
x, y = zip(*pairs)
x, y
>>>>
(('a', 'b', 'c'), (1, 2, 3))
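One caveat worth noting (not shown above): zip stops at the shortest input, so unequal-length lists are silently truncated:

```python
# zip truncates to the shortest iterable
short = list(zip(['a', 'b', 'c'], [1, 2]))
print(short)  # [('a', 1), ('b', 2)]
```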