Study Note (PCA)

📌 PCA

🔻Understanding PCA

→ PCA is one of the most widely used dimensionality reduction techniques. It transforms data from a high-dimensional space into a low-dimensional space while preserving as much of the original data's distribution (variance) as possible.
→ It combines the existing variables to create new variables, the principal components.

Source : http://matrix.skku.ac.kr/math4ai-intro/W12/
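To make the idea concrete, here is a minimal sketch of PCA done by hand with NumPy: center the data, take the eigenvectors of the covariance matrix, and project onto the directions with the largest eigenvalues. The function name `pca_by_eig` and the random demo data are assumptions for illustration only. scikit-learn's `PCA` uses an SVD-based solver, so this is a conceptual equivalent rather than what the library runs internally.

```python
import numpy as np

def pca_by_eig(X, n_components=2):
    """Hypothetical helper: PCA via eigen-decomposition of the covariance matrix."""
    # Center each feature so the covariance is computed around the mean
    X_centered = X - X.mean(axis=0)
    cov = np.cov(X_centered, rowvar=False)
    # eigh is meant for symmetric matrices; it returns eigenvalues in ascending order
    eigvals, eigvecs = np.linalg.eigh(cov)
    order = np.argsort(eigvals)[::-1]          # largest variance first
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    components = eigvecs[:, :n_components]     # columns are the principal axes
    projected = X_centered @ components
    explained_ratio = eigvals[:n_components] / eigvals.sum()
    return projected, explained_ratio

# Made-up correlated data, only for demonstration
rng = np.random.default_rng(0)
mixing = np.array([[3.0, 0.0, 0.0],
                   [1.0, 1.0, 0.0],
                   [0.0, 0.0, 0.1]])
X_demo = rng.normal(size=(200, 3)) @ mixing

projected, ratio = pca_by_eig(X_demo, n_components=2)
print(projected.shape)
print(ratio)
```

Each principal component is a linear combination of the original variables, which is what "combining the existing variables" means above.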

🔻Trying It on a Dataset

import pandas as pd
import seaborn as sns
from sklearn.datasets import load_iris

# Load iris into a DataFrame and attach the class label
iris = load_iris()
iris_pd = pd.DataFrame(iris.data, columns=iris.feature_names)
iris_pd['species'] = iris.target
iris_pd.head(6)

# Scatter plots of selected feature pairs, colored by species
sns.pairplot(iris_pd, hue='species', height=3,
             x_vars=['sepal length (cm)', 'petal width (cm)'],
             y_vars=['petal length (cm)', 'sepal width (cm)']);

🔻Using StandardScaler

from sklearn.preprocessing import StandardScaler

# Standardize each feature to zero mean and unit variance before applying PCA
iris_ss = StandardScaler().fit_transform(iris.data)
iris_ss[:3]
>>>>
array([[-0.90068117,  1.01900435, -1.34022653, -1.3154443 ],
       [-1.14301691, -0.13197948, -1.34022653, -1.3154443 ],
       [-1.38535265,  0.32841405, -1.39706395, -1.3154443 ]])
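As a quick sanity check on what StandardScaler does, the sketch below reproduces it by hand: subtract each column's mean and divide by its standard deviation. The variable names `manual` and `scaled` are only for illustration. Note that `np.std` defaults to the population standard deviation (`ddof=0`), which is what StandardScaler uses.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler

X = load_iris().data

# Manual standardization: (x - mean) / std, column by column
manual = (X - X.mean(axis=0)) / X.std(axis=0)   # np.std uses ddof=0 by default
scaled = StandardScaler().fit_transform(X)

# Both approaches should agree up to floating-point error
print(np.allclose(manual, scaled))
```

Scaling matters here because PCA looks for directions of maximum variance, so a feature measured on a larger scale would otherwise dominate the components.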

🔻Applying PCA

→ Parameter : n_components (the number of principal components to keep)

pca.fit

from sklearn.decomposition import PCA

# Helper that fits PCA and returns the transformed data and the fitted PCA object
def get_pca_data(ss_data, n_components=2):
    # Create a PCA object that keeps n_components principal components
    pca = PCA(n_components=n_components)
    # Fit PCA on the StandardScaler-transformed dataset
    pca.fit(ss_data)
    # Return the projected data together with the fitted PCA object
    return pca.transform(ss_data), pca

iris_pca, pca = get_pca_data(iris_ss, 2)
iris_pca.shape
>>>>
(150, 2)
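To see what `transform` is doing, the projection can be reproduced from the fitted attributes `mean_` and `components_`. This is a small sketch that rebuilds the scaled data on its own, so the names `X_scaled`, `pca_demo`, and `manual` are illustrative and separate from the variables used above. It assumes the default `whiten=False`.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X_scaled = StandardScaler().fit_transform(load_iris().data)
pca_demo = PCA(n_components=2).fit(X_scaled)

# components_ has shape (n_components, n_features): one row per principal axis
print(pca_demo.components_.shape)

# transform() centers the data with mean_ and projects it onto the principal axes
manual = (X_scaled - pca_demo.mean_) @ pca_demo.components_.T
print(np.allclose(manual, pca_demo.transform(X_scaled)))
```

Each row of `components_` holds the weights of the four original features in one principal component.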

🔻Converting the PCA Output to a DataFrame

→ The original 4 variables have been reduced to 2 variables (PC1 and PC2).

def get_pd_from_pca(pca_data, cols=['PC1', 'PC2']):
    return pd.DataFrame(pca_data, columns=cols)

iris_pd_pca = get_pd_from_pca(iris_pca)
iris_pd_pca['species'] = iris.target
iris_pd_pca.head()

🔻Checking the Results 1

→ The two principal components together explain about 96% of the variance in the original data (0.7296 + 0.2285 ≈ 0.958).
→ In other words, reducing the dimensionality from 4 to 2 lost only about 4% of the variance.

pca.explained_variance_ratio_
>>>>
array([0.72962445, 0.22850762])
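The ratios can also be accumulated to decide how many components to keep. The sketch below fits PCA with all four components and prints the cumulative explained variance. It then lets scikit-learn choose the count by passing a float to `n_components`, which is a documented option. The 0.95 threshold and the names `pca_full` and `pca_95` are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X_scaled = StandardScaler().fit_transform(load_iris().data)

# Keep every component to see how the explained variance accumulates
pca_full = PCA().fit(X_scaled)
cumulative = np.cumsum(pca_full.explained_variance_ratio_)
print(cumulative)

# A float in (0, 1) asks for the smallest number of components whose
# cumulative explained variance exceeds that fraction
pca_95 = PCA(n_components=0.95, svd_solver='full').fit(X_scaled)
print(pca_95.n_components_)
```

On the scaled iris data, two components already pass the 0.95 threshold, which matches the sum of the two ratios shown above.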

🔻Checking the Results 2

→ Apply RandomForest to both datasets.
→ The cross-validated accuracy differs by about 0.05 between the original data (0.96) and the PCA-transformed data (about 0.91).

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def rf_scores(X, y, cv=5):
    # Print the mean cross-validated accuracy of a random forest on (X, y)
    rf = RandomForestClassifier(random_state=13, n_estimators=100)
    score_rf = cross_val_score(rf, X, y, scoring='accuracy', cv=cv)
    print('Score : ', np.mean(score_rf))

rf_scores(iris_ss, iris.target)
>>>>
Score :  0.96

# Repeat with only the two principal components as features
pca_X = iris_pd_pca[['PC1', 'PC2']]
rf_scores(pca_X, iris.target)
>>>>
Score :  0.9066666666666666
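One caveat about the comparison above: the scaler and PCA were fitted on all 150 rows before cross-validation, so each validation fold had already influenced the preprocessing. A possible alternative, sketched below, wraps the steps in a scikit-learn pipeline so they are re-fitted on the training folds only. The names `pipe` and `scores` are illustrative, and the resulting score may differ slightly from the one above.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# Scaling and PCA are fitted inside each training fold, then applied to the held-out fold
pipe = make_pipeline(
    StandardScaler(),
    PCA(n_components=2),
    RandomForestClassifier(n_estimators=100, random_state=13),
)
scores = cross_val_score(pipe, X, y, scoring='accuracy', cv=5)
print('Score : ', scores.mean())
```

For a dataset as small and clean as iris the difference is likely minor, but the pipeline form is the safer habit when preprocessing is learned from the data.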