Study_note(zb_data)/Machine Learning

์Šคํ„ฐ๋””๋…ธํŠธ (ํ†ต๊ณ„์  ํšŒ๊ท€)

KloudHyun 2023. 9. 28. 21:37

๐Ÿ“Œ ํ†ต๊ณ„์  ํšŒ๊ท€

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

data = pd.read_csv('../data/ecommerce.csv', sep=',')

data.drop(['Email', 'Address', 'Avatar'], axis=1, inplace=True)
data.head()

๐Ÿ”ปBoxplot์„ ๊ทธ๋ ค๋ณด์ž

plt.figure(figsize=(12, 8))
sns.boxplot(data=data.iloc[:, :-1]);

plt.figure(figsize=(12, 8))
sns.boxplot(data=data['Yearly Amount Spent']);

๐Ÿ”ปPairplot์œผ๋กœ ๊ฒฝํ–ฅ์„ฑ์„ ํ™•์ธํ•ด๋ณด์ž

- ๋ฉค๋ฒ„์‰ฝ๊ณผ ์—ฐ๊ฐ„ ์ง€๋ถˆ ๊ธˆ์•ก์˜ ์–‘์ด ์ƒ๊ด€๊ด€๊ณ„๊ฐ€ ์žˆ๋Š” ๊ฒƒ์„ ํ™•์ธํ•  ์ˆ˜ ์žˆ๋‹ค.

plt.figure(figsize=(12, 6))
sns.pairplot(data=data);

๐Ÿ”ป๋” ์ž์„ธํ•˜๊ฒŒ ํ™•์ธํ•ด๋ณด๊ธฐ

- ์ƒ๊ด€๊ด€๊ณ„๊ฐ€ ์žˆ๋Š” ๊ฒƒ์œผ๋กœ ํŒŒ์•…์ด ๋œ๋‹ค.

plt.figure(figsize=(12, 6))
sns.lmplot(x='Length of Membership', y= 'Yearly Amount Spent', data=data);

๐Ÿ“Œ ์ƒ๊ด€์„ฑ์ด ๋†’์€ ๊ฒƒ์„ ํ™œ์šฉํ•˜์—ฌ ํšŒ๊ท€๋ฅผ ํ•ด๋ณด์ž

import statsmodels.api as sm

X=data['Length of Membership']
y=data['Yearly Amount Spent']
lm = sm.OLS(y, X).fit()
lm.summary()

์ถœ์ฒ˜ : ์ œ๋กœ๋ฒ ์ด์Šค ๋ฐ์ดํ„ฐ ์Šค์ฟจ

๐Ÿ”ป๋” ์ž์„ธํ•˜๊ฒŒ ํ™•์ธํ•ด๋ณด๊ธฐ

pred = lm.predict(X)

sns.scatterplot(x=X, y=y)
plt.plot(X, pred, 'r', ls='dashed', lw=3)

sns.scatterplot(x=y, y=pred)
plt.plot([min(y), max(y)], [min(y), max(y)], 'r', ls='dashed', lw=3)
plt.plot([0, max(y)], [0, max(y)], 'b', ls='dashed', lw=3)

๐Ÿ”ป์ƒ์ˆ˜ํ•ญ ๋„ฃ์–ด์ฃผ๊ธฐ

X = np.c_[X, [1]*len(X)]
X[:5]
>>>>
array([[4.08262063, 1.        ],
       [2.66403418, 1.        ],
       [4.1045432 , 1.        ],
       [3.12017878, 1.        ],
       [4.44630832, 1.        ]])
lm = sm.OLS(y, X).fit()
lm.summary()

๐Ÿ”ปํ™•์ธํ•ด๋ณด์ž

- R-squared

- AIC (๋งŒ๋“ค์–ด๋‚ธ ๋ชจ๋ธ์ด ๋ฐ์ดํ„ฐ๋ฅผ ์–ผ๋งˆ๋‚˜ ์ž˜ ๋ฐ˜์˜ํ•˜๋Š”์ง€ ์ธก์ •ํ•˜๋Š” ๋„๊ตฌ)

๐Ÿ“Œ sklearn ํ•ด๋ณด๊ธฐ

from sklearn.model_selection import train_test_split

X = data.drop('Yearly Amount Spent', axis=1)
y = data['Yearly Amount Spent']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=13)
lm = sm.OLS(y_train, X_train).fit()
lm.summary()

pred = lm.predict(X_test)

sns.scatterplot(x=y_test, y=pred)
plt.plot([min(y_test), max(y_test)], [min(y_test), max(y_test)], 'r');