์คํฐ๋๋ ธํธ (TF-IDF ํ์ฉ)
KloudHyun
2023. 10. 6. 13:31
๐ TF-IDF
๐ปVectorize ํ ๋ฌธ์ฅ์ Tfidf ๋ฒกํฐ๋ผ์ด์ ์ ๋ณํํ๊ธฐ
→ Term Frequency - Inverse Document Frequency
→ TF -- a value indicating how frequently a given word appears within a document
→ IDF -- a word like "shark" rarely appears in general documents, but in a collection of documents that are all about sharks it becomes commonplace ---> as a result, the other words that can subdivide and distinguish the individual documents receive higher weights.
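For reference, with scikit-learn's defaults (smooth_idf=True) the idf is computed as idf(t) = ln((1 + n) / (1 + df(t))) + 1, where n is the number of documents and df(t) is the number of documents containing term t; each document vector is then L2-normalized. A minimal sketch of that arithmetic (the toy numbers here are hypothetical, not taken from the data below):
import numpy as np

n_docs = 4  # hypothetical: total number of documents
df_t = 1    # hypothetical: number of documents containing term t
tf_t = 2    # hypothetical: raw count of t in one document

idf_t = np.log((1 + n_docs) / (1 + df_t)) + 1  # scikit-learn's smoothed idf
print(tf_t * idf_t)  # tf-idf weight of t before L2 normalization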
from sklearn.feature_extraction.text import TfidfVectorizer

# min_df=1: keep any term that appears in at least one document
# decode_error='ignore': silently skip byte sequences that fail to decode
vectorizer = TfidfVectorizer(min_df=1, decode_error='ignore')
# ๋์ด์ฐ๊ธฐ ๊ธฐ์ค ํฉ์น ๋ฌธ์ฅ์ Tfidf ๋ฒกํฐ๋ผ์ด์ ์ ๋ณํ
# [' ์์ฒ ๋ฐ์ ์์ด ๋ค ์ ๋๋ฌด ์ผ์ฐ ์ปค๋ฒ ๋ ค',
# ' ๋ด ๊ฐ ์์ฒ ๋ฐ์ ๊ฑฐ ์๋ ์ฌ๋ ๋ถ์ํด',
# ' ์ ์ฌ๋ ์ฌ๋ ๋ค ์ ์ข์ ์ฌ๋ ๋๊ธฐ ์ฌ์',
# ' ์๋ฌด ์ผ๋ ์๋์ผ ๊ด์ฐฎ์']
X = vectorizer.fit_transform(contents_for_vectorize)
num_samples, num_features = X.shape
num_samples, num_features
>>>>
(4, 17)
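The 17 features are the vocabulary tokens learned from the four sentences. To inspect them (a quick check, assuming the vectorizer and X from above; get_feature_names_out() exists in scikit-learn 1.0+, older versions use get_feature_names()):
print(vectorizer.get_feature_names_out())  # the 17 vocabulary tokens
print(X.toarray().round(2))                # dense 4 x 17 view of the tf-idf matrix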
# ์ด์ ์ ๋์ด์ฐ๊ธฐ ๊ธฐ์ค์ผ๋ก ํฉ์ณค๋ ํ
์คํธ ๋ฌธ์ฅ์ ๋ฒกํฐ๋ผ์ด์ฆํ
# [' ์์ฒ ๋ฐ๊ธฐ ์ซ์ด ๊ด์ฐฎ์']
new_post_vec = vectorizer.transform(new_post_for_vectorize)
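Note that transform() reuses the vocabulary learned by fit_transform(), so any token in the test sentence that never appeared in the training sentences is simply dropped. A quick shape check (assuming the objects above):
print(new_post_vec.shape)      # (1, 17): same 17 features as the training matrix
print(new_post_vec.toarray())  # tf-idf weights of the test sentence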
๐ปVectorize ํ ๋ฌธ์ฅ์ Tfidf ๋ฒกํฐ๋ผ์ด์ ์ ๋ณํํ๊ธฐ
# ๊ธฐ์ค ๋ฌธ์ฅ๊ณผ ํ
์คํธ ๋ฌธ์ฅ ์ฌ์ด์ ๊ฑฐ๋ฆฌ ๊ตฌํ๊ธฐ
import scipy as sp
import scipy.linalg  # makes sp.linalg available as an attribute of scipy

def dist_norm(v1, v2):
    # L2-normalize both sparse vectors, then return the Euclidean
    # distance between them (so document length does not dominate)
    v1_normalized = v1 / sp.linalg.norm(v1.toarray())
    v2_normalized = v2 / sp.linalg.norm(v2.toarray())
    delta = v1_normalized - v2_normalized
    return sp.linalg.norm(delta.toarray())
# iterate over the rows of X (one row per base sentence) and measure each row's distance to the test sentence
dist = [dist_norm(each, new_post_vec) for each in X]
dist
>>>>
[1.254451632446019, 1.2261339938790283, 1.4142135623730951, 1.1021396119773588]
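Because both vectors are unit length inside dist_norm, this distance is tied to cosine similarity by ||u - v||^2 = 2(1 - cos θ), so the ranking is the same as a cosine-similarity ranking. A cross-check with scikit-learn's cosine_similarity (an added sketch, not part of the original note):
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

sims = cosine_similarity(X, new_post_vec).ravel()  # cosine similarity to each sentence
print(np.sqrt(2 * (1 - sims)))                     # should reproduce the dist values above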
print('Best post is ', dist.index(min(dist)), ', dist = ', min(dist))
print('Test post is --> ', new_post)
print('Best dist post is --> ', contents[dist.index(min(dist))])
>>>>
Best post is 3 , dist = 1.1021396119773588
Test post is --> ['์์ฒ๋ฐ๊ธฐ ์ซ์ด ๊ด์ฐฎ์']
Best dist post is --> ์๋ฌด ์ผ๋ ์๋์ผ ๊ด์ฐฎ์
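The same best-match lookup can also be written with numpy's argmin (just a stylistic alternative over the dist list above):
import numpy as np

best_idx = int(np.argmin(dist))
print('Best post is', best_idx, ', dist =', dist[best_idx])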