
Study Note (Using TF-IDF)

KloudHyun 2023. 10. 6. 13:31

📌 TF-IDF

🔻Transforming the vectorized sentences with the Tfidf vectorizer

→ Term Frequency - Inverse Document Frequency

→ TF -- a value indicating how often a particular word appears within a document.

→ IDF -- a value indicating how rare a word is across the whole document collection. The word "atom" rarely appears in general documents, but in a collection of documents about atoms it becomes a commonplace word. ---> In such a collection, other words that distinguish the individual documents more finely receive the higher weights.
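
The "atom" example above can be checked by hand. The following is a minimal sketch using only the standard library, not the vectorizer's actual implementation: the toy corpus, the whitespace tokenization, and the helper names `tf`, `idf`, and `tfidf` are all assumptions made for illustration. The IDF formula is the smoothed form ln((1 + n) / (1 + df)) + 1, which is the variant scikit-learn documents for its default `smooth_idf=True` setting; the L2 normalization that TfidfVectorizer also applies by default is left out to keep the sketch short.

```python
import math
from collections import Counter

# Hypothetical toy corpus: every document is about atoms,
# so "atom" appears in all of them.
docs = [
    "atom nucleus electron",
    "atom proton neutron",
    "atom electron orbit electron",
]
tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

# Document frequency: number of documents that contain each term.
df = Counter(term for tokens in tokenized for term in set(tokens))

def tf(term, tokens):
    """Raw count of `term` in one tokenized document."""
    return tokens.count(term)

def idf(term):
    """Smoothed IDF: ln((1 + n) / (1 + df)) + 1."""
    return math.log((1 + n_docs) / (1 + df[term])) + 1

def tfidf(term, tokens):
    """Unnormalized TF-IDF weight of `term` in one document."""
    return tf(term, tokens) * idf(term)

# Weights for the first document: "atom" gets the lowest weight
# because it occurs in every document of the collection.
weights = {t: tfidf(t, tokenized[0]) for t in tokenized[0]}
for term, weight in sorted(weights.items(), key=lambda kv: kv[1]):
    print(f"{term}: {weight:.4f}")
```

In this corpus "atom" has the lowest IDF, "electron" (two of three documents) sits in the middle, and "nucleus" (one document) gets the highest weight, which is the behavior described above.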

from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(min_df=1, decode_error='ignore')
# Transform the sentences (tokens re-joined with spaces) with the Tfidf vectorizer
# [' 상처 받은 아이 들 은 너무 일찍 커버 려',
# ' 내 가 상처 받은 거 아는 사람 불안해',
# ' 잘 사는 사람 들 은 좋은 사람 되기 쉬워',
# ' 아무 일도 아니야 괜찮아']
X = vectorizer.fit_transform(contents_for_vectorize)
num_samples, num_features = X.shape
num_samples, num_features
>>>>
(4, 17)
# Vectorize the test sentence that was joined with spaces earlier
# (transform only, so the vocabulary learned from the four sentences is reused)
# [' 상처 받기 싫어 괜찮아']
new_post_vec = vectorizer.transform(new_post_for_vectorize)

🔻Computing the distance between the reference sentences and the test sentence

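import scipy as sp
import scipy.linalg  # ensures sp.linalg is available on older SciPy versions

# Compute the distance between a reference sentence and the test sentence
def dist_norm(v1, v2):
    # Scale each sparse row vector to unit length
    v1_normalized = v1 / sp.linalg.norm(v1.toarray())
    v2_normalized = v2 / sp.linalg.norm(v2.toarray())

    # Euclidean distance between the two unit vectors
    delta = v1_normalized - v2_normalized
    return sp.linalg.norm(delta.toarray())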

dist = [dist_norm(each, new_post_vec) for each in X]
dist
>>>>
[1.254451632446019, 1.2261339938790283, 1.4142135623730951, 1.1021396119773588]
print('Best post is ', dist.index(min(dist)), ', dist = ', min(dist))
print('Test post is --> ', new_post)
print('Best dist post is --> ', contents[dist.index(min(dist))])
>>>>
Best post is  3 , dist =  1.1021396119773588
Test post is -->  ['상처받기 싫어 괜찮아']
Best dist post is -->  아무 일도 아니야 괜찮아
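
The third distance in the list above, 1.4142135623730951, is exactly the square root of 2. That is what the normalized distance gives whenever two sentences share no vocabulary terms, and it is also the largest value possible for non-negative vectors such as TF-IDF weights. The sketch below illustrates this with plain Python lists instead of sparse matrices; the helper names `normalize`, `dist_norm_dense`, and `cosine` and the example vectors are assumptions made for illustration, not part of the original code. It also checks the identity d² = 2 - 2·cos, which means ranking by this distance is the same as ranking by cosine similarity.

```python
import math

def normalize(v):
    """Scale a dense vector to unit (L2) length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def dist_norm_dense(v1, v2):
    """Euclidean distance between the unit-length versions of v1 and v2."""
    a, b = normalize(v1), normalize(v2)
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine(v1, v2):
    """Cosine similarity of v1 and v2."""
    a, b = normalize(v1), normalize(v2)
    return sum(x * y for x, y in zip(a, b))

# Two vectors with no shared non-zero component (no common terms).
no_overlap = dist_norm_dense([1.0, 2.0, 0.0, 0.0], [0.0, 0.0, 3.0, 1.0])

# Two vectors that share one term.
u = [1.0, 1.0, 0.0]
w = [1.0, 0.0, 1.0]
d = dist_norm_dense(u, w)
c = cosine(u, w)

print(no_overlap, math.sqrt(2))
print(d ** 2, 2 - 2 * c)
```

A smaller distance therefore means a larger cosine similarity, so picking `min(dist)` selects the reference sentence whose TF-IDF vector points in the direction most similar to the test sentence.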