๋ณธ๋ฌธ ๋ฐ”๋กœ๊ฐ€๊ธฐ

Study_note(zb_data)/Machine Learning

์Šคํ„ฐ๋””๋…ธํŠธ (๋‚˜์ด๋ธŒ ๋ฒ ์ด์ฆˆ ๋ถ„๋ฅ˜)

๐Ÿ“Œ Naive Bayes Classifier ๊ฐ์„ฑ ๋ถ„์„ (eng)

๐Ÿ”ปtokenize๋ฅผ ํ™œ์šฉํ•œ๋‹ค.

→ ์ง€๋„ํ•™์Šต์ด๊ธฐ ๋•Œ๋ฌธ์— train ๋ฐ์ดํ„ฐ ์ฒ˜๋Ÿผ ์ •๋‹ต์„ ์•Œ๋ ค์ฃผ์–ด์•ผ ํ•œ๋‹ค.

→ ์ „์ฒด ๋ง๋ญ‰์น˜๋ฅผ ๋งŒ๋“ ๋‹ค

from nltk.tokenize import word_tokenize
import nltk
# 1. Take each sentence from the train data.
# 2. Tokenize (split) each sentence.
# 3. Building a set removes duplicates.

train = [
    ('i like you', 'pos'),
    ('i hate you', 'neg'),
    ('you like me', 'neg'),
    ('i like her', 'pos'),
]

all_words = set(
    word.lower() for sentence in train for word in word_tokenize(sentence[0])
)
all_words
>>>>
{'hate', 'her', 'i', 'like', 'me', 'you'}

๐Ÿ”ป๋‹จ์–ด์˜ ์œ ๋ฌด ํŒŒ์•… (๋ง๋ญ‰์น˜ ๋Œ€๋น„)

# 1. train ์—์„œ ๋ฐ์ดํ„ฐ ํ•œ์Œ์”ฉ (๋ฌธ์žฅ, ๊ฐ์ •) ๊ฐ€์ ธ์˜จ๋‹ค
# ->  ex)  ('i like you', 'pos')
# 2. ๋ฌธ์žฅ๋งŒ ๊ฐ€์ง€๊ณ  ์™€์„œ ๋„์–ด์“ฐ๊ธฐ๋กœ ๋ถ„๋ฆฌ
# ->  ex)  ('i like you')
# 3. ๋ถ„๋ฆฌํ•œ ๋‹จ์–ด๋“ค์ด all_words์— ์žˆ๋Š”์ง€ ํŒŒ์•…

t = [({word : (word in word_tokenize(x[0])) for word in all_words}, x[1]) for x in train]
t
>>>>
[({'me': False,
   'like': True,
   'her': False,
   'you': True,
   'hate': False,
   'i': True},
  'pos'),
...

๐Ÿ”ปtrain!

# NaiveBayesClassifier ํ™œ์šฉ
# ํŠน์„ฑ ํŒŒ์•…

classifier = nltk.NaiveBayesClassifier.train(t)
classifier.show_most_informative_features()
>>>>
Most Informative Features
                    hate = False             pos : neg    =      1.7 : 1.0
                     her = False             neg : pos    =      1.7 : 1.0
                       i = True              pos : neg    =      1.7 : 1.0
                    like = True              pos : neg    =      1.7 : 1.0
                      me = False             pos : neg    =      1.7 : 1.0
                     you = True              neg : pos    =      1.7 : 1.0
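The 1.7 : 1.0 ratios above can be reproduced by hand. NLTK's `NaiveBayesClassifier` uses ELE (expected likelihood estimation) by default, which adds 0.5 to each count; a minimal sketch for the `hate = False` feature, assuming that +0.5 smoothing over the two pos and two neg training sentences:

```python
# Reproduce the 1.7 : 1.0 ratio for hate = False, assuming NLTK's
# default ELE smoothing (+0.5 per count, 2 bins: True/False).
pos_total, neg_total = 2, 2   # two 'pos' and two 'neg' training sentences
hate_false_pos = 2            # 'hate' is absent from both pos sentences
hate_false_neg = 1            # 'hate' is absent from one of two neg sentences

p_pos = (hate_false_pos + 0.5) / (pos_total + 1)   # smoothed P(hate=False | pos)
p_neg = (hate_false_neg + 0.5) / (neg_total + 1)   # smoothed P(hate=False | neg)

print(round(p_pos / p_neg, 1))   # → 1.7
```

The same arithmetic explains every other row of the table; with only four training sentences the counts are small enough to check each ratio mentally.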

๐Ÿ”ป์ด์ œ ํ…Œ์ŠคํŠธ ๋ฐ์ดํ„ฐ๋ฅผ ๋„ฃ์–ด๋ณด์ž

# for๋ฌธ์„ ํ†ตํ•ด all_words์—์„œ word๋ฅผ ํ•˜๋‚˜์”ฉ ๋นผ์˜จ๋‹ค
# ๊ฐ word๊ฐ€ test_sentence๋ฅผ tokenizeํ•œ ๊ฒƒ์— ์žˆ๋Š” ์ง€ ์—†๋Š” ์ง€ ํŒŒ์•…

test_sentence = 'i like MeRui'
test_sent_features = {
    word.lower() : (word in word_tokenize(test_sentence.lower())) for word in all_words
}
test_sent_features
>>>>
{'me': False, 'like': True, 'her': False,  'you': False,  'hate': False,  'i': True}
# ํŒŒ์•…ํ•œ ๋ฐ์ดํ„ฐ๋ฅผ ๋ถ„๋ฅ˜๊ธฐ์— ๋„ฃ๋Š”๋‹ค

classifier.classify(test_sent_features)
>>>>
'pos'

๐Ÿ“Œ Naive Bayes Classifier ๊ฐ์„ฑ ๋ถ„์„ (kor)

๐Ÿ”ปtokenize๋ฅผ ํ™œ์šฉํ•œ๋‹ค.

→ ํ•œ๊ธ€์ด๊ธฐ ๋•Œ๋ฌธ์— konlpy ํ™œ์šฉ

from konlpy.tag import Okt
pos_tagger = Okt()

train = [
    ("๋ฉ”๋ฆฌ๊ฐ€ ์ข‹์•„", "pos"),
    ("๊ณ ์–‘์ด๋„ ์ข‹์•„", "pos"),
    ("๋‚œ ์ˆ˜์—…์ด ์ง€๋ฃจํ•ด", "neg"),
    ("๋ฉ”๋ฆฌ๋Š” ์ด์œ ๊ณ ์–‘์ด์•ผ", "pos"),
    ("๋‚œ ๋งˆ์น˜๊ณ  ๋ฉ”๋ฆฌ๋ž‘ ๋†€๊ฑฐ์•ผ", "pos"),
]
all_words = set(
    word for sentence in train for word in word_tokenize(sentence[0])
)
all_words
>>>>
{'๊ณ ์–‘์ด๋„',  '๊ณ ์–‘์ด์•ผ',  '๋‚œ',  '๋†€๊ฑฐ์•ผ',  '๋งˆ์น˜๊ณ ',  '๋ฉ”๋ฆฌ๊ฐ€',  '๋ฉ”๋ฆฌ๋Š”',  '๋ฉ”๋ฆฌ๋ž‘',  '์ˆ˜์—…์ด',  '์ด์œ',  '์ข‹์•„',  '์ง€๋ฃจํ•ด'}

๐Ÿ”ป๋‹จ์–ด์˜ ์œ ๋ฌด ํŒŒ์•…

t = [({word : (word in word_tokenize(x[0])) for word in all_words}, x[1]) for x in train]
t
>>>>
[({'๋ฉ”๋ฆฌ๊ฐ€': True,
   '๋งˆ์น˜๊ณ ': False,
   '์ง€๋ฃจํ•ด': False,
   '๋†€๊ฑฐ์•ผ': False,
   '๋ฉ”๋ฆฌ๋Š”': False,
   '๋‚œ': False,
   '์ด์œ': False,
   '๋ฉ”๋ฆฌ๋ž‘': False,
   '์ˆ˜์—…์ด': False,
   '๊ณ ์–‘์ด๋„': False,
   '์ข‹์•„': True,
   '๊ณ ์–‘์ด์•ผ': False},
  'pos'),
  ....

๐Ÿ”ปํŠน์„ฑ ํŒŒ์•…

classifier = nltk.NaiveBayesClassifier.train(t)
classifier.show_most_informative_features()
>>>>
Most Informative Features
                       ๋‚œ = True              neg : pos    =      2.5 : 1.0
                      ์ข‹์•„ = False             neg : pos    =      1.5 : 1.0
                    ๊ณ ์–‘์ด๋„ = False             neg : pos    =      1.1 : 1.0
                    ๊ณ ์–‘์ด์•ผ = False             neg : pos    =      1.1 : 1.0
                     ๋†€๊ฑฐ์•ผ = False             neg : pos    =      1.1 : 1.0
                     ๋งˆ์น˜๊ณ  = False             neg : pos    =      1.1 : 1.0
                     ๋ฉ”๋ฆฌ๊ฐ€ = False             neg : pos    =      1.1 : 1.0
                     ๋ฉ”๋ฆฌ๋Š” = False             neg : pos    =      1.1 : 1.0
                     ๋ฉ”๋ฆฌ๋ž‘ = False             neg : pos    =      1.1 : 1.0
                      ์ด์œ = False             neg : pos    =      1.1 : 1.0

๐Ÿ”ปํ…Œ์ŠคํŠธ ๋ฌธ์žฅ ๋„ฃ์–ด์„œ ํ™•์ธ

→ ํ…Œ์ŠคํŠธ ๋ฌธ์žฅ์„ ํ™•์ธ ํ•ด๋ณด๋‹ˆ, ํ˜•ํƒœ์†Œ ๋ถ„์„์ด ํ•„์ˆ˜์ ์œผ๋กœ ์ด๋ฃจ์–ด์ ธ์•ผ ๋” ์ •ํ™•ํ•œ ๊ฒฐ๊ณผ๊ฐ€ ๋‚˜์˜ค๋Š” ๊ฒƒ์ด ํ™•์ธ๋œ๋‹ค.

test_sentence = '๋‚œ ์ˆ˜์—…์ด ๋งˆ์น˜๋ฉด ๋ฉ”๋ฆฌ๋ž‘ ๋†€๊ฑฐ์•ผ'
test_sent_features = {
    word.lower() : (word in word_tokenize(test_sentence.lower())) for word in all_words
}
test_sent_features
>>>>
{'๋ฉ”๋ฆฌ๊ฐ€': False,  '๋งˆ์น˜๊ณ ': False,  '์ง€๋ฃจํ•ด': False,  '๋†€๊ฑฐ์•ผ': True,  '๋ฉ”๋ฆฌ๋Š”': False,  '๋‚œ': True,  '์ด์œ': False,  '๋ฉ”๋ฆฌ๋ž‘': True,
 '์ˆ˜์—…์ด': True,  '๊ณ ์–‘์ด๋„': False,  '์ข‹์•„': False,  '๊ณ ์–‘์ด์•ผ': False}
classifier.classify(test_sent_features)
>>>>
'neg'
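One reason the whitespace-token features mislead the classifier: Korean inflection means surface forms rarely match exactly across sentences. A quick sketch (pure Python, no konlpy needed) of the overlap between the test sentence and the `all_words` vocabulary above:

```python
# Whitespace-token vocabulary, copied from all_words above
train_vocab = {'고양이도', '고양이야', '난', '놀거야', '마치고', '메리가',
               '메리는', '메리랑', '수업이', '이쁜', '좋아', '지루해'}

test_tokens = set('난 수업이 마치면 메리랑 놀거야'.split())

print(test_tokens & train_vocab)   # tokens that match the training vocabulary
print(test_tokens - train_vocab)   # '마치면' never matches the trained '마치고'
```

Of the tokens that do match, '난' carries the strongest weight (2.5 : 1 toward neg in the feature table above), which is why the sentence comes out 'neg' despite its positive meaning.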

๐Ÿ”ปํ˜•ํƒœ์†Œ ๋ถ„์„์„ ํ•œ ํ›„ ๋‹ค์‹œ ์‹œ๋„

→ ํ˜•ํƒœ์†Œ ๋ถ„์„ ํ›„ ํ’ˆ์‚ฌ๋ฅผ ๋‹จ์–ด ๋’ค์— ๋ถ™์—ฌ ๋„ฃ๋„๋ก ํ•ด๋ณด์ž

def tokenize(doc):
    return ["/".join(t) for t in pos_tagger.pos(doc, norm=True, stem=True)]
train_docs = [(tokenize(row[0]), row[1]) for row in train]
train_docs
>>>>
[(['๋ฉ”๋ฆฌ/Noun', '๊ฐ€/Josa', '์ข‹๋‹ค/Adjective'], 'pos'),
 (['๊ณ ์–‘์ด/Noun', '๋„/Josa', '์ข‹๋‹ค/Adjective'], 'pos'),
 (['๋‚œ/Noun', '์ˆ˜์—…/Noun', '์ด/Josa', '์ง€๋ฃจํ•˜๋‹ค/Adjective'], 'neg'),
 (['๋ฉ”๋ฆฌ/Noun', '๋Š”/Josa', '์ด์˜๋‹ค/Adjective', '๊ณ ์–‘์ด/Noun', '์•ผ/Josa'], 'pos'),
 (['๋‚œ/Noun', '๋งˆ์น˜/Noun', '๊ณ /Josa', '๋ฉ”๋ฆฌ/Noun', '๋ž‘/Josa', '๋†€๋‹ค/Verb'], 'pos')]
tokens = [t for d in train_docs for t in d[0]]
tokens
>>>>
['๋ฉ”๋ฆฌ/Noun',  '๊ฐ€/Josa',  '์ข‹๋‹ค/Adjective',  '๊ณ ์–‘์ด/Noun',  '๋„/Josa',  '์ข‹๋‹ค/Adjective',  '๋‚œ/Noun',  '์ˆ˜์—…/Noun', '์ด/Josa',
 '์ง€๋ฃจํ•˜๋‹ค/Adjective',  '๋ฉ”๋ฆฌ/Noun',  '๋Š”/Josa',  '์ด์˜๋‹ค/Adjective',  '๊ณ ์–‘์ด/Noun',  '์•ผ/Josa',  '๋‚œ/Noun',  '๋งˆ์น˜/Noun',  '๊ณ /Josa',
 '๋ฉ”๋ฆฌ/Noun',  '๋ž‘/Josa',  '๋†€๋‹ค/Verb']

๐Ÿ”ป๋‹จ์–ด์˜ ์œ ๋ฌด ํŒŒ์•…

# tokens์— ์žˆ๋Š” word๊ฐ€ set(doc)์— ์žˆ๋Š”์ง€ ์œ ๋ฌด ํŒŒ์•…

def term_exists(doc):
    return {word : (word in set(doc)) for word in tokens}

# 1. Extract d (token list) and c (sentiment) from train_docs
# 2. Pass the tokens to term_exists to check each word's presence and return the feature dict

train_xy = [(term_exists(d), c) for d,c in train_docs]
train_xy
>>>>
[({'๋ฉ”๋ฆฌ/Noun': True,
   '๊ฐ€/Josa': True,
   '์ข‹๋‹ค/Adjective': True,
   '๊ณ ์–‘์ด/Noun': False,
   '๋„/Josa': False,
   '๋‚œ/Noun': False,
   '์ˆ˜์—…/Noun': False,
   '์ด/Josa': False,
   '์ง€๋ฃจํ•˜๋‹ค/Adjective': False,
   '๋Š”/Josa': False,
   '์ด์˜๋‹ค/Adjective': False,
   '์•ผ/Josa': False,
   '๋งˆ์น˜/Noun': False,
   '๊ณ /Josa': False,
   '๋ž‘/Josa': False,
   '๋†€๋‹ค/Verb': False},
  'pos'),
  ....

๐Ÿ”ป์ฃผ์š” ํŠน์„ฑ ํŒŒ์•…

classifier = nltk.NaiveBayesClassifier.train(train_xy)
classifier.show_most_informative_features()
>>>>
Most Informative Features
                  ๋‚œ/Noun = True              neg : pos    =      2.5 : 1.0
                 ๋ฉ”๋ฆฌ/Noun = False             neg : pos    =      2.5 : 1.0
                ๊ณ ์–‘์ด/Noun = False             neg : pos    =      1.5 : 1.0
            ์ข‹๋‹ค/Adjective = False             neg : pos    =      1.5 : 1.0
                  ๊ฐ€/Josa = False             neg : pos    =      1.1 : 1.0
                  ๊ณ /Josa = False             neg : pos    =      1.1 : 1.0
                 ๋†€๋‹ค/Verb = False             neg : pos    =      1.1 : 1.0
                  ๋Š”/Josa = False             neg : pos    =      1.1 : 1.0
                  ๋„/Josa = False             neg : pos    =      1.1 : 1.0
                  ๋ž‘/Josa = False             neg : pos    =      1.1 : 1.0

๐Ÿ”ปํ…Œ์ŠคํŠธ ๋ฌธ์žฅ ํ™•์ธํ•˜๊ธฐ

test_sentence = [('๋‚œ ์ˆ˜์—…์ด ๋งˆ์น˜๋ฉด ๋ฉ”๋ฆฌ๋ž‘ ๋†€๊ฑฐ์•ผ')]

test_docs = pos_tagger.pos(test_sentence[0])
test_docs
test_sent_features = {word : (word in tokens) for word in test_docs}
test_sent_features
>>>>
{('๋‚œ', 'Noun'): False,
 ('์ˆ˜์—…', 'Noun'): False,
 ('์ด', 'Josa'): False,
 ('๋งˆ์น˜', 'Noun'): False,
 ('๋ฉด', 'Josa'): False,
 ('๋ฉ”๋ฆฌ', 'Noun'): False,
 ('๋ž‘', 'Josa'): False,
 ('๋†€๊ฑฐ์•ผ', 'Verb'): False}
classifier.classify(test_sent_features)
>>>>
'pos'
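Note that every feature above came out False, so the classifier is effectively deciding from its priors alone. The cause is a key-format mismatch: `pos()` returns `(word, tag)` tuples, while the training tokens are `'word/POS'` strings. A minimal sketch of the mismatch and its fix (the sample tokens below are illustrative, not the full list):

```python
# A few training tokens in the 'word/POS' string format used above
tokens = ['난/Noun', '수업/Noun', '메리/Noun', '랑/Josa', '놀다/Verb']

# Shape of pos() output: (word, tag) tuples
test_docs = [('난', 'Noun'), ('메리', 'Noun'), ('랑', 'Josa')]

# Tuple keys never match the string tokens...
print(any(t in tokens for t in test_docs))   # → False

# ...but joining them the same way tokenize() does makes the keys comparable
joined = ['/'.join(t) for t in test_docs]
print([t in tokens for t in joined])         # → [True, True, True]
```

In the notebook itself, calling `tokenize(test_sentence)` (the helper defined earlier) instead of `pos_tagger.pos(...)` would produce keys in the matching format.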

โ€ป ์ฐธ๊ณ 

→ ํ•œ๊ธ€์˜ ํŠน์„ฑ์ƒ ์กฐ์‚ฌ ๋“ฑ์ด ๋งŽ์ด ๋ถ™๊ธฐ ๋•Œ๋ฌธ์—, ์ •ํ™•ํ•œ ๊ฒฐ๊ณผ๋ฅผ ์œ„ํ•ด์„œ๋Š” ํ˜•ํƒœ์†Œ ๋ถ„์„์€ ํ•„์ˆ˜๋กœ ์ง„ํ–‰ํ•˜์—ฌ์•ผ ํ•œ๋‹ค.
→ for๋ฌธ ๋“ฑ์ด ๋งŽ์ด ๋“ฑ์žฅํ•˜๋Š”๋ฐ ์ดํ•ดํ•˜๊ธฐ ์–ด๋ ค์šด ๋ถ€๋ถ„์€ ๋ณต์Šต์ด ํ•„์š”.