๋ณธ๋ฌธ ๋ฐ”๋กœ๊ฐ€๊ธฐ

Toy Project

์ฑ… ๊ฐ€๊ฒฉ ํšŒ๊ท€ ๋ถ„์„ (With. Naver.api)

Clone Project๋กœ ์ง„ํ–‰

 

๐Ÿ“Œ ์ฑ… ๊ฐ€๊ฒฉ์˜ ์ƒ๊ด€๊ด€๊ณ„ ํŒŒ์•…

๐Ÿ”ปNaver.Api?

→ ๊ธฐ์กด์— ์žˆ๋Š” ๋„ค์ด๋ฒ„ Api๋ฅผ ํ™œ์šฉํ•˜์—ฌ ๋Œ€๋Ÿ‰์œผ๋กœ ๋ฐ์ดํ„ฐ๋ฅผ ์ถ”์ถœํ•ด๋ณด๊ณ  ๋ถ„์„์„ ํ•ด๋ณด์ž

์ฐธ๊ณ  link

๐Ÿ”ปnaver.api ๋ถˆ๋Ÿฌ์˜ค๊ธฐ

import urllib.request
import urllib.parse  # explicit import for urllib.parse.quote used below

client_id = "6ywEQoHEOqpRJMopg74j"
client_secret = "YAeC1wvboT"
# api_node    -> API type to query, e.g. "shop", "book", etc.
# search_text -> the text to search for
# start_num   -> index of the first result to fetch (pagination)
# disp_num    -> how many results to display per page

def gen_search_url(api_node, search_text, start_num, disp_num):
    base = "https://openapi.naver.com/v1/search"
    node = "/" + api_node + ".json"
    param_query = "?query=" + urllib.parse.quote(search_text)
    param_start = "&start=" + str(start_num)
    param_disp = "&display=" + str(disp_num)
    
    return base + node + param_query + param_start + param_disp

url = gen_search_url('book', '파이썬', 10, 3)
import json
import datetime

def get_result_onpage(url):
    request = urllib.request.Request(url)
    request.add_header("X-Naver-Client-Id",client_id)
    request.add_header("X-Naver-Client-Secret",client_secret)
    response = urllib.request.urlopen(request)
    print("[%s] Url Request Success" % datetime.datetime.now())
    return json.loads(response.read().decode("utf-8"))

one_result = get_result_onpage(url)
one_result
>>>>
[2023-10-06 15:21:42.494415] Url Request Success
{'lastBuildDate': 'Fri, 06 Oct 2023 15:21:42 +0900',
 'total': 928,
 'start': 10,
 'display': 3,
 'items': [{'title': '์ฝ”๋”ฉ์€ ์ฒ˜์Œ์ด๋ผ with ํŒŒ์ด์ฌ (VS Code๋กœ ์‹œ์ž‘ํ•˜๋Š” ํŒŒ์ด์ฌ)',
   'link': 'https://search.shopping.naver.com/book/catalog/39049935621',
   'image': 'https://shopping-phinf.pstatic.net/main_3904993/39049935621.20230919123144.jpg',
   'author': '๋‚จ๊ทœ์ง„',
   'discount': '17280',
   'publisher': '์˜์ง„๋‹ท์ปด',
   'pubdate': '20230405',
   'isbn': '9788931467994',
   'description': '์ด ์ฑ…์€ ์ด 12์žฅ์˜ ํŒŒํŠธ๋กœ ๊ตฌ์„ฑ๋˜์–ด ์žˆ์Šต๋‹ˆ๋‹ค๋งŒ ์‚ฌ์‹ค 1์žฅ๋ถ€ํ„ฐ 11์žฅ๊นŒ์ง€๋Š” ๋ชจ๋‘ 12์žฅ ํŒŒ์ด์ฌ ํ”„๋กœ์ ํŠธ์˜ ํ”„๋กœ๊ทธ๋žจ์„ ์ดํ•ดํ•˜๊ณ  ์ž‘์„ฑํ•˜๊ธฐ ์œ„ํ•œ ๋‚ด์šฉ์ด๋ผ๊ณ  ๋ณผ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.\n\n๋‚ด์šฉ์€ ๋‹ค์Œ๊ณผ ๊ฐ™์ด ๊ตฌ์„ฑ๋˜์–ด ์žˆ์Šต๋‹ˆ๋‹ค. \nใ†\tPart 1 ํŒŒ์ด์ฌ ์ž…๋ฌธ \nใ†\tPart 2 ํŒŒ์ด์ฌ ์‹ค์Šต ํ™˜๊ฒฝ\nใ†\tPart 3 ํŒŒ์ด์ฌ ์ž…์ถœ๋ ฅ\nใ†\tPart 4 ๋ณ€์ˆ˜์™€ ์ž๋ฃŒํ˜•\nใ†\tPart 5 ์—ฐ์‚ฐ์ž\nใ†\tPart 6 ์กฐ๊ฑด๋ฌธ๊ณผ ๋ฐ˜๋ณต๋ฌธ\nใ†\tPart 7 ํ•จ์ˆ˜\nใ†\tPart 8 ํด๋ž˜์Šค\nใ†\tPart 9 ๋ชจ๋“ˆ๊ณผ ๋ผ์ด๋ธŒ๋Ÿฌ๋ฆฌ\nใ†\tPart 10 ํŒŒ์ผ ์ž…์ถœ๋ ฅ\nใ†\tPart 11 ์˜ˆ์™ธ์ฒ˜๋ฆฌ\nใ†\tPart 12 ํŒŒ์ด์ฌ ํ”„๋กœ์ ํŠธ\n\nํ•™์Šต ์ˆœ์„œ๋Š” Part 1๋ถ€ํ„ฐ ์ˆœ์„œ๋Œ€๋กœ ํ•™์Šตํ•˜๋ฉด ๋˜๊ฒ ์Šต๋‹ˆ๋‹ค. 1์žฅ๋ถ€ํ„ฐ 11์žฅ๊นŒ์ง€ ๊ณต๋ถ€ํ•œ ๋ชจ๋“  ๋‚ด์šฉ์„ ์ ‘๋ชฉํ•˜์—ฌ ์ด์ œ ์‹ค์ œ ๋™์ž‘ ๊ฐ€๋Šฅํ•œ ํ”„๋กœ๊ทธ๋žจ์„ ๋งŒ๋“ค์–ด๋ด…๋‹ˆ๋‹ค. ์ˆซ์ž ๋งž์ถ”๊ธฐ ๊ฒŒ์ž„, ์˜์–ด ๋‹จ์–ด ๋งž์ถ”๊ธฐ ๊ฒŒ์ž„, ์ˆซ์ž ์•ผ๊ตฌ ๊ฒŒ์ž„, ์ฝ˜์†” ๊ณ„์‚ฐ๊ธฐ, ํƒ€์ž ๊ฒŒ์ž„, ๋กœ๋˜ ๋ฒˆํ˜ธ ์ƒ์„ฑ๊ธฐ, ํŒŒ์ด์ฌ์œผ๋กœ ์—‘์…€ ํŒŒ์ผ ๋ถˆ๋Ÿฌ์˜ค๊ณ  ์ƒ์„ฑํ•˜๊ธฐ, ํŒŒ์ด์ฌ์œผ๋กœ MS-WORD ํŒŒ์ผ ์ž‘์„ฑํ•˜๊ธฐ๋กœ ์ด 9๊ฐœ์˜ ์‹ค์Šต ํ”„๋กœ๊ทธ๋žจ์„ ๋งŒ๋“ค์–ด๋ณด๊ณ  ์žˆ์Šต๋‹ˆ๋‹ค. ๋‹จ์ˆœํžˆ ์™„์„ฑ๋œ ์ฝ”๋“œ๋ฅผ ์„ค๋ช…ํ•˜๋Š” ๋ฐฉ์‹์ด ์•„๋‹Œ ์‹ค์ œ ์ฝ”๋“œ๋ฅผ ํ•˜๋‚˜์”ฉ ์‚ด์„ ๋ถ™์—ฌ ์™„์„ฑํ•ด ๋‚˜๊ฐ€๋Š” ํ˜•ํƒœ๋กœ ์ง„ํ–‰ํ•ฉ๋‹ˆ๋‹ค.\nํ”„๋กœ๊ทธ๋ž˜๋ฐ์„ ๊ณต๋ถ€ํ•  ๋•Œ, ์–ผ๋งŒํผ ๋งŽ์ด ์ง€์‹์„ ์•Œ๊ณ  ์žˆ๋А๋ƒ๊ฐ€ ์•„๋‹ˆ๋ผ ๋‚ด๊ฐ€ ์•Œ๊ณ  ์žˆ๋Š” ๋‚ด์šฉ์„ ์–ด๋–ป๊ฒŒ ์ž˜ ํ™œ์šฉํ•  ์ค„ ์•„๋А๋ƒ๊ฐ€ ์ค‘์š”ํ•ฉ๋‹ˆ๋‹ค. ๊ทธ๋ ‡๊ธฐ ๋•Œ๋ฌธ์— ์ด ์ฑ…์—์„œ ๋‚˜์˜ค๋Š” ๋‚ด์šฉ๊ณผ ๋ฌธ์ œ๋“ค์„ ๋ณด๋ฉด์„œ ์—ฌ๋Ÿฌ ๋ฐฉ๋ฉด์œผ๋กœ ํ•ด๊ฒฐํ•ด๋ณด์‹œ๊ธฐ ๋ฐ”๋ž๋‹ˆ๋‹ค. 
์†Œ์Šค์ฝ”๋“œ์—์„œ ์ œ๊ณตํ•˜๋Š” ์ •๋‹ต์€ ์˜ˆ์‹œ์ผ ๋ฟ ๊ผญ ๊ทธ๊ฒƒ๋งŒ์ด ์ •๋‹ต์€ ์•„๋‹™๋‹ˆ๋‹ค.\nํŒŒ์ด์ฌ ํ”„๋กœ์ ํŠธ๋„ ์‹คํ–‰ํ•ด๋ณด๋ฉด์„œ ์–ด๋–ป๊ฒŒ ํ•˜๋ฉด ๋” ์ข‹์€ ํ”„๋กœ๊ทธ๋žจ์œผ๋กœ ์—…๊ทธ๋ ˆ์ด๋“œํ•  ์ˆ˜ ์žˆ๋Š”์ง€ ๋ถ€์กฑํ•œ ๋ถ€๋ถ„์„ ๋ณด์™„ํ•ด๊ฐ€๋ฉฐ ๊ณต๋ถ€ํ•ด๋ณด์‹œ๊ธฐ ๋ฐ”๋ž๋‹ˆ๋‹ค.\n\nใ€ ๋Œ€์ƒ ๋…์ž์ธต ใ€‘\n- ํŒŒ์ด์ฌ์„ ์ฒ˜์Œ ์ ‘ํ•˜๋Š” ๋ถ„\n- ํŒŒ์ด์ฌ ํ”„๋กœ๊ทธ๋ž˜๋ฐ ์ž…๋ฌธ์ž ๋ฐ ํ•™์ƒ\n- ํŒŒ์ด์ฌ ๊ธฐ์ดˆ ๋ฌธ๋ฒ•๋งŒ ์•Œ๊ณ  ํ™œ์šฉํ•˜๊ธฐ ์–ด๋ ค์šด ๋ถ„'},
  {'title': 'ํ˜ผ์ž ๊ณต๋ถ€ํ•˜๋Š” ๋ฐ์ดํ„ฐ ๋ถ„์„ with ํŒŒ์ด์ฌ (1:1 ๊ณผ์™ธํ•˜๋“ฏ ๋ฐฐ์šฐ๋Š” ๋ฐ์ดํ„ฐ ๋ถ„์„ ์ž์Šต์„œ)',
   'link': 'https://search.shopping.naver.com/book/catalog/36555425618',
   'image': 'https://shopping-phinf.pstatic.net/main_3655542/36555425618.20231004072457.jpg',
   'author': '๋ฐ•ํ•ด์„ ',
   'discount': '23400',
   'publisher': 'ํ•œ๋น›๋ฏธ๋””์–ด',
   'pubdate': '20230102',
   'isbn': '9791169210287',
   'description': 'ํ˜ผ์ž ํ•ด๋„ ์ถฉ๋ถ„ํ•˜๋‹ค! 1:1 ๊ณผ์™ธํ•˜๋“ฏ ๋ฐฐ์šฐ๋Š” ๋ฐ์ดํ„ฐ ๋ถ„์„ ์ž์Šต์„œ\n\n์ด ์ฑ…์€ ๋…ํ•™์œผ๋กœ ๋ฐ์ดํ„ฐ ๋ถ„์„์„ ๋ฐฐ์šฐ๋Š” ์ž…๋ฌธ์ž๊ฐ€ ‘๊ผญ ํ•„์š”ํ•œ ๋‚ด์šฉ์„ ์ œ๋Œ€๋กœ ํ•™์Šต’ํ•  ์ˆ˜ ์žˆ๋„๋ก ๊ตฌ์„ฑํ–ˆ์Šต๋‹ˆ๋‹ค. ๋ญ˜ ๋ชจ๋ฅด๋Š”์ง€์กฐ์ฐจ ๋ชจ๋ฅด๋Š” ์ž…๋ฌธ์ž์˜ ๋ง‰์—ฐํ•œ ๋งˆ์Œ์— ์‹ญ๋ถ„ ๊ณต๊ฐํ•˜์—ฌ ๊ณผ์™ธ ์„ ์ƒ๋‹˜์ด ์•Œ๋ ค์ฃผ๋“ฏ ์นœ์ ˆํ•˜๊ฒŒ, ํ•ต์‹ฌ์ ์ธ ๋‚ด์šฉ๋งŒ ์ฝ•์ฝ• ์ง‘์–ด ์ค๋‹ˆ๋‹ค. ์ฑ…์˜ ์ฒซ ํŽ˜์ด์ง€๋ฅผ ํŽผ์ณ์„œ ๋งˆ์ง€๋ง‰ ํŽ˜์ด์ง€๋ฅผ ๋ฎ์„ ๋•Œ๊นŒ์ง€, ํ˜ผ์ž์„œ๋„ ์ถฉ๋ถ„ํžˆ ๋ฐ์ดํ„ฐ ๋ถ„์„์„ ๋ฐฐ์šธ ์ˆ˜ ์žˆ๋‹ค๋Š” ์ž์‹ ๊ฐ๊ณผ ํ™•์‹ ์ด ๊ณ„์†๋  ๊ฒƒ์ž…๋‹ˆ๋‹ค!\n\n๋ฒ ํƒ€๋ฆฌ๋” ๊ฒ€์ฆ์œผ๋กœ, ‘ํ•จ๊ป˜ ๋งŒ๋“ ’ ์ž…๋ฌธ์ž ๋งž์ถคํ˜• ๋„์„œ\n๋ฒ ํƒ€๋ฆฌ๋”์™€ ํ•จ๊ป˜ ์ž…๋ฌธ์ž์—๊ฒŒ ๋งž๋Š” ๋‚œ์ด๋„, ๋ถ„๋Ÿ‰, ํ•™์Šต ์š”์†Œ ๋“ฑ์„ ๊ณ ๋ฏผํ•˜๊ณ  ์ด๋ฅผ ์ ๊ทน ๋ฐ˜์˜ํ–ˆ์Šต๋‹ˆ๋‹ค. ์–ด๋ ค์šด ์šฉ์–ด์™€ ๊ฐœ๋…์€ ํ•œ ๋ฒˆ ๋” ํ’€์–ด์“ฐ๊ณ , ๋ณต์žกํ•œ ์„ค๋ช…์€ ๋ˆˆ์— ์ž˜ ๋“ค์–ด์˜ค๋Š” ๊ทธ๋ฆผ์œผ๋กœ ํ’€์–ด๋ƒˆ์Šต๋‹ˆ๋‹ค. ‘ํ˜ผ์ž ๊ณต๋ถ€ํ•ด ๋ณธ’ ์—ฌ๋Ÿฌ ์ž…๋ฌธ์ž์˜ ์ดˆ์‹ฌ๊ณผ ๋ˆˆ๋†’์ด๊ฐ€ ์ฑ… ๊ณณ๊ณณ์— ๋ฐ˜์˜๋œ ๊ฒƒ์ด ์ด ์ฑ…์˜ ๊ฐ€์žฅ ํฐ ์žฅ์ ์ž…๋‹ˆ๋‹ค.\n\n๋ˆ„๊ตฌ๋ฅผ ์œ„ํ•œ ์ฑ…์ธ๊ฐ€์š”?\n\nโ—\t๋ฐ์ดํ„ฐ ๋ถ„์„์„ ์–ด๋–ป๊ฒŒ ์‹œ์ž‘ํ• ์ง€ ๋ง‰๋ง‰ํ•œ ๋น„์ „๊ณต์ž\nโ—\tํŒŒ์ด์ฌ์„ ๋ฐฐ์šด ๋‹ค์Œ ์˜๋ฏธ ์žˆ๋Š” ์‹ค์Šต์„ ํ•ด ๋ณด๊ณ  ์‹ถ์€ ํŒŒ์ด์ฌ ์ž…๋ฌธ์ž\nโ—\tํ”„๋กœ๊ทธ๋ž˜๋ฐ์€ ์•Œ์ง€๋งŒ, ๋ถ„์„(ํ†ต๊ณ„)์— ๋Œ€ํ•œ ์ดํ•ด๊ฐ€ ํ•„์š”ํ•œ ๊ฐœ๋ฐœ์ž\nโ—\t๋ฐ์ดํ„ฐ์—์„œ ์œ ์˜๋ฏธํ•œ ๊ฒฐ๊ณผ๋ฅผ ๋„์ถœํ•ด ์ด๋ฅผ ๊ธฐํš์ด๋‚˜ ๋งˆ์ผ€ํŒ…์— ์ ์šฉํ•ด ๋ณด๊ณ  ์‹ถ์€ ์ง์žฅ์ธ\nโ—\t๋ฐ์ดํ„ฐ ๋ถ„์„๊ฐ€, ๋ฐ์ดํ„ฐ ์‚ฌ์ด์–ธํ‹ฐ์ŠคํŠธ๋ผ๋Š” ์ง์—…์— ๊ด€์‹ฌ ์žˆ๋Š” ๋ชจ๋“  ์‚ฌ๋žŒ'},
  {'title': 'ํŒŒ์ด์ฌ์œผ๋กœ ์‰ฝ๊ฒŒ ๋ฐฐ์šฐ๋Š” ์ž๋ฃŒ๊ตฌ์กฐ (๊ฐœ์ •ํŒ)',
   'link': 'https://search.shopping.naver.com/book/catalog/40595743620',
   'image': 'https://shopping-phinf.pstatic.net/main_4059574/40595743620.20230711115354.jpg',
   'author': '์ตœ์˜๊ทœ^์ฒœ์ธ๊ตญ',
   'discount': '26100',
   'publisher': '์ƒ๋Šฅ์ถœํŒ',
   'pubdate': '20230626',
   'isbn': '9791192932187',
   'description': '์ž๋ฃŒ๊ตฌ์กฐ(data structure)๋Š” ์ปดํ“จํ„ฐ๋กœ ์ฒ˜๋ฆฌํ•  ์ž๋ฃŒ๋“ค์„ ํšจ์œจ์ ์œผ๋กœ ๊ด€๋ฆฌํ•˜๊ณ  ๊ตฌ์กฐํ™”์‹œํ‚ค๊ธฐ ์œ„ํ•œ ํ•™๋ฌธ์œผ๋กœ ์ปดํ“จํ„ฐ ๋ถ„์•ผ์—์„œ ๋งค์šฐ ์ค‘์š”ํ•˜๊ณ  ๊ธฐ์ดˆ์ ์ธ ๊ณผ๋ชฉ์ด๋‹ค. ๊ทธ๋Ÿฌ๋‚˜ ๊ฐœ๋…์˜ ์ดํ•ด์™€ ํ•จ๊ป˜ ์ฝ”๋”ฉ์„ ํ†ตํ•œ ๊ตฌํ˜„ ๋Šฅ๋ ฅ์ด ํ•„์ˆ˜์ ์œผ๋กœ ์š”๊ตฌ๋˜๊ธฐ ๋•Œ๋ฌธ์— ํ•™์ƒ๋“ค์ด ์–ด๋ ค์›Œํ•˜๋Š” ๊ณผ๋ชฉ์ด๊ธฐ๋„ ํ•˜๋‹ค.\n์ด ์ฑ…์€ ์ž…๋ฌธ์ž๋“ค์ด ๋ณด๋‹ค ์‰ฝ๊ณ  ์žฌ๋ฏธ์žˆ๊ฒŒ ์ž๋ฃŒ๊ตฌ์กฐ๋ฅผ ๊ณต๋ถ€ํ•˜๊ณ  ๋‹ค์–‘ํ•œ ๋ฌธ์ œ ํ•ด๊ฒฐ์— ํ™œ์šฉํ•  ์ˆ˜ ์žˆ๋Š” ๋Šฅ๋ ฅ์„ ๊ธฐ๋ฅด๋Š”๋ฐ ์ดˆ์ ์„ ๋งž์ถ”์—ˆ๋‹ค.'}]}

๐Ÿ”ปsample df

import pandas as pd

def get_fields(json_data):
    title = [each["title"] for each in json_data["items"]]
    price = [each["discount"] for each in json_data["items"]]
    publisher = [each["publisher"] for each in json_data["items"]]
    isbn = [each["isbn"] for each in json_data["items"]]
    link = [each["link"] for each in json_data["items"]]
    
    result_pd = pd.DataFrame({
        "title" : title,
        "price" : price,
        "publisher" : publisher,
        "isbn" : isbn,
        'link' : link
    }, columns=["title", "price", "publisher", "isbn", 'link'])
    return result_pd

get_fields(one_result)

๐Ÿ”ป๋Œ€๋Ÿ‰์œผ๋กœ ์ถ”์ถœํ•˜๊ธฐ

# 100๊ฐœ ์”ฉ 10๋ฒˆ ์ถ”์ถœํ•˜๋Š” ๋ฐฉ์‹
# Naver Api๊ฐ€ ํ•œ ๋ฒˆ์— ํ•  ์ˆ˜ ์žˆ๋Š” ๊ฐœ์ˆ˜๊ฐ€ ์ •ํ•ด์ ธ ์žˆ์–ด์„œ concat์œผ๋กœ ๋‹จ์ˆœํžˆ DF๋ฅผ ๋ถ™์—ฌ์ค€๋‹ค
# index๊ฐ€ ๊ผฌ์ด๊ธฐ ๋•Œ๋ฌธ์—, reset_index ๊ณผ์ •์„ ๊ฑฐ์น˜๋Š” ๊ฑด ํ•„์ˆ˜
result_book = []
for n in range(1, 1000, 100):
    url = gen_search_url("book", "ํŒŒ์ด์ฌ", n, 100)
    json_result = get_result_onpage(url)
    pd_result = get_fields(json_result)
    
    result_book.append(pd_result)
    
result_book = pd.concat(result_book)
result_book.reset_index(drop=True, inplace=True)
result_book

๐Ÿ”ป๊ฐ€๊ฒฉ ์ •๋ณด type ์ˆ˜์ •

result_book.info()
>>>>
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 928 entries, 0 to 927
Data columns (total 5 columns):
 #   Column     Non-Null Count  Dtype 
---  ------     --------------  ----- 
 0   title      928 non-null    object
 1   price      928 non-null    object
 2   publisher  928 non-null    object
 3   isbn       928 non-null    object
 4   link       928 non-null    object
dtypes: object(5)
memory usage: 36.4+ KB
result_book['price'] = result_book['price'].astype('float')
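The `discount` field comes back from the API as a string, and a plain `astype('float')` raises if any entry is empty or non-numeric. A more defensive sketch uses `pd.to_numeric` with `errors='coerce'`; the toy frame below is an assumption, the real result may or may not contain such a defect:

```python
import pandas as pd

# Toy frame mimicking the API result; the empty string is a hypothetical defect
df = pd.DataFrame({"price": ["17280", "23400", ""]})

# errors='coerce' turns anything non-numeric into NaN instead of raising
df["price"] = pd.to_numeric(df["price"], errors="coerce")
print(df["price"].isna().sum())  # → 1
```

Any rows coerced to NaN then fall out naturally in the missing-value step later on.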

๐Ÿ”ป์ชฝ ์ˆ˜ ์ •๋ณด ๊ฐ€์ ธ์˜ค๊ธฐ (book page)

import numpy as np
from urllib.request import urlopen
from bs4 import BeautifulSoup

def get_page_num(soup):
    try:
        # the spec field reads like "380쪽"; drop the trailing unit character
        tmp = soup.find_all(class_="bookBasicInfo_spec__qmQ_N")[0].get_text()
        return tmp[:-1]
    except IndexError:
        print("--> Error in get_page_num")
        return np.nan

# quick check on a single catalog page
get_page_num(BeautifulSoup(urlopen(result_book['link'][0]), 'html.parser'))
import time
page_num_col = []
# Pull the urls one by one from result_book's link column
for url in result_book['link']:
    print(url)
    # Get each book's page count via the get_page_num function
    try:
        page_num = get_page_num(BeautifulSoup(urlopen(url), 'html.parser'))
        page_num_col.append(page_num)
    # On any failure, record the row as NaN via try/except
    except:
        print('--> Error')
        page_num_col.append(np.nan)
    print(len(page_num_col))
    time.sleep(0.5)
# ํŽ˜์ด์ง€ ์ˆ˜ append ํ•œ ๋ฐ์ดํ„ฐ๋ฅผ ์ ์šฉ
result_book['page_num'] = page_num_col
result_book['page_num'] = result_book['page_num'].astype('float')

๐Ÿ”ปNaN ๋ฐ์ดํ„ฐ ๋‚˜์˜จ ๋ถ€๋ถ„์„ ๋‹ค์‹œ ๊ตฌํ•ด๋ณด๊ธฐ

for i, r in result_book.iterrows():
    if np.isnan(r['page_num']):
        print(r['link'])
        page_num = get_page_num(BeautifulSoup(urlopen(r['link']), 'html.parser'))
        result_book.loc[i, 'page_num'] = page_num
        time.sleep(0.5)

921๊ฐœ๊ฐ€ ํ•œ๊ณ„..

๐Ÿ“Œ ๊ฒฐ์ธก์น˜ ํ™•์ธ ํ›„ ์ฒ˜๋ฆฌํ•˜๊ธฐ

๐Ÿ”ปraw_data

→ NaN ๋ฐ์ดํ„ฐ์™€ Price ๊ฐ’์ด 0์ธ ๋ฐ์ดํ„ฐ ๋“ฑ ๊ฒฐ์ธก์น˜ ์ฒ˜๋ฆฌ ํ›„ 835๊ฐœ์˜ ํ–‰์ด ๋‚จ์•˜๋‹ค.

๐Ÿ“Œ ์ƒ๊ด€๊ด€๊ณ„ ํ™•์ธ

๐Ÿ”ปํŽ˜์ด์ง€ ์ˆ˜์™€ ๊ฐ€๊ฒฉ๊ณผ์˜ ์ƒ๊ด€๊ด€๊ณ„

→ ์ชฝ์ˆ˜๊ฐ€ ๋งŽ์„ ์ˆ˜๋ก, ๊ฐ€๊ฒฉ์ด ๋†’์•„์ง€๋Š” ๊ด€๊ณ„๋ฅผ ๊ฐ€์ง€๊ณ  ์žˆ๋‹ค.

→ 100,000์› ๋ถ€๊ทผ์— ์ด์ƒ์น˜๊ฐ€ ์žˆ๋Š” ๊ฒƒ ๊ฐ™๋‹ค?..

import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

plt.figure(figsize=(12, 8))
sns.regplot(x='page_num', y='price', data = raw_data)
plt.show()

์ฑ… ๊ฐ€๊ฒฉ์ด 97000์›์ด ๋˜๋Š” ๋ฐ์ดํ„ฐ๊ฐ€ ์žˆ๋‹ค.

๐Ÿ”ป๊ด€๋ จ ์ฑ…์„ ๋งŽ์ด ์ถœํŒํ•œ ์ถœํŒ์‚ฌ์™€ ๊ฐ€๊ฒฉ์˜ ์ƒ๊ด€๊ด€๊ณ„

# ๋งŽ์ด ์ถœํŒํ•œ ์ถœํŒ์‚ฌ์˜ ๋ฐ์ดํ„ฐ๋ฅผ ๊ฐ€์ง€๊ณ  ์ง„ํ–‰
raw_1 = raw_data[raw_data['publisher'] == '์—์ด์ฝ˜์ถœํŒ']

plt.figure(figsize=(12, 8))
sns.regplot(x='page_num', y='price', data=raw_1)
plt.show()

raw_2 = raw_data[raw_data['publisher'] == '한빛미디어']

plt.figure(figsize=(12, 8))
sns.regplot(x='page_num', y='price', data=raw_2)
plt.show()

raw_3 = raw_data[raw_data['publisher'] == '위키북스']

plt.figure(figsize=(12, 8))
sns.regplot(x='page_num', y='price', data=raw_3)
plt.show()

๐Ÿ“Œ train, test ๋ฐ์ดํ„ฐ

๐Ÿ”ป์„ ํ˜• ํšŒ๊ท€ ์ง„ํ–‰ (raw_data)

from sklearn.model_selection import train_test_split

X = raw_data['page_num'].values
y = raw_data['price'].values

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=13)

X_train = X_train.reshape(-1, 1)
X_test = X_test.reshape(-1, 1)
from sklearn.linear_model import LinearRegression

reg = LinearRegression()
reg.fit(X_train, y_train)
from sklearn.metrics import mean_squared_error
pred_tr = reg.predict(X_train)
pred_test = reg.predict(X_test)

# Compute the errors
rmse_tr = np.sqrt(mean_squared_error(y_train, pred_tr))
rmse_test = np.sqrt(mean_squared_error(y_test, pred_test))
print("RMSE of Train Data : ", rmse_tr)
print("RMSE of Test Data : ", rmse_test)
>>>>
RMSE of Train Data :  5488.9696012934755
RMSE of Test Data :  4469.722562719371
plt.scatter(y_test, pred_test)
plt.xlabel('Actual')
plt.ylabel('Predict')
plt.plot([0, 80000],[0, 80000],'r')
plt.show()

๐Ÿ”ป์„ ํ˜• ํšŒ๊ท€ ์ง„ํ–‰ (1๋“ฑ ์ถœํŒ์‚ฌ ๊ธฐ์ค€)

X = raw_1['page_num'].values
y = raw_1['price'].values

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=13)

X_train = X_train.reshape(-1, 1)
X_test = X_test.reshape(-1, 1)

reg.fit(X_train, y_train)

pred_tr = reg.predict(X_train)
pred_test = reg.predict(X_test)

rmse_tr = np.sqrt(mean_squared_error(y_train, pred_tr))
rmse_test = np.sqrt(mean_squared_error(y_test, pred_test))

print("RMSE of Train Data : ", rmse_tr)
print("RMSE of Test Data : ", rmse_test)
>>>>
RMSE of Train Data :  3468.6898753299415
RMSE of Test Data :  4445.502939270577
plt.scatter(y_test, pred_test)
plt.xlabel('Actual')
plt.ylabel('Predict')
plt.plot([0, 80000],[0, 80000],'r')
plt.show()

๐Ÿ”ป๊ฒฐ๋ก ?

→ raw_data ์ „์ฒด๋ฅผ ์ง„ํ–‰ํ–ˆ์„ ๋•Œ ๋ณด๋‹ค ์ถœํŒ์‚ฌ ๋ณ„๋กœ ์ง„ํ–‰ํ•œ ๋ฐ์ดํ„ฐ๊ฐ€ ์ข€ ๋” ์˜ˆ์ธก์„ ์ž˜ ํ•˜๋Š” ๊ฒƒ ๊ฐ™๋‹ค.

raw_data๋Š” ํŠน์ • ๋ถ€๋ถ„์— ๋งŽ์ด ๋ญ‰์ณ ์žˆ๋Š” ๊ฒƒ ๊ฐ™๋‹ค (์•„๋ฌด๋ž˜๋„ ๋ฐ์ดํ„ฐ๊ฐ€ ํ•œ ๊ณณ์— ๋ชฐ๋ ค ์žˆ๋Š” ๊ฒƒ์ด ๋งŽ์œผ๋‹ˆ ๊ทธ๋Ÿฐ ๊ฒƒ ๊ฐ™๋‹ค.)

๋ถ„์•ผ๋ณ„๋กœ ์ ๊ฒ€ํ•˜๋Š” ๊ฒƒ๋„ ํ•ด๋ณด๋ฉด ์ข‹์„ ๋“ฏ?