
Study Note (BeautifulSoup Web data 1)

KloudHyun 2023. 8. 11. 19:50

📌 BeautifulSoup - Web data 1

🚩 Target data (exchange rates)

  • Country name
  • Current exchange rate
  • Amount of change
  • Direction of change (up or down)

🚩 Using urllib and BeautifulSoup

Remember urlopen and BeautifulSoup

  • Pass the URL of the page to scrape to urlopen, then hand the response to BeautifulSoup.
  • With the html.parser parser and the prettify method, the fetched web data can be viewed as neatly indented HTML.
  • Page that holds the target indicators:
 

https://finance.naver.com/marketindex/


  • Check the status attribute of the response; if it is 200, the web data can be used.

HTTP status code: confirmed that 200 is printed
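The steps above can be sketched roughly as follows. This is a minimal sketch, not the code from the original screenshots: the helper name fetch_soup and the sample_html fragment are made up for illustration, and the fragment only stands in for the real page so that the parsing part runs offline.

```python
from urllib.request import urlopen

from bs4 import BeautifulSoup


def fetch_soup(url: str) -> BeautifulSoup:
    """Download a page with urlopen and parse it with html.parser.

    fetch_soup is a hypothetical helper name used only in this sketch.
    """
    with urlopen(url) as response:
        # status is an attribute of the response object; 200 means OK.
        if response.status != 200:
            raise RuntimeError(f"unexpected HTTP status: {response.status}")
        return BeautifulSoup(response.read(), "html.parser")


# Offline demonstration on a hand-written fragment (an assumption, not the real page).
sample_html = "<div><span class='value'>1,327.50</span></div>"
soup = BeautifulSoup(sample_html, "html.parser")

pretty = soup.prettify()  # returns the parsed HTML as an indented string
print(pretty)
```

Against the live site this would be called as fetch_soup("https://finance.naver.com/marketindex/"), which needs network access.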

 


🚩 Finding specific values

  • find_all("tag_name", "class_name")
  • find_all("tag_name", class_= "class_name")
  • find_all("tag_name", {"class" : "class_name"})

Check the tag_name and class_name of the financial data you want to find, then pass them to find_all in any of the three forms above.
Use [index] with text, string, or get_text() to pull the value out of a match.
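To see that the three find_all forms return the same matches, here is a small sketch on a hand-written HTML fragment. The fragment and the variable names are assumptions for illustration; only the span tag and the value and change classes come from the note.

```python
from bs4 import BeautifulSoup

# Hand-written fragment (an assumption) with two rate blocks.
sample_html = """
<div class="head_info">
  <span class="value">1,327.50</span>
  <span class="change">12.50</span>
</div>
<div class="head_info">
  <span class="value">918.08</span>
  <span class="change">7.01</span>
</div>
"""
soup = BeautifulSoup(sample_html, "html.parser")

# Three equivalent ways to ask for <span class="value"> elements.
by_position = soup.find_all("span", "value")
by_keyword = soup.find_all("span", class_="value")
by_dict = soup.find_all("span", {"class": "value"})

# find_all returns a list-like result, so pick one match by index first.
first = by_position[0]
print(first.text, first.string, first.get_text())  # 1,327.50 1,327.50 1,327.50
```

For a tag that contains only one piece of text, text, string, and get_text() all give the same value. They differ once a tag has several children: string becomes None, while text and get_text() join the text of all descendants.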

🚩 Using requests and BeautifulSoup

Call requests.get(url) and check the response.
Use status_code and confirm that 200 is printed.
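The requests version of the same fetch might look like the sketch below. The helper name fetch_soup and the timeout value are assumptions for illustration. Compared with urlopen, the status lives in status_code and the decoded body in text.

```python
import requests
from bs4 import BeautifulSoup


def fetch_soup(url: str, timeout: float = 10.0) -> BeautifulSoup:
    """Fetch a page with requests and parse it with html.parser.

    fetch_soup and the default timeout are hypothetical choices for this sketch.
    """
    response = requests.get(url, timeout=timeout)
    # status_code is an int; requests.codes.ok is the named constant for 200.
    if response.status_code != requests.codes.ok:
        raise RuntimeError(f"unexpected HTTP status: {response.status_code}")
    return BeautifulSoup(response.text, "html.parser")
```

It would be called as soup = fetch_soup("https://finance.naver.com/marketindex/"), which needs network access.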

🚩 Finding specific values

  • Use the select method to find specific values.
    • '#' and '.' must be clearly distinguished.
    • '#' = id, '.' = class

  • select("#id_name"), select(".class_name")
  • select_one("#id_name"), select_one(".class_name")
  • The angle bracket (>) selects direct children of the element on its left.

  • Getting the value at index 0
  • Use '.' (class) to return the matching data values.
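A short sketch of the selector rules above, run on a hand-written fragment. The fragment is an assumption; the id exchangeList is borrowed from the variable name used later in this note and may not match the real page.

```python
from bs4 import BeautifulSoup

# Hand-written fragment (an assumption) with one id and a repeated class.
sample_html = """
<ul id="exchangeList">
  <li><span class="value">1,327.50</span></li>
  <li><span class="value">918.08</span></li>
</ul>
"""
soup = BeautifulSoup(sample_html, "html.parser")

items = soup.select("#exchangeList > li")  # '#' = id, '>' = direct children
values = soup.select(".value")             # '.' = class, returns a list of all matches
first_value = soup.select_one(".value")    # only the first match, or None if nothing matches

print(len(items))        # 2
print(values[0].text)    # 1,327.50
print(first_value.text)  # 1,327.50
```

select always returns a list, so it needs an index such as [0]; select_one returns a single element directly.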

※ What does > mean in the updown selector? ※

  • The updown value has class_name = blind.
  • However, select_one returns the first match, so select_one(".blind") prints 미국 USD (the currency title) instead.
  • So the selector is anchored to the head_info class on the div tag: look for blind directly under head_info.
    • There is another blind under the txt_krw class on a span tag, but > matches only direct children of head_info,
      so the blind value 상승 ("up") is printed.

Value at index 0
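The difference can be reproduced on a hand-written item that mimics the structure described above. This is a sketch under assumptions: the exact markup of the real page is not shown in this note, and the text 원 inside txt_krw is a guess used only to make the nested blind visible.

```python
from bs4 import BeautifulSoup

# Hand-written item (an assumption) with three elements of class "blind".
item_html = """
<li>
  <a href="/marketindex/exchangeDetail.naver?marketindexCd=FX_USDKRW">
    <h3 class="h_lst"><span class="blind">미국 USD</span></h3>
    <div class="head_info">
      <span class="value">1,327.50</span>
      <span class="txt_krw"><span class="blind">원</span></span>
      <span class="change">12.50</span>
      <span class="blind">상승</span>
    </div>
  </a>
</li>
"""
item = BeautifulSoup(item_html, "html.parser")

# First .blind in document order: the title, not the direction.
first_blind = item.select_one(".blind").text

# A space means "any descendant", so the blind nested in txt_krw is included.
nested = [tag.text for tag in item.select(".head_info .blind")]

# '>' means "direct child", so only the blind sitting right under head_info matches.
direct = item.select_one(".head_info > .blind").text

print(first_blind)  # 미국 USD
print(nested)       # ['원', '상승']
print(direct)       # 상승
```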

🚩 Converting the found values into a DataFrame

A for loop collects each item's values so they can be turned into a DataFrame.

# Collect 4 items of data per currency
import pandas as pd

exchange_datas = []  # empty list that will later become the DataFrame
baseUrl = "https://finance.naver.com"  # prefix used to build the full url

for item in exchangeList:  # take the items out of exchangeList one at a time
    data = {
        "title": item.select_one(".h_lst").text,  # name
        "exchange": item.select_one(".value").text,  # current rate
        "change": item.select_one(".change").text,  # amount of change
        "updown": item.select_one(".head_info > .blind").text,  # up or down
        "link": baseUrl + item.select_one("a").get("href"),  # url
    }
    print(data)
    exchange_datas.append(data)
exchange_datas
df = pd.DataFrame(exchange_datas)
df.to_excel("./naverfinance.xlsx")  # save as an excel file (to_excel no longer accepts encoding in pandas 2.x)

>>>>
{'title': '미국 USD', 'exchange': '1,327.50', 'change': '12.50', 'updown': '상승', 'link': 'https://finance.naver.com/marketindex/exchangeDetail.naver?marketindexCd=FX_USDKRW'}
{'title': '일본 JPY(100엔)', 'exchange': '918.08', 'change': '7.01', 'updown': '상승', 'link': 'https://finance.naver.com/marketindex/exchangeDetail.naver?marketindexCd=FX_JPYKRW'}
{'title': '유럽연합 EUR', 'exchange': '1,457.33', 'change': '8.07', 'updown': '상승', 'link': 'https://finance.naver.com/marketindex/exchangeDetail.naver?marketindexCd=FX_EURKRW'}
{'title': '중국 CNY', 'exchange': '182.95', 'change': '1.11', 'updown': '상승', 'link': 'https://finance.naver.com/marketindex/exchangeDetail.naver?marketindexCd=FX_CNYKRW'}
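
Each printed dict becomes one row of the DataFrame, with the dict keys as column names. The sketch below rebuilds a small DataFrame from two of the records printed above (link column left out for brevity). The numeric conversion at the end is an optional extra that is not part of the original note; it is included only because the scraped rates are strings such as '1,327.50'.

```python
import pandas as pd

# Two of the records printed above, copied by hand.
exchange_datas = [
    {"title": "미국 USD", "exchange": "1,327.50", "change": "12.50", "updown": "상승"},
    {"title": "일본 JPY(100엔)", "exchange": "918.08", "change": "7.01", "updown": "상승"},
]

df = pd.DataFrame(exchange_datas)  # list of dicts -> one row per dict
print(df.shape)          # (2, 4)
print(list(df.columns))  # ['title', 'exchange', 'change', 'updown']

# Optional: strip the thousands separator and convert the rate to float.
df["exchange"] = df["exchange"].str.replace(",", "", regex=False).astype(float)
print(df["exchange"].tolist())  # [1327.5, 918.08]
```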