Study Note (BeautifulSoup basic)

📌 BeautifulSoup - Basic

🚩 pip install beautifulsoup4

  • from bs4 import BeautifulSoup

  • Open the HTML file with open() and store its contents in the page variable.
  • Pass page and "html.parser" to BeautifulSoup to parse the HTML.
  • Use prettify() to print the parsed document with readable indentation.
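
The three steps above can be sketched as follows. The screenshot from the original note is not available, so the sample page, the temporary file, and the names SAMPLE_HTML and sample_path are assumptions made only to keep the example runnable.

```python
import tempfile
from pathlib import Path

from bs4 import BeautifulSoup

# Hypothetical sample page; the note's real HTML file is not shown.
SAMPLE_HTML = """<!DOCTYPE html>
<html>
  <head><title>Sample page</title></head>
  <body>
    <p class="intro">Hello, BeautifulSoup.</p>
  </body>
</html>
"""

# Write the sample to a temporary file so the open() step can be shown.
sample_path = Path(tempfile.mkdtemp()) / "sample.html"
sample_path.write_text(SAMPLE_HTML, encoding="utf-8")

# Open the file and keep its contents in `page`.
with open(sample_path, encoding="utf-8") as f:
    page = f.read()

# Parse the contents with Python's built-in "html.parser".
soup = BeautifulSoup(page, "html.parser")

# prettify() returns the parse tree as an indented string.
pretty = soup.prettify()
print(pretty)
```

With a real file, only the path passed to open() would change.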

🚩 Checking tags

  • Accessing a tag by name without specifying a class (for example soup.p) returns only the first matching tag.

head tag
body tag
p tag
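
A minimal sketch of looking up the head, body, and p tags by name. The HTML string is a hypothetical stand-in for the page used in the note.

```python
from bs4 import BeautifulSoup

# Hypothetical sample page with two <p> tags.
html = """<html>
<head><title>Sample page</title></head>
<body>
<p class="first">First paragraph</p>
<p class="second">Second paragraph</p>
</body>
</html>"""

soup = BeautifulSoup(html, "html.parser")

head_tag = soup.head  # the <head> element
body_tag = soup.body  # the <body> element
first_p = soup.p      # only the first <p>, even though two exist

print(head_tag)
print(first_p)
```

Attribute-style access such as soup.p behaves like find("p"), which is why the second paragraph is not returned.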

📌 find and find_all functions

🚩 find function

  • find("tag_name", class_="class_name")
  • find("tag_name", {"class":"class_name"})

find function
Finds a tag by tag_name and class
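
Both call styles listed above can be tried on a small page. This is a sketch with hypothetical class names (inner-text, outer-text), not the note's original data.

```python
from bs4 import BeautifulSoup

# Hypothetical sample page.
html = """<body>
<p class="inner-text">Inner paragraph</p>
<p class="outer-text">Outer paragraph</p>
</body>"""

soup = BeautifulSoup(html, "html.parser")

# Keyword form: class_ has a trailing underscore because
# `class` is a reserved word in Python.
by_keyword = soup.find("p", class_="outer-text")

# Dictionary form: attribute names are given as plain strings.
by_dict = soup.find("p", {"class": "outer-text"})

# find() returns None when nothing matches.
missing = soup.find("p", class_="no-such-class")

print(by_keyword)
print(by_dict)
print(missing)
```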

🚩 Multiple conditions

Multiple conditions
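
The original screenshot for this step is missing, so the following is only one plausible reading of "multiple conditions": passing several attributes to find() at once. The class names and id values are hypothetical.

```python
from bs4 import BeautifulSoup

# Hypothetical sample page: two <p> tags share the class "first-item".
html = """<body>
<p class="inner-text first-item" id="first">Paragraph A</p>
<p class="inner-text second-item">Paragraph B</p>
<p class="outer-text first-item" id="second">Paragraph C</p>
</body>"""

soup = BeautifulSoup(html, "html.parser")

# With the class alone, the first match (Paragraph A) is returned.
class_only = soup.find("p", class_="first-item")

# Adding an id condition narrows the search to Paragraph C.
with_dict = soup.find("p", {"class": "first-item", "id": "second"})

# The same conditions written as keyword arguments.
with_keywords = soup.find("p", class_="first-item", id="second")

print(class_only.get_text())
print(with_dict.get_text())
```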

🚩 Getting data with text, get_text(), string, and strip()

text, get_text, strip()
text, string, get_text()

  • Use a for loop to get the text value of each tag
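
A short sketch comparing text, get_text(), string, and strip(), followed by the for loop mentioned above. The sample HTML is hypothetical, and find_all("p") is assumed as the source of the tags being looped over.

```python
from bs4 import BeautifulSoup

# Hypothetical sample page: one plain <p> and one <p> with a nested <a>.
html = """<body>
<p class="plain">
  Happy Data Science.
</p>
<p class="nested">Visit <a href="https://example.com">Example</a></p>
</body>"""

soup = BeautifulSoup(html, "html.parser")
plain = soup.find("p", class_="plain")
nested = soup.find("p", class_="nested")

# text and get_text() give the same string, whitespace included.
raw_text = plain.text
same_text = plain.get_text()

# strip() is the ordinary str method that trims surrounding whitespace.
clean_text = plain.text.strip()

# string works only when the tag has a single child;
# for the nested tag it is None, while get_text() joins all pieces.
nested_string = nested.string
nested_text = nested.get_text()

# Collect the cleaned text of every <p> with a for loop.
texts = []
for p in soup.find_all("p"):
    texts.append(p.get_text().strip())

print(clean_text)
print(nested_string)
print(texts)
```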

🚩 find_all function

  • Returns all matching tags as a list
  • find_all("tag_name", class_="class_name")
  • find_all("tag_name", {"class":"class_name"})
  • find_all("tag_name", "class_name")

Returns the p tags in the HTML as a list
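
The three call styles for find_all can be compared side by side. This is a sketch on a hypothetical page; the class names are made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical sample page with three <p> tags.
html = """<body>
<p class="inner-text">One</p>
<p class="inner-text">Two</p>
<p class="outer-text">Three</p>
</body>"""

soup = BeautifulSoup(html, "html.parser")

# Every <p> tag, returned as a list-like ResultSet.
all_p = soup.find_all("p")

# Three equivalent ways to filter by class.
inner_keyword = soup.find_all("p", class_="inner-text")
inner_dict = soup.find_all("p", {"class": "inner-text"})
inner_positional = soup.find_all("p", "inner-text")  # a bare string is treated as a class

# Results can be indexed and iterated like a normal list.
print(len(all_p))
print(inner_keyword[0].get_text())
```

Because the result is a list, an empty list (not None) comes back when nothing matches.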

🚩 Getting link values

  • Get data from tags that contain hyperlinks.
  • [index] + get("href") / [index] + ["href"]
    • Index into the result list, then use get("href") or ["href"] to read the link value.

[index] + get("href") / [index] + ["href"]

  • Extract the values in links with a for loop
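
A minimal sketch of reading href values, assuming links holds the result of find_all("a"). The URLs and link text are hypothetical placeholders.

```python
from bs4 import BeautifulSoup

# Hypothetical sample page with two hyperlinks.
html = """<body>
<a href="https://example.com/first" id="first-link">First</a>
<a href="https://example.com/second" id="second-link">Second</a>
</body>"""

soup = BeautifulSoup(html, "html.parser")
links = soup.find_all("a")

# Index into the list, then read the attribute with get() ...
first_href = links[0].get("href")

# ... or with dictionary-style access.
second_href = links[1]["href"]

# get() returns None for a missing attribute,
# whereas links[0]["title"] would raise KeyError.
no_title = links[0].get("title")

# Extract every link with a for loop.
hrefs = []
for link in links:
    hrefs.append(link.get("href"))
    print(link.get_text(), "->", link.get("href"))
```

get() is the safer choice when an attribute may be absent; the bracket form is shorter when the attribute is known to exist.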