
Study Notes (BeautifulSoup + selenium)

KloudHyun 2023. 8. 14. 01:40

📌 BeautifulSoup + selenium

https://www.opinet.co.kr/searRgSelect.do (Opinet: find cheap gas stations)

🚩 Target data

  • Brand
  • Price
  • Self-service or not
  • Location

🚩 Accessing the site with selenium

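### Access the site with selenium
from selenium import webdriver
from selenium.webdriver.common.by import By  # needed by the find_element(By.ID, ...) calls further down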
url = 'https://www.opinet.co.kr/searRgSelect.do'
driver = webdriver.Chrome()
driver.get(url)

🚩 Switching to the popup window, closing it, and re-requesting the page

# Switch to the popup window and close it
import time

url = 'https://www.opinet.co.kr/searRgSelect.do'
driver = webdriver.Chrome()
driver.get(url)
time.sleep(5)  # give the popup time to open

# Switch to the popup (the most recently opened window)
# Selenium 4 removed switch_to_window(); use switch_to.window()
driver.switch_to.window(driver.window_handles[-1])

# Close the popup
driver.close()

# Switch back to the main window
driver.switch_to.window(driver.window_handles[-1])

# Request the target url again
driver.get(url)
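
The close-and-switch steps above assume there is exactly one popup and that it is the last handle. A minimal sketch of a more general helper follows. The names `close_popups` and `main_handle` are hypothetical and not part of the original notes; the function relies only on the driver's `window_handles`, `switch_to.window`, and `close`.

```python
def close_popups(driver, main_handle):
    """Close every window except main_handle and return how many were closed."""
    closed = 0
    # Iterate over a snapshot, because closing a window changes window_handles.
    for handle in list(driver.window_handles):
        if handle != main_handle:
            driver.switch_to.window(handle)  # focus the popup
            driver.close()                   # close the focused window
            closed += 1
    # Return focus to the main window so later find_element calls work.
    driver.switch_to.window(main_handle)
    return closed
```

With a live driver this might be called as `close_popups(driver, driver.current_window_handle)` right after the first `driver.get(url)`, assuming the main window still has focus at that point.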

🚩 Getting the city/province (시/도) names

  • The site has a dropdown for choosing a city or province
  • Inspect the dropdown's source with the browser developer tools

# Region: city/province

sido_list_raw = driver.find_element(By.ID, 'SIDO_NM0')
sido_list_raw
>>>>
<selenium.webdriver.remote.webelement.WebElement (session="879ddded5355d220345ff8dcdf185802", element="868E43D9EDC6FA4D08A403C8D0EDD5E2_element_124")>
  • The printed result is only a WebElement object, but checking its .text confirms that the data is there

sido_list = sido_list_raw.find_elements(By.TAG_NAME, 'option')
# Collect the option tags inside SIDO_NM0
sido_list[1].text
# Check the element at index 1

>>>>
'서울'

sido_list[1].get_attribute("value")
# The option tag is displayed as '서울'; get_attribute reads its "value" attribute (an attribute, not a class)
>>>>
'서울특별시'
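
The difference between `.text` and the `value` attribute can be reproduced without a browser. The sketch below uses only the standard library. `SAMPLE` is a hypothetical fragment modelled on the values printed above: the real page markup is not shown in these notes, and the placeholder text of the first option is an assumption.

```python
from html.parser import HTMLParser

# Hypothetical markup; only '서울' / '서울특별시' come from the output above.
SAMPLE = (
    '<select id="SIDO_NM0">'
    '<option value="">시/도</option>'
    '<option value="서울특별시">서울</option>'
    '</select>'
)

class OptionParser(HTMLParser):
    """Collect (value attribute, visible text) pairs for each option tag."""

    def __init__(self):
        super().__init__()
        self.options = []
        self._value = None  # value of the option currently being read

    def handle_starttag(self, tag, attrs):
        if tag == "option":
            self._value = dict(attrs).get("value", "")

    def handle_data(self, data):
        if self._value is not None:
            self.options.append((self._value, data.strip()))
            self._value = None

parser = OptionParser()
parser.feed(SAMPLE)
print(parser.options)  # [('', '시/도'), ('서울특별시', '서울')]
```

The visible text is what `.text` returns in selenium, while `get_attribute("value")` returns the first item of each pair.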

🚩 Building a list of city/province names

# 1
sido_names = []  # start with an empty list

for option in sido_list:  # iterate over the city/province options
    sido_name = option.get_attribute('value')  # read each option's value
    sido_names.append(sido_name)  # append it to sido_names
sido_names  # show the result
>>>>
['',
 '서울특별시',
 '부산광역시',
 '대구광역시',
 '인천광역시',
 '광주광역시',
 '대전광역시',
 '울산광역시',
 '세종특별자치시',
 '경기도',
 '충청북도',
 '충청남도',
 '전라북도',
 '전라남도',
 '경상북도',
 '경상남도',
 '제주특별자치도',
 '강원특별자치도']
# 2
sido_names = [option.get_attribute('value') for option in sido_list]
>>>>
['',
 '서울특별시',
 '부산광역시',
 '대구광역시',
 '인천광역시',
 '광주광역시',
 '대전광역시',
 '울산광역시',
 '세종특별자치시',
 '경기도',
 '충청북도',
 '충청남도',
 '전라북도',
 '전라남도',
 '경상북도',
 '경상남도',
 '제주특별자치도',
 '강원특별자치도']
sido_names = sido_names[1:]  # drop the empty first entry
sido_names
>>>>
['서울특별시',
 '부산광역시',
 '대구광역시',
 '인천광역시',
 '광주광역시',
 '대전광역시',
 '울산광역시',
 '세종특별자치시',
 '경기도',
 '충청북도',
 '충청남도',
 '전라북도',
 '전라남도',
 '경상북도',
 '경상남도',
 '제주특별자치도',
 '강원특별자치도']
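
Slicing with `[1:]` works here because the only empty value is the first one. As a hedged alternative, the sketch below filters out empty values wherever they appear. `option_values` is a hypothetical helper name; it only assumes each element has a `get_attribute` method, as selenium's WebElement does.

```python
def option_values(options):
    """Return the non-empty 'value' attributes of the given option elements."""
    values = (option.get_attribute("value") for option in options)
    # Keep only truthy values, which drops '' and None placeholders.
    return [value for value in values if value]
```

With the objects from these notes, `option_values(sido_list)` should give the same list as the sliced `sido_names`, assuming the placeholder is the only entry without a value.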

🚩 Sending a city/province name as keys

sido_list_raw.send_keys(sido_names[0])

🚩 Getting the district (구) names

# District (gu)

gu_list_law = driver.find_element(By.ID, 'SIGUNGU_NM0')
gu_list_law.text

Check the ID name

  • As with the city/province names, use get_attribute
gu_list = gu_list_law.find_elements(By.TAG_NAME, 'option')
gu_list[1].text, len(gu_list)
>>>>
('강남구', 26)
gu_list[1].get_attribute('value')
>>>>
'강남구'

🚩 Turning the district names into a list

gu_names = [option.get_attribute('value') for option in gu_list]
gu_names
>>>>
['',
 '강남구',
 '강동구',
 '강북구',
 '강서구',
 '관악구',
 '광진구',
 '구로구',
 '금천구',
 '노원구',
 '도봉구',
 '동대문구',
 '동작구',
 '마포구',
 '서대문구',
 '서초구',
 '성동구',
 '성북구',
 '송파구',
 '양천구',
 '영등포구',
 '용산구',
 '은평구',
 '종로구',
 '중구',
 '중랑구']
gu_names = gu_names[1:]
gu_names
>>>>
['강남구',
 '강동구',
 '강북구',
 '강서구',
 '관악구',
 '광진구',
 '구로구',
 '금천구',
 '노원구',
 '도봉구',
 '동대문구',
 '동작구',
 '마포구',
 '서대문구',
 '서초구',
 '성동구',
 '성북구',
 '송파구',
 '양천구',
 '영등포구',
 '용산구',
 '은평구',
 '종로구',
 '중구',
 '중랑구']

🚩 Sending a district name as keys

gu_list_law.send_keys(gu_names[3])

 

🚩 Saving the data to Excel

  • Find the Excel save button and check its html source

# Save to Excel

driver.find_element(By.ID, 'glopopd_excel').click()  # click() returns None, so the result is not assigned
  • Using a for loop
import time
from tqdm.notebook import tqdm  # tqdm_notebook is deprecated in recent tqdm releases

for gu in tqdm(gu_names):  # iterate over gu_names
    element = driver.find_element(By.ID, 'SIGUNGU_NM0')  # locate the district select box
    element.send_keys(gu)  # choose one district at a time
    time.sleep(3)  # wait 3 seconds

    driver.find_element(By.ID, 'glopopd_excel').click()  # click the Excel save button
    time.sleep(3)  # wait 3 seconds

# Runs until every entry in gu_names has been processed
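
The fixed `time.sleep(3)` after each click assumes every download finishes within three seconds. A minimal sketch of a polling alternative follows, using only the standard library. The helper name `wait_for_new_file` and its parameters are hypothetical, and the download folder is an assumption, since the notes do not say where the files are saved. Chrome marks in-progress downloads with a `.crdownload` suffix, so those files are skipped.

```python
import time
from pathlib import Path

def wait_for_new_file(folder, before, timeout=30.0, poll=0.5):
    """Return a finished file that is in folder but not in `before`, or None on timeout."""
    deadline = time.monotonic() + timeout
    while True:
        current = {p for p in Path(folder).iterdir() if p.is_file()}
        # Ignore Chrome's temporary files for downloads still in progress.
        finished = sorted(p for p in current - before if p.suffix != ".crdownload")
        if finished:
            return finished[0]
        if time.monotonic() >= deadline:
            return None
        time.sleep(poll)
```

Inside the loop, one might snapshot the folder before the click with `before = {p for p in Path(download_dir).iterdir() if p.is_file()}` and call `wait_for_new_file(download_dir, before)` afterwards, where `download_dir` is whatever folder the browser is configured to save into.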

🚩 Checking the saved data