Selenium으로 로딩한 페이지의 html 파싱 - BeautifulSoup

저번 포스팅에 이어서 원하는 탭으로 이동하였고 원하는 정보를 수집할 수 있게 되었다. 이번에는 셀레니움을 이용하여 로딩한 페이지의 html을 추출하고 beautifulsoup 사용 방법을 간단하게 정리하였다.

1. Beautifulsoup 설치

먼저 html을 크롤링하고 파싱하기 위해서는 BeautifulSoup 패키지가 필요하다. 아나콘다를 사용하면 이미 설치되어 있지만 아니라면 pip 를 이용하여 패키지를 설치해야한다.

$ pip install bs4

2. selenium으로 로딩한 페이지의 HTML 파싱

먼저 selenium을 통해 해당 페이지의 html 소스를 가져온다. 이 html 소스를 beautifulsoup에 넣어주면 태그명, css 클래스명, id, css selector 등으로 html 파싱이 가능하다.

# 페이지 소스 가져오기
html = driver.page_source

# soup에 넣어주기
soup = BeautifulSoup(html, 'html.parser')
print(soup.text)

3. BeautifulSoup 시작하기

예시 HTML 구조로 BeautifulSoup 문서에 있는 html 코드를 이용하였다.

from bs4 import BeautifulSoup

html = """<html><head><title>The Dormouse's story</title></head>
<body>
<p class="title"><b>The Dormouse's story</b></p>
<p class="story">Once upon a time there were three little sisters; and their names were

<a href="http://example.com/elsie" class="sister" id="link1">Elsie</a>,
<a href="http://example.com/lacie" class="sister" id="link2">Lacie</a> and
<a href="http://example.com/tillie" class="sister" id="link3">Tillie</a>and they lived at the bottom of a well.</p>

<p class="story">...</p>"""

soup = BeautifulSoup(html, 'html.parser')

html 파일을 예쁘게 보고싶다면 아래의 함수를 이용하여 출력 가능하다.

print(soup.prettify())

soup을 이용한 html 구조는 트리 구조를 띈다. 따라서 soup을 이용하여 다음과 같은 접근이 가능하다.

# soup 내의 text 모두 추출
print(soup.get_text())
# "The Dormouse's story\n\nThe Dormouse's story\nOnce upon a time there were three little sisters; and their names were\nElsie,\nLacie and\nTillie;\nand they lived at the bottom of a well.\n..."

soup.title
# <title>The Dormouse's story</title>

soup.title.name
# 'title'

soup.title.string
# "The Dormouse's story"

4. BeautifulSoup 태그 파싱

위의 html 구조 예시에서 a 태그를 모두 수집하고 싶다면 위처럼 단순 트리구조가 아닌 find_all()이라는 함수를 이용하면 된다. find_all() 함수는 해당 조건에 맞는 모든 값들(여러개)을 리스트로 반환해준다. 맨 처음 1개의 값만 받고 싶다면 find() 함수를 사용하면 된다. find_all() 사용방법은 아래와 같다.

# 단순히 tree 구조로 검색하면 맨 위에 있는 a 태그만 추출할 수 있다.
soup.a
# <a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>


# 태그명으로 찾기(a 태그 모두 수집)
soup.find_all('a')
#[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
# <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

css 클래스 명과 태그 id로도 파싱이 가능하다. python의 class와 구분하기 위해서 파라미터는 class_ 를 사용한다.

# css class 명으로 찾기(class 이름이 sister인 태그 모두 수집)
soup.find_all(class_='sister')
#[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
# <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

# a 태그의 css class 이름이 'sister'인 태그 모두 수집
soup.find_all('a', class_='sister')
#[<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
# <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

# id 가 link1인 태그 수집
soup.find_all(id='link1')
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]

css selector를 이용한 파싱도 가능하다. find_all() 함수 대신 select()라는 함수를 사용한다. css selector 값은 크롬 개발자 도구를 이용하면 쉽게 값을 알아낼 수 있다.(태그 select -> 마우스 오른쪽 버튼 -> copy -> copy selector)

soup.select('title')
# [<title>The Dormouse's story</title>]

# 3번째 p 태그
soup.select('p:nth-of-type(3)')
# [<p class="story">...</p>]

# body 태그 안의 a 태그 모두
soup.select('body p')
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
# <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

soup.select('body > a')
# [] -> '>'를 사용할 때는 태그 순서를 고려해야한다.
# 반드시 '>' 양 옆에는 띄어쓰기를 해야한다. (body>a 는 올바른 결과를 찾을 수 없음)

# p 태그 안의 a 태그 모두
soup.select('p > a')
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
# <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

# p 태그 안에 id가 link1인 태그
soup.select('p > #link1')
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]

# id가 link1과 link2인 태그 모두
soup.select('#link1, #link2')
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]

# a태그이고 id가 link1인 태그
soup.select('a#link1')
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>]

# css 클래스가 sister인 태그 모두(두 코드 결과 같음)
soup.select('.sister')
soup.select('[class~=sister]')
# [<a class="sister" href="http://example.com/elsie" id="link1">Elsie</a>,
# <a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

# id가 link1이고 css 클래스가 sister인 태그의 자매(sibilings) 태그들 (같은 계층)
# '~' 양 옆에 띄어쓰기 필수
soup.select('#link1 ~ .sister')
# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>,
# <a class="sister" href="http://example.com/tillie" id="link3">Tillie</a>]

# id가 link1이고 css 클래스가 sister인 태그의 바로 옆의 자매(sibling) 태그
# '+' 양 옆에 띄어쓰기 필수
soup.select('#link1 + .sister')
# [<a class="sister" href="http://example.com/lacie" id="link2">Lacie</a>]

이외에도 find_next_siblings(), find_next_sibling(), find_previous_siblings(), find_previous_sibling() 같은 탐색 함수와 다양한 기능의 파싱 함수들도 제공한다. 위의 내용은 beautifulsoup를 다루기 위한 기본적인 내용들이며 자세한 사항들은 beautifulsoup 공식 문서를 참고하면 예시와 함께 자세히 설명되어 있다.

reference

https://kiddwannabe.blog.me/221288079822
https://www.crummy.com/software/BeautifulSoup/bs4/doc/