Last Updated on March 27, 2023 by mishou

I. Beautiful Soup

Beautiful Soup is a Python library for pulling data out of HTML and XML files. It works with your favorite parser to provide idiomatic ways of navigating, searching, and modifying the parse tree. It commonly saves programmers hours or days of work.

https://github.com/wention/BeautifulSoup4

II. Selenium

Beautiful Soup is a library for static scraping. Static scraping ignores JavaScript: it sees only the HTML that the server returns. If you need data in components that are rendered only after JavaScript runs (for example, after clicking a JavaScript link), you should use Selenium alongside Beautiful Soup.
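To make the limitation concrete, here is a minimal sketch. The HTML string is made up for illustration and is not taken from any real site. Beautiful Soup parses the markup as delivered, so the list item that the script would add in a browser never appears in the parse tree.

```python
from bs4 import BeautifulSoup

# Hypothetical page: the list is empty in the HTML and filled in by a script.
STATIC_HTML = """
<html><body>
  <h1 id="title">Book list</h1>
  <ul id="books"></ul>
  <script>
    document.getElementById("books").innerHTML = "<li>Loaded by JavaScript</li>";
  </script>
</body></html>
"""

soup = BeautifulSoup(STATIC_HTML, "html.parser")

# The heading is part of the static markup, so it is found.
title = soup.find(id="title").get_text(strip=True)

# The script is never executed, so the list has no <li> children.
items = soup.find(id="books").find_all("li")

print(title)       # Book list
print(len(items))  # 0
```

A browser driven by Selenium would run the script first, and the resulting page source would then contain the list item.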

III. Creating a virtual environment with Pipenv

Create a virtual environment with Pipenv and install the packages used in this post:

mkdir scraping && cd scraping
pipenv install --python 3.9
pipenv shell
pipenv install selenium
pipenv install pandas
pipenv install beautifulsoup4

You can learn about Pipenv from my previous post.
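Once the environment is active, it may help to confirm that the modules imported later in this post can be found. The following is a small sketch using only the standard library; the helper name missing_modules is my own and not part of Pipenv.

```python
import importlib.util

# Import names used later in this post (bs4 is provided by beautifulsoup4).
REQUIRED_MODULES = ["selenium", "pandas", "bs4", "chromedriver_py"]


def missing_modules(module_names):
    """Return the module names that cannot be found in the current environment."""
    return [name for name in module_names if importlib.util.find_spec(name) is None]


missing = missing_modules(REQUIRED_MODULES)
if missing:
    print("Not installed:", ", ".join(missing))
else:
    print("All required modules are importable.")
```

Anything reported as not installed can be added with pipenv install.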

IV. Setting up ChromeDriver

ChromeDriver is a separate executable that Selenium WebDriver uses to control Chrome.

First I checked my Chromium version. In Brave Browser, it is shown on the About Brave page:

[Screenshot: Chromium version shown in Brave Browser]

You can download the ChromeDriver binary for your platform from the ChromeDriver downloads page. You can also use a library called chromedriver-py:

pipenv install chromedriver-py

You can learn more here.
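The reason for checking the browser version is that, as far as I know, a ChromeDriver release is meant to be used with the browser release that has the same major version. Here is a rough sketch of that comparison. The version strings are placeholder values, not the ones from my machine, and the helper names are my own.

```python
def major_version(version: str) -> int:
    """Return the leading number of a dotted version string such as '111.0.5563.110'."""
    return int(version.strip().split(".")[0])


def versions_compatible(browser_version: str, driver_version: str) -> bool:
    """Treat browser and driver as compatible when their major versions match."""
    return major_version(browser_version) == major_version(driver_version)


# Placeholder version strings for illustration only.
print(versions_compatible("111.0.5563.110", "111.0.5563.64"))  # True
print(versions_compatible("111.0.5563.110", "110.0.5481.77"))  # False
```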

I opened Python by running the command:

python

Then I ran the following Python code, which is shown on the page linked above:

from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from chromedriver_py import binary_path # this will get you the path variable
service_object = Service(binary_path)
driver = webdriver.Chrome(service=service_object)

But I encountered an error. It said:

raise exception_class(message, screen, stacktrace) selenium.common.exceptions.WebDriverException: Message: unknown error: cannot find Chrome binary

This means that ChromeDriver was unable to find the Chrome binary in the default location. On Linux, it expects google-chrome to be located at /usr/bin/google-chrome.
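Before changing any Selenium settings, you can check which browser binaries actually exist on the machine. This is a minimal sketch using only the standard library. The function name is my own, and the candidate list is an assumption: only /usr/bin/google-chrome and /usr/bin/brave come from this post, while brave-browser and chromium are guesses at names that may exist on other systems.

```python
import os
import shutil


def find_browser_binary(candidates):
    """Return the first candidate that resolves to an executable file, or None."""
    for candidate in candidates:
        if os.path.isabs(candidate):
            # Absolute paths are checked directly.
            if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
                return candidate
        else:
            # Bare command names are looked up on PATH.
            found = shutil.which(candidate)
            if found:
                return found
    return None


# Assumed candidates; adjust for your own system.
CANDIDATES = ["/usr/bin/google-chrome", "/usr/bin/brave", "brave-browser", "chromium"]
print(find_browser_binary(CANDIDATES))
```

If the function prints None, none of the candidates is installed at those locations, and the error above is to be expected.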

As I mentioned earlier, I use Brave Browser, and its executable is installed at /usr/bin/brave on Linux. I overrode the default Chrome binary location as follows:

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service
from chromedriver_py import binary_path # path to the ChromeDriver binary
# point Selenium at the Brave binary instead of the default Chrome location
options = Options()
options.binary_location = "/usr/bin/brave"
# pass the driver path via Service and the browser settings via options
# (the older chrome_options and executable_path arguments are deprecated in Selenium 4)
service_object = Service(binary_path)
driver = webdriver.Chrome(service=service_object, options=options)
driver.get('http://google.com/')
print("Chrome Browser Invoked")
driver.quit()

You can learn more here.

V. Using Selenium and Beautiful Soup

I was able to retrieve the titles and authors from the Amazon website without using any XPath. First I opened the page in Brave Browser (not in headless mode) by running the following code:

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.chrome.service import Service
from selenium.webdriver.common.by import By
from chromedriver_py import binary_path # path to the ChromeDriver binary
import pandas as pd
from bs4 import BeautifulSoup
# setting up webdriver
options = Options()
options.binary_location = "/usr/bin/brave"
service_object = Service(binary_path)
driver = webdriver.Chrome(service=service_object, options=options)
# access the page with the browser
driver.get('https://www.amazon.com/b?ie=UTF8&node=8192263011')

Then I zoomed out in the browser so that all the books were displayed (this makes the page actually run its scripts).

[Screenshot: 100 books on Amazon]

Next I retrieved the page source and scraped the titles and authors using Beautiful Soup.

[Screenshot: code in JupyterLab]
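The general shape of that step can be sketched as follows. This is not the code from the screenshot. The sample HTML and the class names book, title and author are hypothetical stand-ins, because the real Amazon markup is different and changes over time. In a live session, the string returned by driver.page_source would take the place of SAMPLE_HTML.

```python
from bs4 import BeautifulSoup

# Stand-in for driver.page_source; markup and class names are hypothetical.
SAMPLE_HTML = """
<div class="book"><span class="title">First Book</span><span class="author">Author A</span></div>
<div class="book"><span class="title">Second Book</span><span class="author">Author B</span></div>
"""


def extract_books(html):
    """Return a list of {'title', 'author'} dicts, one per book block in the HTML."""
    soup = BeautifulSoup(html, "html.parser")
    books = []
    for card in soup.find_all("div", class_="book"):
        title = card.find("span", class_="title")
        author = card.find("span", class_="author")
        books.append({
            # Missing elements are recorded as None instead of raising an error.
            "title": title.get_text(strip=True) if title else None,
            "author": author.get_text(strip=True) if author else None,
        })
    return books


books = extract_books(SAMPLE_HTML)
for book in books:
    print(book["title"], "/", book["author"])
```

A list of dictionaries like this can be passed straight to pd.DataFrame if you want the results as a table.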

To be continued.

VI. References

Web Scraping with Selenium in Python — Amazon Search Result (Part 1)

Web Scraping using Beautiful Soup and Selenium for dynamic page

This page documents how to start using ChromeDriver for testing your website on desktop

Scraping Amazon results with Selenium and Python.

https://selenium-python.readthedocs.io/installation.html

Selenium with Python Tutorial: Getting started with Test Automation

How to run Selenium tests on Chrome using ChromeDriver

https://pypi.org/project/chromedriver-py/

Macos: Selenium gives “selenium.common.exceptions.WebDriverException: Message: unknown error: cannot find Chrome binary” on Mac

By mishou
