Last Updated on April 30, 2023 by mishou

I. Scrapy and BeautifulSoup

Scrapy is a powerful library that can be used to extract data from web pages and XML files. I usually use Pandas’ read_html() to get data from HTML tables, and I have used BeautifulSoup for text and other types of data. But Scrapy is a full-fledged web scraper, and it is sometimes said that you should learn Scrapy if you want to do serious web scraping.

Scrapy has a reputation for a steep learning curve and for not being beginner-friendly, but I believe the curve becomes much gentler if you run Scrapy on Google Colaboratory.

You can learn the advantages and disadvantages of each library at Difference between BeautifulSoup and Scrapy crawler. It reads:



BeautifulSoup

Advantages:

  • Easy for beginners to learn and master in web scraping.
  • It has good community support to figure out the issue.
  • It has good comprehensive documentation.

Disadvantages:

  • It has an external Python dependency.

Scrapy crawler

Advantages:

  • It is easily extensible.
  • It has built-in support for extracting data.
  • It has very fast speed compared to other libraries.
  • It is both memory and CPU efficient.
  • You can also build robust and extensive applications.
  • Has strong community support.

Disadvantages:

  • It has light documentation for beginners.

II. Create a spider

Spiders are classes that you define and that Scrapy uses to scrape information from a website. Let’s create our first spider. You can learn the scripts here:

Scrapy Tutorial

1. Creating the files and directories for our project

# install scrapy
!pip install Scrapy
# create files for learning
!scrapy startproject firstproject

2. Creating the spider file and saving it

Change the current working directory to the spiders directory with os.chdir().

# change working directories
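A minimal sketch of this step, assuming the startproject cell above was run in the notebook's current directory. By default `scrapy startproject firstproject` generates a nested layout, so the spiders directory sits at firstproject/firstproject/spiders; the relative path below relies on that assumption.

```python
import os

# Assumption: `scrapy startproject firstproject` was run in the current
# directory, producing firstproject/firstproject/spiders.
spiders_dir = os.path.join("firstproject", "firstproject", "spiders")

if os.path.isdir(spiders_dir):
    # Later cells will now read and write inside the spiders directory
    os.chdir(spiders_dir)
else:
    print(f"{spiders_dir} not found; run the startproject cell first")

# Show where we ended up
print(os.getcwd())
```

The existence check is only there so that re-running the cell after the directory has already changed prints a hint instead of raising an error.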

Create the spider file and save it under the spiders directory using an IPython magic command, %%writefile.
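In Colab, a cell that begins with %%writefile followed by a file name saves the rest of the cell to that file. The sketch below does the same thing in plain Python so it can be run anywhere. The file name quotes_spider.py, the class QuotesSpider, and the start URL are assumptions for illustration (they follow the Scrapy tutorial's practice site), not necessarily what the original cell contained.

```python
from pathlib import Path

# Hypothetical file name; the original cell may have used a different one.
SPIDER_FILE = Path("quotes_spider.py")

SPIDER_SOURCE = '''\
import scrapy


class QuotesSpider(scrapy.Spider):
    # `name` identifies the spider, e.g. for `scrapy crawl quotes`
    name = "quotes"
    # Assumed start URL (the practice site used in the Scrapy tutorial)
    start_urls = ["https://quotes.toscrape.com/page/1/"]

    def parse(self, response):
        # Yield one item per quote block found on the page
        for quote in response.css("div.quote"):
            yield {"text": quote.css("span.text::text").get()}
'''

# Same effect as starting a Colab cell with `%%writefile quotes_spider.py`
SPIDER_FILE.write_text(SPIDER_SOURCE, encoding="utf-8")

# Syntax-check the written file without importing Scrapy
compile(SPIDER_FILE.read_text(encoding="utf-8"), str(SPIDER_FILE), "exec")
```

With the file saved in the spiders directory, Scrapy can find the spider by its name attribute when you run it from inside the project.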


III. Extracting data using Scrapy shell


Now, run the following code:

!scrapy shell ''

Then enter the following commands one at a time:

quote = response.css("div.quote")[0]
text = quote.css("span.text::text").get()

The variable text now holds the first quote. Enter text to display it:

'“The world as we have created it is a process of our thinking. It cannot be changed without changing our thinking.”'
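To see what the two selectors above pick out, here is a rough stand-in written with Python's standard html.parser. It is not Scrapy's selector engine, and the HTML snippet is a hypothetical fragment shaped like a quote block, so treat it only as an illustration: "div.quote" matches div elements whose class is quote, and "span.text::text" takes the text inside span elements whose class is text.

```python
from html.parser import HTMLParser

# Hypothetical snippet shaped like a quote block (an assumption)
HTML = """
<div class="quote">
  <span class="text">“First quote.”</span>
  <small class="author">Someone</small>
</div>
<div class="quote">
  <span class="text">“Second quote.”</span>
</div>
"""


class QuoteTextParser(HTMLParser):
    """Collect the text of span.text elements found inside div.quote."""

    def __init__(self):
        super().__init__()
        self.in_quote = False  # True while inside <div class="quote">
        self.in_text = False   # True while inside <span class="text">
        self.texts = []

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if tag == "div" and "quote" in classes:
            self.in_quote = True
        elif tag == "span" and "text" in classes and self.in_quote:
            self.in_text = True

    def handle_endtag(self, tag):
        if tag == "span":
            self.in_text = False
        elif tag == "div":
            self.in_quote = False

    def handle_data(self, data):
        if self.in_text:
            self.texts.append(data)


parser = QuoteTextParser()
parser.feed(HTML)

# Counterpart of taking element [0] and calling .get() on its text
text = parser.texts[0]
print(text)
```

In the Scrapy shell the selectors do this work for you, and .get() returns the first match as a string.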

You can quit the Scrapy shell with Ctrl + C. You can see the scripts here:

To be continued.

By mishou
