Web crawler: A Scrapy Crawl Spider Tutorial


Have you ever had to extract lots of data from a website? Scrapy is a simple and flexible solution for exactly this task: a Python framework that lets you easily write your own specialized web crawler.

1. Set up your environment

I usually use Vagrant boxes for Python applications. To easily set up a basic Scrapy environment, just copy and run this Vagrant config; it will provision the box automatically with all dependencies for Scrapy.

Vagrant.configure(2) do |config|
 
  config.vm.box = "ubuntu/trusty64"             # Base box: Ubuntu 14.04 LTS
  config.vm.synced_folder "./", "/vagrant_data" # Share the project folder with the box
 
  config.vm.provider "virtualbox" do |vb|
     vb.gui = false      # Headless VM
     vb.memory = "2048"  # 2 GB RAM
  end
  
  config.vm.provision "shell", inline: <<-SHELL
     # Add the Scrapy apt repository, install Scrapy and upgrade pyasn1
     sudo apt-key adv --keyserver hkp://keyserver.ubuntu.com:80 --recv 627220E7
     echo 'deb http://archive.scrapy.org/ubuntu scrapy main' | sudo tee /etc/apt/sources.list.d/scrapy.list
     sudo apt-get update && sudo apt-get install scrapy -y
     sudo pip install pyasn1 --upgrade
  SHELL
 
end
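
Assuming Vagrant and VirtualBox are already installed on your host, save this configuration as a file named Vagrantfile and start the box from the same directory:

vagrant up
vagrant ssh

The provision script runs on the first boot, so after logging in Scrapy should be available inside the box.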

Alternatively, if you don't want to use Vagrant, just copy the commands from the provision section and run them directly in your terminal.

sudo apt-key adv --keyserver hkp://keyserver.ubuntu.com:80 --recv 627220E7
echo 'deb http://archive.scrapy.org/ubuntu scrapy main' | sudo tee /etc/apt/sources.list.d/scrapy.list
sudo apt-get update && sudo apt-get install scrapy -y
sudo pip install pyasn1 --upgrade

If everything is installed correctly, you should be able to open the Scrapy shell by typing:

scrapy shell "https://google.com"
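
The shell downloads the page and drops you into an interactive session with a ready-made response object. As a quick sanity check you could, for example, query the page title (the exact prompt and output may differ on your machine):

>>> response.status
200
>>> response.xpath('//title/text()').extract_first()
'Google'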

2. Writing a minimal web crawler.

We are ready to build our first crawl spider! Fortunately, this is a pretty easy task thanks to Scrapy. In the next steps, we will build a spider that crawls all blog posts of a basic WordPress site. The following code shows what a basic crawl spider looks like:

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
 
class ExampleSpider(CrawlSpider):
    name = "my-crawler" #Spider name
    allowed_domains = ["example.com"] # Which (sub-)domains shall be scraped?
    start_urls = ["https://example.com/"] # Start with this one
    rules = [Rule(LinkExtractor(), callback='parse_item', follow=True)] # Follow any link scrapy finds (that is allowed).
 
    def parse_item(self, response):
        print('Got a response from %s.' % response.url)

Save this code into a file named spider.py and run it with:

scrapy runspider spider.py
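
Scrapy starts at the URL in start_urls, follows every link within the allowed domain, and prints a "Got a response from …" line for each page it parses in between the log output. While testing, it can be handy to stop after a few pages; Scrapy settings can be overridden from the command line for that, e.g.:

scrapy runspider spider.py -s CLOSESPIDER_PAGECOUNT=10

CLOSESPIDER_PAGECOUNT is a built-in setting that closes the spider after the given number of responses.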

3. How to adapt the crawler to extract only specific parts of a page.

Every page has its own structure, so to extract only specific parts of it, you will need to adapt your spider to the given circumstances. In this example, we want to crawl every post on a WordPress site, that is, we want to extract the title and the text content of each post. The post links are standardized and follow a fixed URL scheme:

https://example.com/year/month/day/post-entry-title

Thus, this can be matched by the following regular expression:

/[0-9]+/[0-9]+/[0-9]+/.+

However, some links, such as the sharing links, are built the same way but do not contain any post content. Therefore, we want to exclude links containing the string “share”:

.+share.+

Finally, we can add these characteristics to our Scrapy rule in order to match only blog post links:

rules = [Rule(LinkExtractor(allow=(r'/[0-9]+/[0-9]+/[0-9]+/.+'), deny=(r'.+share.+')), callback='parse_item', follow=True)]
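
If you want to convince yourself that the patterns behave as intended, you can test them with Python's re module outside of Scrapy. The URLs below are made up for illustration only; the check uses re.search, i.e. the pattern only has to occur somewhere in the URL:

import re

allow = r'/[0-9]+/[0-9]+/[0-9]+/.+'
deny = r'.+share.+'

# Made-up URLs for illustration only.
post_url = 'https://example.com/2017/08/15/some-post-entry-title'
share_url = 'https://example.com/2017/08/15/some-post-entry-title/?share=twitter'

print(bool(re.search(allow, post_url)))   # True  -> link would be followed
print(bool(re.search(deny, post_url)))    # False -> not excluded
print(bool(re.search(deny, share_url)))   # True  -> excluded as a sharing link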

To extract the title and the post text, we need to inspect the HTML of the website in question. For example, the HTML of a blog post could be structured as follows:

<article>
  <div class="post-header">
    <h1 class="post-title">Some Title</h1>
    ...
  </div>
  <div class="post-content">
    <p>Some Text</p>
    ...
  </div>
  ...
</article>

For this purpose, import the Scrapy Selector and construct XPath expressions (the Scrapy documentation on selectors has further examples) that mirror this HTML structure:

from scrapy.selector import Selector

selector = Selector(response)

title_selector_str = '//article/div[@class="post-header"]/h1[@class="post-title"]/text()'
posttext_selector_str = '//article/div[@class="post-content"]//text()'

title = selector.xpath(title_selector_str).extract()[0]
post = selector.xpath(posttext_selector_str).extract()

print('Title: %s \n' % title)
print('Content: %s \n' % post)
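
As a side note, recent Scrapy versions let you skip constructing a Selector yourself: the response object exposes the same xpath() method, and extract_first() returns None instead of raising an IndexError when nothing matches. A minimal sketch, assuming the same HTML structure as above:

# Equivalent extraction without an explicit Selector (assumes the same markup as above).
title = response.xpath('//article/div[@class="post-header"]/h1[@class="post-title"]/text()').extract_first()
post = response.xpath('//article/div[@class="post-content"]//text()').extract()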

4. Putting it all together.

If you combine all code snippets, you’ll get a fully working web crawler:

from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from scrapy.selector import Selector

class ExampleSpider(CrawlSpider):
    name = "my-crawler" #Spider name
    allowed_domains = ["example.com"] # Which (sub-)domains shall be scraped?
    start_urls = ["https://example.com/"] # Start with this one

    # Follow any link scrapy finds (that is allowed and matches the patterns).
    rules = [Rule(LinkExtractor(allow=(r'/[0-9]+/[0-9]+/[0-9]+/.+'), deny=(r'.+share.+')), callback='parse_item', follow=True)] 

    def parse_item(self, response):

        print('Got a response from %s.' % response.url)

        selector = Selector(response)

        title_selector_str = '//article/div[@class="post-header"]/h1[@class="post-title"]/text()'
        posttext_selector_str = '//article/div[@class="post-content"]//text()'

        title = selector.xpath(title_selector_str).extract()[0]
        post = selector.xpath(posttext_selector_str).extract()

        print('Title: %s \n' % title)
        print('Content: %s \n' % post)
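
Printing the results is fine for a quick test, but if you want to keep the scraped data, a common pattern is to yield a dict from parse_item and let Scrapy's feed export write it to a file. A sketch of how parse_item could look instead (same selectors as above; the field names are just an example):

    def parse_item(self, response):
        selector = Selector(response)
        title = selector.xpath('//article/div[@class="post-header"]/h1[@class="post-title"]/text()').extract_first()
        post = selector.xpath('//article/div[@class="post-content"]//text()').extract()
        # Yielding a dict lets Scrapy's feed exports serialize the item.
        yield {'url': response.url, 'title': title, 'content': ''.join(post)}

Run the spider with an output file and Scrapy collects every yielded item:

scrapy runspider spider.py -o posts.json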
Comments

Been thinking about getting into Python - I've done most of my scraping in R. Starting to do some in C# too lately since linqpad is pretty cool imo

You wouldn't regret Python ;-) I like the "pythonic way": Clean, short code you can read and understand (even without any comments).
And it has a lot of packages that make it easy to work with any kind of data such as numpy, pandas or scipy!