Have you ever had to extract lots of data from a website? Scrapy offers a simple solution: it is a Python framework that lets you easily write your own specialized web crawler.
1. Set up your environment
I usually use Vagrant boxes for Python applications. To easily set up a basic Scrapy environment, just copy this Vagrant config and run vagrant up; it will provision the box automatically with all dependencies for Scrapy.
Vagrant.configure(2) do |config|
  config.vm.box = "ubuntu/trusty64"
  config.vm.synced_folder "./", "/vagrant_data"

  config.vm.provider "virtualbox" do |vb|
    vb.gui = false
    vb.memory = "2048"
  end

  config.vm.provision "shell", inline: <<-SHELL
    sudo apt-key adv --keyserver hkp://keyserver.ubuntu.com:80 --recv 627220E7
    echo 'deb http://archive.scrapy.org/ubuntu scrapy main' | sudo tee /etc/apt/sources.list.d/scrapy.list
    sudo apt-get update && sudo apt-get install scrapy -y
    sudo pip install pyasn1 --upgrade
  SHELL
end
Alternatively, just copy the commands from the provision section and paste them into your terminal:
sudo apt-key adv --keyserver hkp://keyserver.ubuntu.com:80 --recv 627220E7
echo 'deb http://archive.scrapy.org/ubuntu scrapy main' | sudo tee /etc/apt/sources.list.d/scrapy.list
sudo apt-get update && sudo apt-get install scrapy -y
sudo pip install pyasn1 --upgrade
If everything is installed correctly, you should be able to open the Scrapy shell by typing:
scrapy shell "https://google.com"
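Inside the shell, Scrapy exposes the fetched page as a response object that you can query interactively. A minimal sketch of what you might try there (the exact output depends on the page you fetched):

# These lines are typed at the Scrapy shell prompt (a regular Python shell).
response.url                                       # the URL that was actually fetched
response.xpath('//title/text()').extract_first()   # the page title as a string
fetch('https://example.com/')                      # shell helper: load another page into response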
2. Writing a minimal web crawler.
We are ready to build our first crawl spider! Fortunately, this is a pretty easy task thanks to Scrapy. In the next steps, we will build a spider that crawls all blog posts of a basic WordPress site. The following code shows what a basic crawl spider looks like:
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class ExampleSpider(CrawlSpider):
    name = "my-crawler"                    # Spider name
    allowed_domains = ["example.com"]      # Which (sub-)domains shall be scraped?
    start_urls = ["https://example.com/"]  # Start with this one
    rules = [Rule(LinkExtractor(), callback='parse_item', follow=True)]  # Follow any link Scrapy finds (that is allowed).

    def parse_item(self, response):
        print('Got a response from %s.' % response.url)
Save this code into a file named spider.py and run it with:
scrapy runspider spider.py
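If you prefer to start the spider from a plain Python script rather than the scrapy command line, a minimal sketch using Scrapy's CrawlerProcess could look like this (the file name run.py and the import of ExampleSpider from spider.py are assumptions for illustration):

# run.py -- a minimal sketch; assumes spider.py from above sits next to this file
from scrapy.crawler import CrawlerProcess
from spider import ExampleSpider

process = CrawlerProcess()    # uses default settings
process.crawl(ExampleSpider)  # schedule the spider
process.start()               # block until the crawl finishes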
3. How to adapt the crawler to extract only specific parts of a page.
Every page has its own structure, so to extract only specific parts of it you will need to adapt your spider to the given circumstances. In this example, we want to crawl every post on a WordPress site, that is, we need the title and the text content of each post. The post links are standardized and follow a fixed URL scheme:
https://example.com/year/month/day/post-entry-title
Thus, this can be matched by the following regular expression:
/[0-9]+/[0-9]+/[0-9]+/.+
However, some links, for example sharing links, are built basically the same way but do not contain any content. So we want to exclude links containing the string “share”:
.+share.+
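If you want to convince yourself that these two patterns behave as expected before wiring them into the spider, you can test them with Python's re module (LinkExtractor matches its allow/deny patterns against the URL in much the same way). The example URLs below are made up for illustration:

import re

# The same patterns we will pass to LinkExtractor below.
allow = re.compile(r'/[0-9]+/[0-9]+/[0-9]+/.+')
deny = re.compile(r'.+share.+')

# Made-up example URLs for illustration.
urls = [
    'https://example.com/2016/05/17/post-entry-title',             # blog post    -> crawl
    'https://example.com/2016/05/17/post-entry-title/?share=mail', # sharing link -> skip
    'https://example.com/about',                                    # no date part -> skip
]

for url in urls:
    keep = bool(allow.search(url)) and not deny.search(url)
    print('%s -> %s' % (url, 'crawl' if keep else 'skip'))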
Finally, we can add these characteristics to our Scrapy rule in order to match only blog post links:
rules = [Rule(LinkExtractor(allow=(r'/[0-9]+/[0-9]+/[0-9]+/.+'), deny=(r'.+share.+')), callback='parse_item', follow=True)]
To extract the title and post text, we need to inspect the HTML of the respective website. For example, the HTML of a blog post could be structured as follows:
<article>
  <div class="post-header">
    <h1 class="post-title">Some Title</h1>
    ...
  </div>
  <div class="post-content">
    <p>Some Text</p>
    ...
  </div>
  ...
</article>
For this purpose, import the Scrapy Selector and construct XPath expressions (further examples can be found in the Scrapy documentation) that replicate this HTML structure:
selector = Selector(response)
title_selector_str = '//article/div[@class="post-header"]/h1[@class="post-title"]/text()'
posttext_selector_str = '//article/div[@class="post-content"]//text()'
title = selector.xpath(title_selector_str).extract()[0]
post = selector.xpath(posttext_selector_str).extract()
print('Title: %s \n' % title)
print('Content: %s \n' % post)
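Depending on your Scrapy version, you can also skip constructing a Selector yourself and use the response's built-in selector shortcuts. A minimal equivalent sketch with the same XPath expressions:

# Equivalent extraction using response.xpath() directly (available in Scrapy 1.x and later).
title = response.xpath('//article/div[@class="post-header"]/h1[@class="post-title"]/text()').extract_first()
post = response.xpath('//article/div[@class="post-content"]//text()').extract()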
4. Putting it all together.
If you combine all code snippets, you’ll get a fully working web crawler:
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from scrapy.selector import Selector

class ExampleSpider(CrawlSpider):
    name = "my-crawler"                    # Spider name
    allowed_domains = ["example.com"]      # Which (sub-)domains shall be scraped?
    start_urls = ["https://example.com/"]  # Start with this one

    # Follow any link Scrapy finds (that is allowed and matches the patterns).
    rules = [Rule(LinkExtractor(allow=(r'/[0-9]+/[0-9]+/[0-9]+/.+'), deny=(r'.+share.+')),
                  callback='parse_item', follow=True)]

    def parse_item(self, response):
        print('Got a response from %s.' % response.url)
        selector = Selector(response)
        title_selector_str = '//article/div[@class="post-header"]/h1[@class="post-title"]/text()'
        posttext_selector_str = '//article/div[@class="post-content"]//text()'
        title = selector.xpath(title_selector_str).extract()[0]
        post = selector.xpath(posttext_selector_str).extract()
        print('Title: %s \n' % title)
        print('Content: %s \n' % post)
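In practice you will usually want to store the extracted data instead of just printing it. One common approach (a sketch, not part of the original spider) is to yield a dict per post from parse_item, so Scrapy's feed exporters can write the results to a file:

    def parse_item(self, response):
        selector = Selector(response)
        title = selector.xpath('//article/div[@class="post-header"]/h1[@class="post-title"]/text()').extract_first()
        post = selector.xpath('//article/div[@class="post-content"]//text()').extract()
        # Each yielded dict becomes an item that the built-in feed exporters can serialize.
        yield {'title': title, 'content': ''.join(post)}

You can then run the spider with an output file, e.g. scrapy runspider spider.py -o posts.json, and Scrapy will collect all yielded items into posts.json.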
Been thinking about getting into Python - I've done most of my scraping in R. Starting to do some in C# too lately since linqpad is pretty cool imo
You wouldn't regret Python ;-) I like the "pythonic way": Clean, short code you can read and understand (even without any comments).
And it has a lot of packages that make it easy to work with any kind of data such as numpy, pandas or scipy!