Learning to Crawl - Building a Bare Bones Web Crawler with Elixir
Originally posted on East5th. Check out the original post.
I’ve been cooking up a side project recently that involves crawling through a domain, searching for links to specific websites.
While I’m keeping the details of the project shrouded in mystery for now, building out a web crawler using Elixir sounds like a fantastic learning experience.
Let’s roll up our sleeves and dig into it!
Let’s Think About What We Want
Before we start writing code, it’s important to think about what we want to build.
We’ll kick off the crawling process by handing our web crawler a starting URL. The crawler will fetch the page at that URL and look for links (<a>
tags with href
attributes) to other pages.
If the crawler finds a link to a page on the same domain, it’ll repeat the crawling process on that new page. This second crawl is considered one level “deeper” than the first. Our crawler should have a configurable maximum depth.
Once the crawler has traversed all pages on the given domain up to the specified depth, it will return a list of all internal and external URLs that the crawled pages link to.
To keep things efficient, let’s try to parallelize the crawling process as much as possible.
Hello, Crawler
Let’s start off our crawler project by creating a new Elixir project:
mix new hello_crawler
While we’re at it, let’s pull in two dependencies we’ll be using in this project: HTTPoison, a fantastic HTTP client for Elixir, and Floki, a simple HTML parser.
We’ll add them to our mix.exs
file:
defp deps do
[
{:httpoison, "~> 0.13"},
{:floki, "~> 0.18.0"}
]
end
And pull them into our project by running a Mix command from our shell:
mix deps.get
Now that we’ve set up the laid out the basic structure of our project, let’s start writing code!
Crawling a Page
Based on the description given above, we know that we’ll need some function (or set of functions) that takes in a starting URL, fetches the page, and looks for links.
Let’s start sketching out that function:
def get_links(url) do
[url]
end
So far our function just returns a list holding URL that was passed into it. That’s a start.
We’ll use HTTPoison
to fetch the page at that URL (being sure to specify that we want to follow 301
redirects), and pull the resulting HTML out of the body
field of the response:
with {:ok, %{body: body}} <- HTTPoison.get(url, [], follow_redirect: true) do
[url]
else
_ -> [url]
end
Notice that we’re using a with
block and pattern matching on a successful call to HTTPoison.get
. If a failure happens anywhere in our process, we abandon ship and return the current url
in a list.
Now we can use Floki
to search for any <a>
tags within our HTML and grab their href
attributes:
with {:ok, %{body: body}} <- HTTPoison.get(url, [], [follow_redirect: true]),
tags <- Floki.find(body, "a"),
hrefs <- Floki.attribute(tags, "href") do
[url | hrefs]
else
_ -> [url]
end
Lastly, let’s clean up our code by replacing our with
block with a new handle_response
function with multiple function heads to handle success and failure:
def handle_response({:ok, %{body: body}}, url) do
[url | body
|> Floki.find("a")
|> Floki.attribute("href")]
end
def handle_response(_response, url) do
[url]
end
def get_links(url) do
headers = []
options = [follow_redirect: true]
url
|> HTTPoison.get(headers, options)
|> handle_response(url)
end
This step is purely a stylistic choice. I find with
blocks helpful for sketching out solutions, but prefer the readability of branching through pattern matching.
Running our get_links
function against a familiar site should return a handful of internal and external links:
iex(1)> HelloCrawler.get_links("http://www.east5th.co/")
["http://www.east5th.co/", "/", "/blog/", "/our-work/", "/services/", "/blog/",
"/our-work/", "/services/",
"https://www.google.com/maps/place/Chattanooga,+TN/", "http://www.amazon.com/",
"https://www.cigital.com/", "http://www.tapestrysolutions.com/",
"http://www.surgeforward.com/", "/blog/", "/our-work/", "/services/"]
Quick, someone take a picture; it’s crawling!
Crawling Deeper
While it’s great that we can crawl a single page, we need to go deeper. We want to scape links from our starting page and recursively scrape any other pages we’re linked to.
The most naive way of accomplishing this would be to recurse on get_links
for every new list of links we find:
[url | body
|> Floki.find("a")
|> Floki.attribute("href")
|> Enum.map(&get_links/1) # Recursive call get_links
|> List.flatten]
While this works conceptually, it has a few problems:
- Relative URLs, like
/services
, don’t resolve correctly in our call toHTTPoison.get
. - A page that links to itself, or a set of pages that form a loop, will cause our crawler to enter an infinite loop.
- We’re not restricting crawling to our original host.
- We’re not counting depth, or enforcing any depth limit.
Let’s address each of these issues.
Handling Relative URLs
Many links within a page are relative; they don’t specify all of the necessary information to form a full URL.
For example, on http://www.east5th.co/
, there’s a link to /blog/
. Passing "/blog/"
into HTTPoison.get
will return an error, as HTTPoison
doesn’t know the context for this relative address.
We need some way of transforming these relative links into full-blown URLs given the context of the page we’re currently crawling. Thankfully, Elixir’s standard library ships with the fantastic URI
module that can help us do just that!
Let’s use URI.merge
and to_string
to transform all of the links scraped by our crawler into well-formed URLs that can be understood by HTTPoison.get
:
def handle_response({:ok, %{body: body}}, url) do
[url | body
|> Floki.find("a")
|> Floki.attribute("href")
|> Enum.map(&URI.merge(url, &1)) # Merge our URLs
|> Enum.map(&to_string/1) # Convert the merged URL to a string
|> Enum.map(&get_links/1)
|> List.flatten]
end
Now our link to /blog/
is transformed into http://www.east5th.co/blog/
before being passed into get_links
and subsequently HTTPoison.get
.
Perfect.
Preventing Loops
While our crawler is correctly resolving relative links, this leads directly to our next problem: our crawler can get trapped in loops.
The first link on http://www.east5th.co/
is to /
. This relative link is translated into http://www.east5th.co/
, which is once again passed into get_links
, causing an infinite loop.
To prevent looping in our crawler, we’ll need to maintain a list of all of the pages we’ve already crawled. Before we recursively crawl a new link, we need to verify that it hasn’t been crawled previously.
We’ll start by adding a new, private, function head for get_links
that accepts a path
argument. The path
argument holds all of the URLs we’ve visited on our way to the current URL.
defp get_links(url, path) do
headers = []
options = [follow_redirect: true]
url
|> HTTPoison.get(headers, options)
|> handle_response(url, path)
end
We’ll call our new private get_links
function from our public function head:
def get_links(url) do
get_links(url, []) # path starts out empty
end
Notice that our path
starts out as an empty list.
While we’re refactoring get_links
, let’s take a detour and add a third argument called context
:
defp get_links(url, path, context) do
url
|> HTTPoison.get(context.headers, context.options)
|> handle_response(url, path, context)
end
The context
argument is a map that lets us cleanly pass around configuration values without having to drastically increase the size of our function signatures. In this case, we’re passing in the headers
and options
used by HTTPoison.get
.
To populate context
, let’s change our public get_links
function to build up the map by merging user-provided options with defaults provided by the module:
def get_links(url, opts \\ []) do
context = %{
headers: Keyword.get(opts, :headers, @default_headers),
options: Keyword.get(opts, :options, @default_options)
}
get_links(url, [], context)
end
If the user doesn’t provide a value for headers
, or options
, we’ll use the defaults specified in the @default_headers
and @default_options
module attributes:
@default_headers []
@default_options [follow_redirect: true]
This quick refactor will help keep things clean going down the road.
Whenever we crawl a URL, we’ll add that URL to our path
, like a breadcrumb marking where we’ve been.
Next, we’ll filter out any links we find that we’ve already visited, and append the current url
to path
before recursing on get_links
:
path = [url | path] # Add a breadcrumb
[url | body
|> Floki.find("a")
|> Floki.attribute("href")
|> Enum.map(&URI.merge(url, &1))
|> Enum.map(&to_string/1)
|> Enum.reject(&Enum.member?(path, &1)) # Avoid loops
|> Enum.map(&get_links(&1, [&1 | path], context))
|> List.flatten]
It’s important to note that this doesn’t completely prevent the repeated fetching of the same page. It simply prevents the same page being refetched in a single recursive chain, or path.
Imagine if page A links to pages B and C. If page B or C link back to page A, page A will not be refetched. However, if both page B and page C link to page D, page D will be fetched twice.
We won’t worry about this problem in this article, but caching our calls to HTTPoison.get
might be a good solution.
Restricting Crawling to a Host
If we try and run our new and improved WebCrawler.get_links
function, we’ll notice that it takes a long time to run. In fact in most cases, it’ll never return!
The problem is that we’re not limiting our crawler to crawl only pages within our starting point’s domain. If we crawl http://www.east5th.co/
, we’ll eventually get linked to Google and Amazon, and from there, the crawling never ends.
We need to detect the host of the starting page we’re given, and restrict crawling to only that host.
Thankful, the URI
module once again comes to the rescue.
We can use URI.parse
to pull out the host
of our starting URL, and pass it into each call to get_links
and handle_response
via our context
map:
def get_links(url, opts \\ []) do
url = URI.parse(url)
context = %{
...
host: url.host # Add our initial host to our context
}
get_links(url, [], context)
end
Using the parsed host
, we can check if the url
passed into get_links
is a crawlable URL:
defp get_links(url, path, context) do
if crawlable_url?(url, context) do # check before we crawl
...
else
[url]
end
end
The crawlable_url?
function simply verifies that the host of the URL we’re attempting to crawl matches the host of our initial URL, passed in through our context
:
defp crawlable_url?(%{host: host}, %{host: initial}) when host == initial, do: true
defp crawlable_url?(_, _), do: false
This guard prevents us from crawling external URLs.
Because we passed in the result of URI.parse
into get_links
, our url
is now a URI
struct, rather than a string. We need to convert it back into a string before passing it into HTTPoison.get
and handle_response
:
if crawlable_url?(url, context) do
url
|> to_string # convert our %URI to a string
|> HTTPoison.get(context.headers, context.options)
|> handle_response(path, url, context)
else
[url]
end
Additionally, you’ve probably noticed that our collected list of links is no longer a list of URL strings. Instead, it’s a list of URI
structs.
After we finish crawling through our target site, let’s convert all of the resulting structs back into strings, and remove any duplicates:
def get_links(url, opts \\ []) do
...
get_links(url, [], context)
|> Enum.map(&to_string/1) # convert URI structs to strings
|> Enum.uniq # remove any duplicate urls
end
We’re crawling deep now!
Limiting Our Crawl Depth
Once again, if we let our crawler loose on the world, it’ll most likely take a long time to get back to us.
Because we’re not limiting how deep our crawler can traverse into our site, it’ll explore all possible paths that don’t involve retracing its steps. We need a way of telling it not to continue crawling once it’s already followed a certain number of links and reached a specified depth.
Due to the way we structured out solution, this is incredibly easy to implement!
All we need to do is add another guard in our get_links
function that makes sure that the length of path
hasn’t exceeded our desired depth:
if continue_crawl?(path, context) and crawlable_url?(url, context) do
The continue_crawl?
function verifies that the length of path
hasn’t grown past the max_depth
value in our context
map:
defp continue_crawl?(path, %{max_depth: max}) when length(path) > max, do: false
defp continue_crawl?(_, _), do: true
In our public get_links
function we can add max_depth
to our context
map, pulling the value from the user-provided opts
or falling back to the @default_max_depth
module attribute:
@default_max_depth 3
context = %{
...
max_depth: Keyword.get(opts, :max_depth, @default_max_depth)
}
As with our other opts
, the default max_depth
can be overridden by passing a custom max_depth
value into get_links
:
HelloCrawler.get_links("http://www.east5th.co/", max_depth: 5)
If our crawler tries to crawl deeper than the maximum allowed depth, continue_crawl?
returns false
and get_links
returns the url
being crawled, preventing further recursion.
Simple and effective.
Crawling in Parallel
While our crawler is fully functional at this point, it would be nice to improve upon it a bit. Instead of waiting for every path to be exhausted before crawling the next path, why can’t we crawl multiple paths in parallel?
Amazingly, parallelizing our web crawler is as simple as swapping our map over the links we find on a page with a parallelized map:
|> Enum.map(&(Task.async(fn -> get_links(URI.parse(&1), [&1 | path], context) end)))
|> Enum.map(&Task.await/1)
Instead of passing each scraped link into get_links
and waiting for it to fully crawl every sub-path before moving onto the next link, we pass all of our links into asynchronous calls to get_links
, and then wait for all of those asynchronous calls to return.
For every batch of crawling we do, we really only need to wait for the slowest page to be crawled, rather than waiting one by one for every page to be crawled.
Efficiency!
Final Thoughts
It’s been a process, but we’ve managed to get our simple web crawler up and running.
While we accomplished everything we set out to do, our final result is still very much a bare bones web crawler. A more mature crawler would come equipped with more sophisticated scraping logic, rate limiting, caching, and more efficient optimizations.
That being said, I’m proud of what we’ve accomplished. Be sure to check out the entire HelloCrawler
project on Github.
I’d like to thank Mischov for his incredibly helpful suggestions for improving this article and the underlying codebase. If you have any other suggestions, let me know!
Hi! I am a robot. I just upvoted you! I found similar content that readers might be interested in:
http://www.east5th.co/blog/2017/10/09/learning-to-crawl-building-a-bare-bones-web-crawler-with-elixir/
Congratulations @petecorey! You have completed some achievement on Steemit and have been rewarded with new badge(s) :
Award for the number of posts published
Click on any badge to view your own Board of Honor on SteemitBoard.
For more information about SteemitBoard, click here
If you no longer want to receive notifications, reply to this comment with the word
STOP