Learning to Crawl - Building a Bare Bones Web Crawler with Elixir

in #programming7 years ago

Learning to Crawl - Building a Bare Bones Web Crawler with Elixir

 

Originally posted on East5th. Check out the original post.

I’ve been cooking up a side project recently that involves crawling through a domain, searching for links to specific websites.

While I’m keeping the details of the project shrouded in mystery for now, building out a web crawler using Elixir sounds like a fantastic learning experience.

Let’s roll up our sleeves and dig into it!

Let’s Think About What We Want

 
Before we start writing code, it’s important to think about what we want to build.

We’ll kick off the crawling process by handing our web crawler a starting URL. The crawler will fetch the page at that URL and look for links (<a> tags with href attributes) to other pages.

If the crawler finds a link to a page on the same domain, it’ll repeat the crawling process on that new page. This second crawl is considered one level “deeper” than the first. Our crawler should have a configurable maximum depth.

Once the crawler has traversed all pages on the given domain up to the specified depth, it will return a list of all internal and external URLs that the crawled pages link to.

To keep things efficient, let’s try to parallelize the crawling process as much as possible.

Hello, Crawler

 
Let’s start off our crawler project by creating a new Elixir project:

mix new hello_crawler

 
While we’re at it, let’s pull in two dependencies we’ll be using in this project: HTTPoison, a fantastic HTTP client for Elixir, and Floki, a simple HTML parser.

We’ll add them to our mix.exs file:

defp deps do
  [
    {:httpoison, "~> 0.13"},
    {:floki, "~> 0.18.0"}
  ]
end

 
And pull them into our project by running a Mix command from our shell:

mix deps.get

 
Now that we’ve set up the laid out the basic structure of our project, let’s start writing code!

Crawling a Page

 
Based on the description given above, we know that we’ll need some function (or set of functions) that takes in a starting URL, fetches the page, and looks for links.

Let’s start sketching out that function:

def get_links(url) do
  [url]
end

 
So far our function just returns a list holding URL that was passed into it. That’s a start.

We’ll use HTTPoison to fetch the page at that URL (being sure to specify that we want to follow 301 redirects), and pull the resulting HTML out of the body field of the response:

with {:ok, %{body: body}} <- HTTPoison.get(url, [], follow_redirect: true) do
  [url]
else
  _ -> [url]
end

 
Notice that we’re using a with block and pattern matching on a successful call to HTTPoison.get. If a failure happens anywhere in our process, we abandon ship and return the current url in a list.

Now we can use Floki to search for any <a> tags within our HTML and grab their href attributes:

with {:ok, %{body: body}} <- HTTPoison.get(url, [], [follow_redirect: true]),
     tags                 <- Floki.find(body, "a"),
     hrefs                <- Floki.attribute(tags, "href") do
  [url | hrefs]
else
  _ -> [url]
end

 
Lastly, let’s clean up our code by replacing our with block with a new handle_response function with multiple function heads to handle success and failure:

def handle_response({:ok, %{body: body}}, url) do
  [url | body
         |> Floki.find("a")
         |> Floki.attribute("href")]
end

def handle_response(_response, url) do
  [url]
end

def get_links(url) do
  headers = []
  options = [follow_redirect: true]
  url
  |> HTTPoison.get(headers, options)
  |> handle_response(url)
end

 
This step is purely a stylistic choice. I find with blocks helpful for sketching out solutions, but prefer the readability of branching through pattern matching.

hello-crawler.png

Running our get_links function against a familiar site should return a handful of internal and external links:

iex(1)> HelloCrawler.get_links("http://www.east5th.co/")
["http://www.east5th.co/", "/", "/blog/", "/our-work/", "/services/", "/blog/",
 "/our-work/", "/services/",
 "https://www.google.com/maps/place/Chattanooga,+TN/", "http://www.amazon.com/",
 "https://www.cigital.com/", "http://www.tapestrysolutions.com/",
 "http://www.surgeforward.com/", "/blog/", "/our-work/", "/services/"]

 
Quick, someone take a picture; it’s crawling!

Crawling Deeper

 
While it’s great that we can crawl a single page, we need to go deeper. We want to scape links from our starting page and recursively scrape any other pages we’re linked to.

The most naive way of accomplishing this would be to recurse on get_links for every new list of links we find:

[url | body
       |> Floki.find("a")
       |> Floki.attribute("href")
       |> Enum.map(&get_links/1) # Recursive call get_links
       |> List.flatten]

 
While this works conceptually, it has a few problems:

  • Relative URLs, like /services, don’t resolve correctly in our call to HTTPoison.get.
  • A page that links to itself, or a set of pages that form a loop, will cause our crawler to enter an infinite loop.
  • We’re not restricting crawling to our original host.
  • We’re not counting depth, or enforcing any depth limit.

Let’s address each of these issues.

Handling Relative URLs

 
Many links within a page are relative; they don’t specify all of the necessary information to form a full URL.

For example, on http://www.east5th.co/, there’s a link to /blog/. Passing "/blog/" into HTTPoison.get will return an error, as HTTPoison doesn’t know the context for this relative address.

We need some way of transforming these relative links into full-blown URLs given the context of the page we’re currently crawling. Thankfully, Elixir’s standard library ships with the fantastic URI module that can help us do just that!

Let’s use URI.merge and to_string to transform all of the links scraped by our crawler into well-formed URLs that can be understood by HTTPoison.get:

def handle_response({:ok, %{body: body}}, url) do
  [url | body
         |> Floki.find("a")
         |> Floki.attribute("href")
         |> Enum.map(&URI.merge(url, &1)) # Merge our URLs
         |> Enum.map(&to_string/1)        # Convert the merged URL to a string
         |> Enum.map(&get_links/1)
         |> List.flatten]
end

 
Now our link to /blog/ is transformed into http://www.east5th.co/blog/ before being passed into get_links and subsequently HTTPoison.get.

Perfect.

Preventing Loops

 
While our crawler is correctly resolving relative links, this leads directly to our next problem: our crawler can get trapped in loops.

The first link on http://www.east5th.co/ is to /. This relative link is translated into http://www.east5th.co/, which is once again passed into get_links, causing an infinite loop.

To prevent looping in our crawler, we’ll need to maintain a list of all of the pages we’ve already crawled. Before we recursively crawl a new link, we need to verify that it hasn’t been crawled previously.

We’ll start by adding a new, private, function head for get_links that accepts a path argument. The path argument holds all of the URLs we’ve visited on our way to the current URL.

defp get_links(url, path) do
  headers = []
  options = [follow_redirect: true]
  url
  |> HTTPoison.get(headers, options)
  |> handle_response(url, path)
end

 
We’ll call our new private get_links function from our public function head:

def get_links(url) do
  get_links(url, []) # path starts out empty
end

 
Notice that our path starts out as an empty list.


While we’re refactoring get_links, let’s take a detour and add a third argument called context:

defp get_links(url, path, context) do
  url
  |> HTTPoison.get(context.headers, context.options)
  |> handle_response(url, path, context)
end

 
The context argument is a map that lets us cleanly pass around configuration values without having to drastically increase the size of our function signatures. In this case, we’re passing in the headers and options used by HTTPoison.get.

To populate context, let’s change our public get_links function to build up the map by merging user-provided options with defaults provided by the module:

def get_links(url, opts \\ []) do
  context = %{
    headers: Keyword.get(opts, :headers, @default_headers),
    options: Keyword.get(opts, :options, @default_options)
  }
  get_links(url, [], context)
end

 
If the user doesn’t provide a value for headers, or options, we’ll use the defaults specified in the @default_headers and @default_options module attributes:

@default_headers []
@default_options [follow_redirect: true]

 
This quick refactor will help keep things clean going down the road.


Whenever we crawl a URL, we’ll add that URL to our path, like a breadcrumb marking where we’ve been.

crawler-path.png

Next, we’ll filter out any links we find that we’ve already visited, and append the current url to path before recursing on get_links:

path = [url | path] # Add a breadcrumb
[url | body
       |> Floki.find("a")
       |> Floki.attribute("href")
       |> Enum.map(&URI.merge(url, &1))
       |> Enum.map(&to_string/1)
       |> Enum.reject(&Enum.member?(path, &1)) # Avoid loops
       |> Enum.map(&get_links(&1, [&1 | path], context))
       |> List.flatten]

 
It’s important to note that this doesn’t completely prevent the repeated fetching of the same page. It simply prevents the same page being refetched in a single recursive chain, or path.

Imagine if page A links to pages B and C. If page B or C link back to page A, page A will not be refetched. However, if both page B and page C link to page D, page D will be fetched twice.

We won’t worry about this problem in this article, but caching our calls to HTTPoison.get might be a good solution.

Restricting Crawling to a Host

 
If we try and run our new and improved WebCrawler.get_links function, we’ll notice that it takes a long time to run. In fact in most cases, it’ll never return!

The problem is that we’re not limiting our crawler to crawl only pages within our starting point’s domain. If we crawl http://www.east5th.co/, we’ll eventually get linked to Google and Amazon, and from there, the crawling never ends.

We need to detect the host of the starting page we’re given, and restrict crawling to only that host.

Thankful, the URI module once again comes to the rescue.

We can use URI.parse to pull out the host of our starting URL, and pass it into each call to get_links and handle_response via our context map:

def get_links(url, opts \\ []) do
  url = URI.parse(url)
  context = %{
    ...
    host: url.host # Add our initial host to our context
  }
  get_links(url, [], context)
end

 
Using the parsed host, we can check if the url passed into get_links is a crawlable URL:

defp get_links(url, path, context) do
  if crawlable_url?(url, context) do # check before we crawl
    ...
  else
    [url]
  end
end

 
The crawlable_url? function simply verifies that the host of the URL we’re attempting to crawl matches the host of our initial URL, passed in through our context:

defp crawlable_url?(%{host: host}, %{host: initial}) when host == initial, do: true
defp crawlable_url?(_, _), do: false

 
This guard prevents us from crawling external URLs.


Because we passed in the result of URI.parse into get_links, our url is now a URI struct, rather than a string. We need to convert it back into a string before passing it into HTTPoison.get and handle_response:

if crawlable_url?(url, context) do
  url
  |> to_string # convert our %URI to a string
  |> HTTPoison.get(context.headers, context.options)
  |> handle_response(path, url, context)
else
  [url]
end

 
Additionally, you’ve probably noticed that our collected list of links is no longer a list of URL strings. Instead, it’s a list of URI structs.

After we finish crawling through our target site, let’s convert all of the resulting structs back into strings, and remove any duplicates:

def get_links(url, opts \\ []) do
  ...
  get_links(url, [], context)
  |> Enum.map(&to_string/1) # convert URI structs to strings
  |> Enum.uniq # remove any duplicate urls
end

 
We’re crawling deep now!

Limiting Our Crawl Depth

 
Once again, if we let our crawler loose on the world, it’ll most likely take a long time to get back to us.

Because we’re not limiting how deep our crawler can traverse into our site, it’ll explore all possible paths that don’t involve retracing its steps. We need a way of telling it not to continue crawling once it’s already followed a certain number of links and reached a specified depth.

Due to the way we structured out solution, this is incredibly easy to implement!

All we need to do is add another guard in our get_links function that makes sure that the length of path hasn’t exceeded our desired depth:

if continue_crawl?(path, context) and crawlable_url?(url, context) do

 
The continue_crawl? function verifies that the length of path hasn’t grown past the max_depth value in our context map:

defp continue_crawl?(path, %{max_depth: max}) when length(path) > max, do: false
defp continue_crawl?(_, _), do: true

 
In our public get_links function we can add max_depth to our context map, pulling the value from the user-provided opts or falling back to the @default_max_depth module attribute:

@default_max_depth 3

 

context = %{
  ...
  max_depth: Keyword.get(opts, :max_depth, @default_max_depth)
}

 
As with our other opts, the default max_depth can be overridden by passing a custom max_depth value into get_links:

HelloCrawler.get_links("http://www.east5th.co/", max_depth: 5)

 
If our crawler tries to crawl deeper than the maximum allowed depth, continue_crawl? returns false and get_links returns the url being crawled, preventing further recursion.

Simple and effective.

Crawling in Parallel

 
While our crawler is fully functional at this point, it would be nice to improve upon it a bit. Instead of waiting for every path to be exhausted before crawling the next path, why can’t we crawl multiple paths in parallel?

Amazingly, parallelizing our web crawler is as simple as swapping our map over the links we find on a page with a parallelized map:

|> Enum.map(&(Task.async(fn -> get_links(URI.parse(&1), [&1 | path], context) end)))
|> Enum.map(&Task.await/1)

 
Instead of passing each scraped link into get_links and waiting for it to fully crawl every sub-path before moving onto the next link, we pass all of our links into asynchronous calls to get_links, and then wait for all of those asynchronous calls to return.

For every batch of crawling we do, we really only need to wait for the slowest page to be crawled, rather than waiting one by one for every page to be crawled.

Efficiency!

Final Thoughts

 
It’s been a process, but we’ve managed to get our simple web crawler up and running.

While we accomplished everything we set out to do, our final result is still very much a bare bones web crawler. A more mature crawler would come equipped with more sophisticated scraping logic, rate limiting, caching, and more efficient optimizations.

That being said, I’m proud of what we’ve accomplished. Be sure to check out the entire HelloCrawler project on Github.

I’d like to thank Mischov for his incredibly helpful suggestions for improving this article and the underlying codebase. If you have any other suggestions, let me know!

Sort:  

Hi! I am a robot. I just upvoted you! I found similar content that readers might be interested in:
http://www.east5th.co/blog/2017/10/09/learning-to-crawl-building-a-bare-bones-web-crawler-with-elixir/

Congratulations @petecorey! You have completed some achievement on Steemit and have been rewarded with new badge(s) :

Award for the number of posts published

Click on any badge to view your own Board of Honor on SteemitBoard.
For more information about SteemitBoard, click here

If you no longer want to receive notifications, reply to this comment with the word STOP

By upvoting this notification, you can help all Steemit users. Learn how here!