SCRAPING WEATHER DATA USING THE BEAUTIFUL SOUP PYTHON LIBRARY WITH DJANGO

in Programming & Dev, 4 years ago

Before beginning our tutorial on how to scrape/extract data in Django, I would like to tell you what web scraping is, what it is used for, and whether it is legal or illegal.

Web scraping is the automated process of extracting data from web pages. It is used for various purposes: financial research and market study by scraping financial data, scraping Amazon's product data to compare with your own store's products, or simply scraping data about schools, colleges and universities to build a list of educational institutions in a particular area and provide this information to visitors of your site. The most heated discussion about web scraping this decade is whether it is legal or not. There doesn't seem to be a single law that governs the scraping of web information, so much depends on who is extracting the content and how.

Scraping is generally acceptable as long as you are not violating the terms and conditions of the website from which you are extracting data. It may become illegal if you scrape confidential data such as customer data, copyrighted data or content protected by login authentication, or even if you scrape your competitor's website for your own profit. Many websites now have dedicated software to block scrapers. They do so because an automated scraper can make too many requests and put a heavy load on the source website.

Web scraping is a popular topic today in the field of computer science and information technology. More and more large companies are using these tools and concepts for their growth. Make sure you are not using them unethically. Many sites now provide an API for data collection; be sure to use it instead of scraping where one exists.

Python has an amazing library called Beautiful Soup that helps programmers scrape data from web pages written in markup languages like HTML, XML and XHTML. In this tutorial, we are going to use this library along with Django to scrape weather data from the Bing website.


In this tutorial, we will create a simple application that lets the user enter the location whose weather they want to see. It will then show the temperature of that place along with the weather status (sunny, rainy, stormy and so on) and the date and time. The project requires an understanding of Python and the Django framework. Knowledge of the Beautiful Soup package is not required, though some familiarity with it will help. You should have Python and Django installed on your machine. My Python version is 3.8.5 and my Django version is 3.1.6.

This project is not going to be a complex one or a series. We won't be working with models or a database here, as they are not required. You can use any IDE of your choice; I am using PyCharm on Windows. First, let's open cmd and head over to the desktop, where we will create our project directory. Type cd desktop to go to the desktop. Then type django-admin startproject webscraping. It will create a new directory named webscraping on the desktop. Now go inside that directory by typing cd webscraping. Here we will create our app, which we will name weatherapp, so type this command: python manage.py startapp weatherapp.
Here is an overview of the commands we have used so far:

image.png

Now let's open this project in our IDE to see what the directory and file structure looks like.

image.png

You can also see the following tree structure of our files and folders.

│   manage.py
│
├───.idea
│   │   .gitignore
│   │   misc.xml
│   │   modules.xml
│   │   webscraping.iml
│   │
│   └───inspectionProfiles
│           profiles_settings.xml
│           Project_Default.xml
│
├───weatherapp
│   │   admin.py
│   │   apps.py
│   │   models.py
│   │   tests.py
│   │   views.py
│   │   __init__.py
│   │
│   └───migrations
│           __init__.py
│
└───webscraping
    │   asgi.py
    │   settings.py
    │   urls.py
    │   wsgi.py
    │   __init__.py
    │
    └───__pycache__
            settings.cpython-38.pyc
            __init__.cpython-38.pyc

We will be working with only a few of these files in our project. Now let's see if the project has been configured properly. Go to cmd and start the development server with the python manage.py runserver command. You should see the following screen. You can ignore the warning about 18 unapplied migrations for now, as we do not need them for this project.

image.png

Go to your browser and type http://127.0.0.1:8000. This will show you the default Django success page.

image.png

Now that everything has been configured properly, we will replace this homepage with our weather page design. First we need to register our app, weatherapp, in the Django settings. Open webscraping/settings.py and add the name of our app to the INSTALLED_APPS list.

INSTALLED_APPS = [
    'django.contrib.admin',
    'django.contrib.auth',
    'django.contrib.contenttypes',
    'django.contrib.sessions',
    'django.contrib.messages',
    'django.contrib.staticfiles',
    'weatherapp',
]

The next step is to create a template for our weather application. We will work with just one template for now. Following Django's default convention for locating templates, create a new directory named templates inside the weatherapp folder. Inside this templates folder, create another folder named after our app, weatherapp. Finally, create an HTML file named index.html inside this newly created folder. We will use a simple responsive Bootstrap card for the design.

<!doctype html>
<html lang="en">
<head>
    <meta charset="utf-8">
    <meta name="viewport" content="width=device-width, initial-scale=1, shrink-to-fit=no">
    <!-- Assumed version: the original pinned a specific Bootstrap 4 release; @4 loads the latest 4.x -->
    <link rel="stylesheet" href="https://cdn.jsdelivr.net/npm/bootstrap@4/dist/css/bootstrap.min.css">
    <link rel="stylesheet" href="https://pro.fontawesome.com/releases/v5.10.0/css/all.css">
    <title>Weather</title>
    <style>
        body{
            background-color: #130f40;
            background-image: linear-gradient(315deg, #130f40 0%, #000000 74%);
        }
    </style>
</head>
<body>

<div class="container col-md-5 text-white" style="margin-top: 60px;">
    <form method="get" action="">
        <div class="card bg-dark">
            <div class="card-header">Find the Weather</div>
            <div class="card-body">
                <div class="form-group">
                    <div class="input-group input-group-lg">
                        <div class="input-group-prepend">
                            <span class="input-group-text"><i class="fa fa-location-arrow"></i></span>
                        </div>
                        <input type="text" name="location" class="form-control" placeholder="Enter Location">
                    </div>
                    <input type="submit" class="btn btn-outline-light mt-3" value="Search">
                </div>
            </div>
        </div>
    </form>
</div>


<script src="https://code.jquery.com/jquery-3.5.1.slim.min.js"></script>
<!-- Assumed version: the original pinned a specific Bootstrap 4 release; @4 loads the latest 4.x -->
<script src="https://cdn.jsdelivr.net/npm/bootstrap@4/dist/js/bootstrap.bundle.min.js"></script>
</body>
</html>

To see how the design looks, open the weatherapp/views.py file and let's create a view for our homepage.

from django.shortcuts import render


def home(request):
    return render(request, 'weatherapp/index.html')

Now it is time to work with our URL paths. First open webscraping/urls.py. We need to import include from django.urls.

from django.contrib import admin
from django.urls import path, include

urlpatterns = [
    path('', include('weatherapp.urls')),
    path('admin/', admin.site.urls),
]

Basically, what we are saying is that if there is nothing after http://127.0.0.1:8000/, Django should check the URL patterns of our app, weatherapp. So create a new Python file called urls.py inside the weatherapp directory and define the following URL patterns in it.

from django.urls import path
from . import views

urlpatterns = [
    path('', views.home, name="home"),
]

We have imported our views. When Django encounters an empty path after the localhost address, it checks the URL patterns inside our weatherapp app. There we have defined that an empty path should show the home view, and the home view above renders the template. So now we can see how the template design looks. Check that your development server is running; if not, type python manage.py runserver. Then type the localhost address into your browser to see the following screen.

image.png

Now that the default homepage has been replaced with our template, we will make the Search button work and show weather data for the entered location.

We will be scraping the weather data from the Bing website, and only the following data: region name, temperature in Celsius, weather status and the daytime.

image.png

Please notice the URL format the website uses for showing the weather. If you want to see the weather of Delhi, you need this URL: https://www.bing.com/search?q=weather+in+delhi. In other words, the rest of the URL string stays the same and only the location changes. The user will enter the location through the form, and this location will be passed into the above Bing URL format.
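
As a minimal sketch of this URL pattern, the snippet below builds the search URL with Python's standard library. The helper name build_weather_url and the constant BING_SEARCH_URL are hypothetical names chosen for illustration and are not part of the project code; urlencode takes care of turning spaces into + and escaping other unsafe characters.

```python
from urllib.parse import urlencode

# Base of the Bing search URL shown above
BING_SEARCH_URL = "https://www.bing.com/search"


def build_weather_url(location):
    """Return the Bing search URL for 'weather in <location>' (hypothetical helper)."""
    # urlencode writes spaces as '+' and percent-encodes other unsafe characters
    query = urlencode({"q": f"weather in {location.strip()}"})
    return f"{BING_SEARCH_URL}?{query}"


print(build_weather_url("delhi"))     # https://www.bing.com/search?q=weather+in+delhi
print(build_weather_url("New York"))  # https://www.bing.com/search?q=weather+in+New+York
```

Letting the library do the escaping also covers locations that contain characters such as & or accented letters.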

To fetch and parse this page we need to install two popular Python packages: requests and beautifulsoup4. Go to your terminal and type: pip install requests. This library allows us to work with HTTP requests in a simpler way.

image.png

Now that the library has been installed, it's time to install Beautiful Soup by typing pip install beautifulsoup4.

image.png

This package has also been installed. Now we can scrape the web data from Bing with these two popular libraries. Google and Bing don't allow scraping data in an automated way, so our code needs to pretend that the requests are being made by a legitimate browser and not by a bot. We will create a new function in our views.py file and add these request headers.

import requests
from bs4 import BeautifulSoup

def get_html_content(location):
    USER_AGENT = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.114 Safari/537.36"
    LANGUAGE = "en-US,en;q=0.5"
    session = requests.Session()
    session.headers['User-Agent'] = USER_AGENT
    session.headers['Accept-Language'] = LANGUAGE
    session.headers['Content-Language'] = LANGUAGE

Note the imports of requests and Beautiful Soup at the top of views.py, which we need in order to use them here. Also change your home view function as below:

def home(request):
    if 'location' in request.GET:
        location = request.GET.get('location')
    return render(request, 'weatherapp/index.html')
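
One gap in the check above: 'location' in request.GET is also true when the form is submitted with an empty box, because the key is present with an empty value. The following is a small sketch of a guard for that case. The helper name extract_location is hypothetical, and a plain dict stands in for request.GET here since both support .get().

```python
def extract_location(params):
    """Return a cleaned location string, or None if nothing usable was entered."""
    raw = params.get("location", "")
    # Collapse runs of whitespace and trim both ends
    cleaned = " ".join(raw.split())
    return cleaned or None


print(extract_location({"location": "  New   York "}))  # New York
print(extract_location({"location": "   "}))            # None
print(extract_location({}))                             # None
```

In the view, this could be called as extract_location(request.GET), continuing only when the result is not None.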

Remember that in the design above we set the name of the input field to location, which is where the user enters the location. If 'location' is present in the request, we get the content of that input field and pass it as an argument to the get_html_content() function above. So this will be our final version of that function.

def get_html_content(location):
    USER_AGENT = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/89.0.4389.114 Safari/537.36"
    LANGUAGE = "en-US,en;q=0.5"
    session = requests.Session()
    session.headers['User-Agent'] = USER_AGENT
    session.headers['Accept-Language'] = LANGUAGE
    session.headers['Content-Language'] = LANGUAGE
    location = location.replace(" ", "+")
    html_content = session.get(f'https://www.bing.com/search?q=weather+in+{location}').text
    return html_content

The reason we are using the replace() function is this: suppose the user wants to find the weather of New York. There is a space between New and York, and the Bing website writes that space as a + in its URL, as shown here.

image.png

So if the user enters a location with a space in its name, we need to replace that space with "+" so that a properly formatted URL is sent to Bing. We will now receive this HTML content in our home view function. Let's try printing it.

def home(request):
    if 'location' in request.GET:
        location = request.GET.get('location')
        html_data = get_html_content(location)
        print(html_data)
    return render(request, 'weatherapp/index.html')

If you go to your browser, enter a sample location and look at your terminal, it prints all the HTML content there:

image.png

You can see thousands of lines of code here (you need to scroll a lot). This is where we will use Beautiful Soup to scrape the data from these lines. Before using it, let's check a few things on Bing: hover over the region name, which is Plano, TX in my case below, and inspect it.

image.png

Inspect will help you find the corresponding element of that website.

image.png

If you want to know the element and the class or id that holds the Cloudy status, hover over it and you will see the corresponding element and class name highlighted in the right-hand panel.

image.png

Now we know the following (class names are case-sensitive):

  1. Region name is stored in a <span> tag with class name wtr_foreGround.
  2. Daytime is stored in a <div> tag with class name wtr_dayTime.
  3. Weather status is stored in a <div> tag with class name wtr_caption.
  4. Temperature is stored in a <div> tag with class name wtr_currTemp.

Now we will use Beautiful Soup inside our home function to scrape data from these elements and class names. First we need to create a soup object that takes two parameters: the HTML content and the parser, which is the HTML parser in our case.

With this object we can find our desired element using the find() function. We pass the HTML element name, such as div, span or section, as the first parameter, and the class name or id of that element in dictionary form through attrs.

def home(request):
    if 'location' in request.GET:
        location = request.GET.get('location')
        html_data = get_html_content(location)
        soup = BeautifulSoup(html_data, "html.parser")
        region = soup.find('span', attrs={'class': 'wtr_foreGround'}).text
        daytime = soup.find('div', attrs={'class': 'wtr_dayTime'}).text
        status = soup.find('div', attrs={'class': 'wtr_caption'}).text
        temperature = soup.find('div', attrs={'class': 'wtr_currTemp'}).text
        print(region, daytime, status, temperature)
    return render(request, 'weatherapp/index.html')
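
Note that find() returns None when no element matches, so calling .text on the result raises AttributeError if the page has no weather card for the query or if the markup changes. Below is a minimal sketch of a guard for this. The helper text_or_default is a hypothetical name, and FakeTag is only a stand-in for a Beautiful Soup tag so that the snippet runs on its own.

```python
def text_or_default(element, default="N/A"):
    """Return the stripped text of a parsed element, or a default if it is missing."""
    if element is None:
        return default
    # get_text(strip=True) is the Beautiful Soup call for whitespace-trimmed text
    return element.get_text(strip=True)


class FakeTag:
    """Stand-in for a Beautiful Soup tag, used only to demonstrate the helper."""

    def __init__(self, text):
        self._text = text

    def get_text(self, strip=False):
        return self._text.strip() if strip else self._text


print(text_or_default(FakeTag("  Cloudy ")))  # Cloudy
print(text_or_default(None))                  # N/A
```

In the view, each lookup could then be wrapped, for example text_or_default(soup.find('div', attrs={'class': 'wtr_caption'})), so that a missing element shows a placeholder instead of an error page.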

Now let's try entering Plano again in the browser and see if it prints the above four values in our terminal.

image.png

I have entered Plano and hit Search. Now in the terminal you can see this:

Screenshot_1.png

So it has printed the region name, the time, the status and the temperature, but we want to show this in the browser itself, i.e. render it dynamically in the template. To do that, we create a context dictionary holding those 4 values as key-value pairs and pass it to the render function. Then in our template we can use an if condition, written in Django template syntax, to show this content. So our home view function in views.py will be:

def home(request):
    context = None
    if 'location' in request.GET:
        location = request.GET.get('location')
        html_data = get_html_content(location)
        soup = BeautifulSoup(html_data, "html.parser")
        region = soup.find('span', attrs={'class': 'wtr_foreGround'}).text
        daytime = soup.find('div', attrs={'class': 'wtr_dayTime'}).text
        status = soup.find('div', attrs={'class': 'wtr_caption'}).text
        temperature = soup.find('div', attrs={'class': 'wtr_currTemp'}).text
        context = {'region': region, 'daytime': daytime, 'status': status, 'temperature': temperature}
    return render(request, 'weatherapp/index.html', context)
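
With this view, every search sends a fresh request to Bing. Since automated requests can put load on the source website, as mentioned at the start, one option is to reuse a recent result for the same location. The following is only a sketch of that idea using the standard library. TTLCache and get_or_fetch are hypothetical names, the 600-second lifetime is an arbitrary choice, and fake_fetch stands in for get_html_content so that the snippet runs without network access.

```python
import time


class TTLCache:
    """Tiny in-memory cache whose entries expire after ttl_seconds (hypothetical helper)."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self.ttl_seconds = ttl_seconds
        self.clock = clock
        self._entries = {}  # key -> (expiry_time, value)

    def get_or_fetch(self, key, fetch):
        now = self.clock()
        entry = self._entries.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]  # cached value is still fresh
        value = fetch(key)   # cache miss or expired entry: fetch again
        self._entries[key] = (now + self.ttl_seconds, value)
        return value


calls = []


def fake_fetch(location):
    """Stand-in for get_html_content(); records each call instead of using the network."""
    calls.append(location)
    return f"<html>weather for {location}</html>"


cache = TTLCache(ttl_seconds=600)
cache.get_or_fetch("plano", fake_fetch)
cache.get_or_fetch("plano", fake_fetch)
print(len(calls))  # 1
```

This cache lives in the memory of a single process and is lost on restart. Django also ships its own cache framework, which may be a better fit for a real deployment.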

Go to your index.html file and let's render this data dynamically. We will show all the data just below our form.

{% if region %}
<div class="container col-md-5" style="margin-top: 60px;">
    <div class="card">
        <div class="card-header bg-dark text-white">
            Weather in {{region}}
        </div>
        <ul class="list-group list-group-flush">
            <li class="list-group-item">Daytime: {{daytime}}</li>
            <li class="list-group-item">Temperature: {{temperature}} °C</li>
            <li class="list-group-item">Status: {{status}}</li>
        </ul>
    </div>
</div>
{% endif %}

So our final index.html code is:

<!doctype html>
<html lang="en">
<head>
    <meta charset="utf-8">
    <meta name="viewport" content="width=device-width, initial-scale=1, shrink-to-fit=no">
    <!-- Assumed version: the original pinned a specific Bootstrap 4 release; @4 loads the latest 4.x -->
    <link rel="stylesheet" href="https://cdn.jsdelivr.net/npm/bootstrap@4/dist/css/bootstrap.min.css">
    <link rel="stylesheet" href="https://pro.fontawesome.com/releases/v5.10.0/css/all.css">
    <title>Weather</title>
    <style>
        body{
            background-color: #130f40;
            background-image: linear-gradient(315deg, #130f40 0%, #000000 74%);
        }
    </style>
</head>
<body>

<div class="container col-md-5 text-white" style="margin-top: 60px;">
    <form method="get" action="">
        <div class="card bg-dark">
            <div class="card-header">Find the Weather</div>
            <div class="card-body">
                <div class="form-group">
                    <div class="input-group input-group-lg">
                        <div class="input-group-prepend">
                            <span class="input-group-text"><i class="fa fa-location-arrow"></i></span>
                        </div>
                        <input type="text" name="location" class="form-control" placeholder="Enter Location">
                    </div>
                    <input type="submit" class="btn btn-outline-light mt-3" value="Search">
                </div>
            </div>
        </div>
    </form>
</div>

{% if region %}
<div class="container col-md-5" style="margin-top: 60px;">
    <div class="card">
        <div class="card-header bg-dark text-white">
            Weather in {{region}}
        </div>
        <ul class="list-group list-group-flush">
            <li class="list-group-item">Daytime: {{daytime}}</li>
            <li class="list-group-item">Temperature: {{temperature}} °C</li>
            <li class="list-group-item">Status: {{status}}</li>
        </ul>
    </div>
</div>
{% endif %}


<script src="https://code.jquery.com/jquery-3.5.1.slim.min.js"></script>
<!-- Assumed version: the original pinned a specific Bootstrap 4 release; @4 loads the latest 4.x -->
<script src="https://cdn.jsdelivr.net/npm/bootstrap@4/dist/js/bootstrap.bundle.min.js"></script>
</body>
</html>

Now let's check the final output. Let's enter Plano as the location in our form and hit Search.

image.png

After hitting Search, this will appear on the same page.

image.png

Let's see the weather for London, UK.

image.png

It's working successfully and showing us the weather data. You can scrape many other kinds of data using the same process.

I have provided the complete source code of this project here. No password is required for this one, as we haven't worked with models or the admin dashboard in this tutorial.


wow, that's a lot of effort to get this data, and with a minor change of the bing results format it breaks. Isn't there an API to get weather data from MS?

I think there is. They provide web search API. One should be able to get from that but I have only little knowledge on API.


Cool manual!

Nice post !