Creating .epub files in Python - Part 2: Formatting the data

in #utopian-io, 7 years ago (edited)

What Will I Learn?

In this tutorial series the key aspects will be:

  • How to get HTML documents from the internet
  • How to get the information you want from the HTML document
  • How to create an ebook (.epub) without the help of external programs or libraries
  • The basic structure of .epub files
  • How to debug .epub files

In this part of the tutorial specifically you'll learn:

  • How to filter data out of an HTML document with beautifulsoup4
  • How to open, create and delete files with Python
  • How a .epub file is structured

Requirements

The requirements are the same as in part 1:

  • Python version 3.6.x
  • The libraries: requests and beautifulsoup4

Difficulty

This part of the tutorial series is still considered basic. The last part will be intermediate.

  • Basic

Tutorial Contents

In the first part of the series, we got the raw HTML code of the website where the content for our .epub file is located. Right now you should have 2 files that look like this.

main.py:

main.py

functions.py:

functions.py

Since we now are able to download the HTML files of the chapters we want to work with, the next step is to define a function that will filter out all the unnecessary data of the HTML file. For this, we will work with the beautifulsoup4 library.

So before we can work with BeautifulSoup we have to import the library into our functions.py file. We do this by adding:

from bs4 import BeautifulSoup

Right below or above:

import functions

The next step is to define a function that takes two input parameters, "file_name_in" and "file_name_out". "file_name_in" is the name of the HTML file we downloaded using the download function and "file_name_out" is the file name of our "cleaned" HTML file.
In this example, I will name the function clean(), so it looks like this:

def clean(file_name_in, file_name_out):

Before we can work with the HTML file we have to open it. If you recall the first part of this tutorial, you know we do this with the open() function. It's essentially the same as in the first part, but this time we only need read mode ("r") instead of write mode.

raw = open(file_name_in, "r", encoding="utf8")
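As a side note, the same read can be sketched with a with block, which closes the file automatically. This is only an illustration and not the code used in this series: the helper name read_raw and the file name demo_raw.html are made up.

```python
import os
import tempfile


def read_raw(file_name_in):
    """Return the whole content of a UTF-8 text file (hypothetical helper)."""
    # "r" opens the file read-only; the with block closes it automatically,
    # even if an exception is raised while reading.
    with open(file_name_in, "r", encoding="utf8") as raw:
        return raw.read()


# Demo with a throwaway file so the sketch runs on its own.
demo_dir = tempfile.mkdtemp()
demo_path = os.path.join(demo_dir, "demo_raw.html")
with open(demo_path, "w", encoding="utf8") as handle:
    handle.write("<p>Hello chapter</p>")

content = read_raw(demo_path)
print(content)
```

With this pattern there is no need to remember a separate close() call.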

Now that we've opened the file, the next step is to create a BeautifulSoup object; without it, we can't work with BeautifulSoup.

soup = BeautifulSoup(raw, "html.parser")

As you can see, we use BeautifulSoup() to create the BeautifulSoup object. We pass it two parameters: the file (raw) and the name of the parser. In this tutorial, we're going to use "html.parser", the built-in parser that comes with Python.

Ok, we have created the BeautifulSoup object. Now we have to know where the chapter content is located in the HTML file. To find out, we open a random chapter of "Emperor's Domination", for example the second chapter.
Looking at the source code, we see that the chapter content is located within a div element that has the itemprop attribute "articleBody".

itemprop="articleBody"

This is an important piece of information! We use it to tell BeautifulSoup to filter out every element except this one and its children.
Since there is only one element with the itemprop attribute "articleBody", we use .find(), which returns the first match:

soup = soup.find(itemprop="articleBody")

For formatting purposes we only want the text, so let's create a new variable called "text", that contains only the text.

text = soup.text
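To see these two steps in isolation, here is a small sketch that runs on a made-up HTML string instead of a downloaded chapter. The sample markup is an assumption for illustration only; the real page will look different. It also checks for None, because find() returns None when no element matches.

```python
from bs4 import BeautifulSoup

# Made-up stand-in for a downloaded chapter page.
sample_html = """
<html><body>
<div class="sidebar">Ads and navigation</div>
<div itemprop="articleBody">
<p>Chapter 2: A Sample Title</p>
<p>First paragraph of the story.</p>
</div>
</body></html>
"""

soup = BeautifulSoup(sample_html, "html.parser")

# Keyword arguments to find() are matched against tag attributes.
article = soup.find(itemprop="articleBody")

# find() returns None when nothing matches, so check before using the result.
if article is None:
    raise ValueError("no element with itemprop='articleBody' found")

text = article.text
print(text)
```

The text of the sidebar div is gone, and only the text inside the articleBody element remains.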

If you paid close attention to the source code, you'll have noticed that the div element we extracted also contains the text "Previous Chapter" and "Next Chapter". Normally this is the link text for the previous and next chapter, but since we're creating a .epub file and have already dropped the links themselves, we want to delete it. To do this we use the replace() method.

text = text.replace("Previous Chapter", "").replace("Next Chapter", "")

This replaces the text "Previous Chapter" and "Next Chapter" with nothing ("").

Wuxiaworld.com runs on WordPress and gives the translators and editors a lot of freedom in how they format their text, link structure, and even HTML code. This sometimes results in unexpected behavior of the program or odd formatting of the scraped data. To prevent wrong formatting we strip the unnecessary whitespace at the start and end of the text with:

text = text.lstrip().rstrip()
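The following sketch runs the last two steps on a made-up string and shows that a single strip() call gives the same result as lstrip() followed by rstrip(). The sample text is an assumption for illustration.

```python
# Made-up sample of scraped text with link labels and stray whitespace.
text = "\n\n  Previous Chapter\nChapter 2: A Sample Title\nSome story text.\nNext Chapter  \n\n"

text = text.replace("Previous Chapter", "").replace("Next Chapter", "")

# strip() with no arguments removes leading and trailing whitespace,
# which is what lstrip() followed by rstrip() does.
stripped_both = text.lstrip().rstrip()
stripped_once = text.strip()

print(repr(stripped_once))
```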

This makes it easy to get the chapter title. We need the chapter title so that we can later add it to the table of contents of the book. The table of contents is stored in a separate file (more on that later), so we will just save the title in a variable for now.
To get the chapter title we exploit the fact that it's always located in the first line of the text.

chapter_title = text.split('\n', 1)[0]

We use the split() method to split the text at the first line break. In Python (and on Unix systems) a line break is written as "\n". Windows uses "\r\n" instead, but because Python translates Windows line breaks to "\n" by default when it reads a file in text mode, splitting on "\n" works on both systems.
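Here is a small sketch of the title extraction on a made-up string. It also shows partition(), an alternative that hands back the title and the remaining text in one call; that variant is not used in this series.

```python
text = "Chapter 2: A Sample Title\nFirst paragraph.\nSecond paragraph."

# maxsplit=1 stops after the first line break, so the result has at most
# two items: the first line and everything after it.
parts = text.split("\n", 1)
chapter_title = parts[0]

# partition() is an alternative that always returns three items:
# the part before the separator, the separator itself, and the rest.
title_again, _, body = text.partition("\n")

print(chapter_title)
print(body)
```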

Since we have now successfully extracted the chapter title, we'll remove it from the text so that we can give it a different HTML tag than the rest of the text. Then we strip the whitespace again, in case we messed something up:

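# The count of 1 removes only the first occurrence, which is the title line
text = text.replace(chapter_title, "", 1)
text = text.lstrip().rstrip()
# Keep only the part before the first "\n\r" sequence, if there is one
text = text.split("\n\r")[0]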
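One thing to watch out for when removing the title: replace() without a count removes every occurrence of the string, not just the first one. The sketch below uses a made-up chapter in which the title is repeated inside the story to show the difference the optional third argument makes.

```python
chapter_title = "Chapter 2: The Return"
text = "Chapter 2: The Return\nHe called it Chapter 2: The Return of his life."

# Without a count, replace() removes every occurrence, including the one
# inside the story text.
all_removed = text.replace(chapter_title, "").strip()

# A count of 1 removes only the first occurrence, i.e. the title line.
first_removed = text.replace(chapter_title, "", 1).strip()

print(all_removed)
print(first_removed)
```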

Right now we have all the text we need, but we can't write our file yet. First, we have to add the necessary HTML tags, using the replace() method. This is especially easy since we only need <p> and </p> tags. The easiest way to add them is to replace every line break (\n) with a closing tag, a line break and an opening tag (</p>\n<p>).

text = text.replace("\n", "</p>\n<p>")
raw.close()

Adding the tags in this way will result in a missing <p> tag in the beginning and a missing </p> tag at the end. We have to keep that in mind for later!
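If you would rather not keep track of the missing tags, one possible alternative (not the approach used in this series) is to wrap each non-empty line in its own pair of tags. The sketch below also runs the text through html.escape() from the standard library, on the assumption that the scraped text may contain characters such as & or <. If you used this variant, the extra opening and closing tags would not be needed.

```python
import html

text = "First paragraph.\n\nSecond paragraph with <angle> & ampersand."

# Wrap every non-empty line in its own <p>...</p> pair. html.escape()
# turns &, < and > into entities so the result stays valid XHTML.
paragraphs = [
    "<p>" + html.escape(line.strip()) + "</p>"
    for line in text.split("\n")
    if line.strip()
]
body = "\n".join(paragraphs)

print(body)
```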

Great, now we are ready to write the text to the final HTML file!

Keep in mind that the following code is still part of the clean function!
To write a file we first have to create one, as we did in part 1:

file = open(file_name_out, "w", encoding="utf8")

In Python, writing is done with the .write() method of file objects. In the very first line, we have to write the <html> tag with the xmlns attribute. This is required because the content documents of an epub file are XHTML.

file.write('<html xmlns="http://www.w3.org/1999/xhtml">')

Remember that we saved the chapter title in the variable chapter_title? Now is the time to put it in the HTML document. We're going to put it inside the title tag because it's logical and easy to extract later on, although I do not know of an e-reader that uses the title tag for anything.

file.write("\n<head>")
file.write("\n<title>" + chapter_title + "</title>")
file.write("\n</head>")

After the head comes the body, the next thing we'll write. We start things off by writing the chapter title. I'll use the strong tag since I don't like a huge font on my Kindle, but you could use a different tag. We also use this line to write the missing <p> tag.

file.write("\n<body>")
file.write("\n<strong>" + chapter_title + "</strong>" + "\n<p>")

Next up is the text stored in the text variable.

file.write(text)

Now we have to close all the open tags (<p>, <body> and <html>) and close the file:

file.write("</p>")
file.write("\n</body>")
file.write("\n</html>")
file.close()
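For comparison, here is a compact sketch of the same writing step as a standalone helper. The function name write_chapter and the sample values are made up, and it differs from the code above in two ways: it uses a with block, and it escapes the title with html.escape() in case it contains characters like &. It expects body_html to already contain complete <p>...</p> paragraphs. At the end, the file is parsed with the XML parser from the standard library as a quick check that it is well-formed.

```python
import html
import os
import tempfile
import xml.etree.ElementTree as ET


def write_chapter(file_name_out, chapter_title, body_html):
    """Write one XHTML chapter file (hypothetical helper, not part of the tutorial)."""
    safe_title = html.escape(chapter_title)
    lines = [
        '<html xmlns="http://www.w3.org/1999/xhtml">',
        "<head>",
        "<title>" + safe_title + "</title>",
        "</head>",
        "<body>",
        "<strong>" + safe_title + "</strong>",
        body_html,
        "</body>",
        "</html>",
    ]
    # The with block closes the file even if write() raises.
    with open(file_name_out, "w", encoding="utf8") as file:
        file.write("\n".join(lines))


out_path = os.path.join(tempfile.mkdtemp(), "chapter-2.xhtml")
write_chapter(out_path, "Chapter 2: Tom & Jerry", "<p>First paragraph.</p>")

# Parsing the result with an XML parser is a quick well-formedness check.
root = ET.parse(out_path).getroot()
print(root.tag)
```

If the parser raises an error here, an e-reader is likely to have trouble with the file as well.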

Now we have our formatted and cleaned HTML file! The only thing left to do is to delete the raw file. To do this we have to import os. Simply add:

import os

to the other imports of functions.py and add

os.remove(file_name_in)

as the last line of the clean function.
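Keep in mind that os.remove() raises FileNotFoundError when the file does not exist. The sketch below uses a throwaway temporary file as a stand-in for the raw download and guards the call with os.path.exists(); whether you need such a guard depends on how you call clean().

```python
import os
import tempfile

# Create a throwaway file that stands in for the raw download.
fd, raw_path = tempfile.mkstemp(suffix=".html")
os.close(fd)

existed_before = os.path.exists(raw_path)

# os.remove() raises FileNotFoundError if the file is already gone,
# so guard the call if the function might run twice on the same name.
if os.path.exists(raw_path):
    os.remove(raw_path)

exists_after = os.path.exists(raw_path)
print(existed_before, exists_after)
```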

Save the file and open main.py again.
Before the HTML file actually gets cleaned we have to call our clean() function inside of main.py.
We'll call the clean() function inside the last for loop (the one that also calls the download() function). Before we can call clean(), we have to add a counter that starts at starting_chapter, so we can add the chapter number to the names of our HTML files.
Just add:

name_counter = int(starting_chapter)

Right above the for loop. We will increase the value of the variable by 1 at the end of each loop iteration. To call the clean() function, just write the following lines below the call to the download() function:

functions.clean(str(x) + ".html", info["ChapterName"] + str(name_counter) + ".xhtml")
name_counter += 1

As you can see we pass (str(x) + ".html") as the "file_name_in" parameter and (info["ChapterName"] + str(name_counter) + ".xhtml") as the "file_name_out" parameter.
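To make the naming scheme concrete, here is a sketch that only builds the file names and does not call download() or clean(). All values are made up: the "ChapterName" prefix, the chapter range, and the variable ending_chapter are assumptions, and the loop in your main.py may iterate over something different.

```python
# All values below are made up for illustration.
info = {"ChapterName": "emperor-chapter-"}
starting_chapter = "2"
ending_chapter = "4"

name_counter = int(starting_chapter)
planned = []

for x in range(int(starting_chapter), int(ending_chapter) + 1):
    file_name_in = str(x) + ".html"
    file_name_out = info["ChapterName"] + str(name_counter) + ".xhtml"
    # In main.py this is where functions.clean(file_name_in, file_name_out)
    # would be called; here the names are only collected.
    planned.append((file_name_in, file_name_out))
    name_counter += 1

print(planned)
```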

And that's it for this part of the tutorial series!

main.py should now look like this:
main.py

and functions.py should look like this:

functions.py

This part of the tutorial was way longer than I expected, so I won't start explaining how a .epub file is structured till the next part. Should I go more into detail about that topic and make one part only about the structure of the epub files? Write it in the comments!

Curriculum

Here is the link to the first part of the tutorial series.



Posted on Utopian.io - Rewarding Open Source Contributors


Thank you for the contribution. It has been approved.

You can contact us on Discord.
[utopian-moderator]

@bloodviolet, No matter approved or not, I upvote and support you.
