Extracting PDF Data With Pdfplumber - Lines, Rectangles, And Crop

in #python • 2 years ago

[Image: pdfplumber_data.png]

In the past I have written about how useful the pdfplumber library is for extracting data from pdf files. Its true power becomes evident when dealing with multiple pdf files that have hundreds of pages. When you know what you are looking for, don't want to go through hundreds of pages manually, and have to deal with such files on a daily basis, the best thing to do is to automate. That's what python is great at: automating. Pdfplumber, as the name suggests, works with pdf files and makes it easy to extract data. It works best with machine-generated pdf files rather than scanned pdf files.

When extracting data from pdf files we can utilize multiple approaches. If we just need some text, we can start with the simple .extract_text() method. However, pdfplumber lets us extract all the objects in the document, such as images, lines, rectangles, curves, and chars, or we can get all of these at once with .objects. Sometimes machine-generated pdf files use lines and rectangles to separate the information on the page. This can help us identify the type of text within those lines or rectangles. I recently came across some financial pdf data formatted in such a way. Using the locations of these lines and rectangles, we can select the text in that area with pdfplumber's .crop() method.
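To get a quick feel for what a page actually contains before choosing an approach, we can peek at these object lists directly. Here is a minimal sketch, using the same example file as the rest of this post:

import pdfplumber

with pdfplumber.open('/Users/librarian/Desktop/document.pdf') as pdf:
    page1 = pdf.pages[0]
    # .objects is a dict of lists keyed by object type ('char', 'line', 'rect', ...)
    for object_type, objects in page1.objects.items():
        print(object_type, len(objects))
    # the same lists are also available as individual properties
    print(len(page1.lines), len(page1.rects), len(page1.images), len(page1.chars))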

First, let's take a look at basic text extraction with pdfplumber.

import pdfplumber

with pdfplumber.open('/Users/librarian/Desktop/document.pdf') as pdf:
    page1 = pdf.pages[0]
    page1_text = page1.extract_text().split('\n')
    for text in page1_text:
        print(text)

We open the file with pdfplumber; .pages returns a list of pages in the pdf and all the data within those pages. Since it is a list, we can access the pages one by one. In the example above we are just looking at page one for now. Using the .extract_text() method, we get all the text of page one as one long string. If we want to separate the text line by line, we use .split('\n'). Now that we have a list of lines of text from page one, we can iterate through the list and display them.

In most cases, this might be all you need. But sometimes you may want to extract these lines of text and retain the layout formatting. To do this, we add the layout=True parameter to the .extract_text() method, like this: page1.extract_text(layout=True).split('\n'). Be careful when using layout=True, because this feature is experimental and not stable yet. It might work in most cases, but sometimes it may return unexpected results.
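For completeness, here is the same basic example with the layout flag turned on. Treat it as a sketch, since the feature itself may change between versions:

import pdfplumber

with pdfplumber.open('/Users/librarian/Desktop/document.pdf') as pdf:
    page1 = pdf.pages[0]
    # layout=True tries to preserve the horizontal positioning of the text
    for text in page1.extract_text(layout=True).split('\n'):
        print(text)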

Now that we know how to extract the text from the page, we can apply some string manipulation and regex to get only the data that we actually need. If we know the exact area on the page where our data is located, we can use the .crop() method and extract only that data using the same extraction methods described above.
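As a hypothetical example of that string manipulation step, the sketch below pulls a "Total" amount out of the extracted lines with a regular expression. Both the label and the number format are assumptions about the document, not something pdfplumber provides:

import re
import pdfplumber

# assumed pattern: a line containing something like 'Total 1,234.56'
total_pattern = re.compile(r'Total\s+([\d,]+\.\d{2})')

with pdfplumber.open('/Users/librarian/Desktop/document.pdf') as pdf:
    page1 = pdf.pages[0]
    for text in page1.extract_text().split('\n'):
        match = total_pattern.search(text)
        if match:
            print(match.group(1))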

The pdfplumber.Page class has properties like .page_number, .width, and .height. We can use the width and height of the page to determine which area we are going to crop. Let's take a look at a code example using .crop().

import pdfplumber

with pdfplumber.open('/Users/librarian/Desktop/document.pdf') as pdf:
    page1 = pdf.pages[0]
    bounding_box = (200, 300, 400, 450)  # (x0, top, x1, bottom), measured from the top-left of the page
    crop_area = page1.crop(bounding_box)
    crop_text = crop_area.extract_text().split('\n')
    for text in crop_text:
        print(text)

Once we have our page instance, we use the .crop(bounding_box) method, and the result is still a page, but one that only covers the area defined by bounding_box. Think of it as a piece of the page; it is still a page, and we can apply other methods like .extract_text() to this piece of a page.
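Because the bounding box is defined in page coordinates, the .width and .height properties mentioned earlier are handy for building it relative to the page size instead of hard-coding numbers. A minimal sketch that crops the bottom half of page one:

import pdfplumber

with pdfplumber.open('/Users/librarian/Desktop/document.pdf') as pdf:
    page1 = pdf.pages[0]
    # (x0, top, x1, bottom): full width, lower half of the page
    bottom_half = (0, page1.height / 2, page1.width, page1.height)
    crop_area = page1.crop(bottom_half)
    for text in crop_area.extract_text().split('\n'):
        print(text)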

Cropping an area like this can be very useful if you know exactly where your text is located. The feature becomes even more useful when the pdf documents we are working with use lines and rectangles for formatting and separating information. We can extract all the lines and rectangles on the page and get their locations. Using these locations, we can easily identify which area of the page we need to crop. To get the lines on the page we use the .lines property, and to get the rectangles we use the .rects property. To see how many lines we have on the page and what properties a line has, we can run the following code.

import pdfplumber
import pprint

with pdfplumber.open('/Users/librarian/Desktop/document.pdf') as pdf:
    page1 = pdf.pages[0]
    lines = page1.lines
    print(len(lines))
    pprint.pprint(lines[0])

The result shows the properties and values a line object has. Some of them will be useful; others we can ignore.

{'bottom': 130.64999999999998,
 'doctop': 130.64999999999998,
 'evenodd': False,
 'fill': False,
 'height': 0.0,
 'linewidth': 1,
 'non_stroking_color': [0.859],
 'object_type': 'line',
 'page_number': 1,
 'pts': [(18.0, 661.35), (590.25, 661.35)],
 'stroke': True,
 'stroking_color': (0, 0, 0),
 'top': 130.64999999999998,
 'width': 572.25,
 'x0': 18.0,
 'x1': 590.25,
 'y0': 661.35,
 'y1': 661.35}

Which properties to use will depend on the project. In my case I would be using top, bottom, x0, and x1. Although the top and bottom values are the same in this example because the line width is only 1, I would still get both values just in case the line width changes in the future.

We get the rectangles on the page the same way as we did the lines; we just change the property to .rects. When using rects, the top and bottom values will be different for obvious reasons. Now that we have the coordinates of the area we need to crop and extract text from, we just plug the values we get from .lines and .rects into our bounding_box for the .crop() method.
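To make that concrete, here is a sketch that crops the text between the first two horizontal lines on a page. It assumes the page really does have at least two lines at different heights; which pair of lines (or rectangle edges) to use will depend on your own documents:

import pdfplumber

with pdfplumber.open('/Users/librarian/Desktop/document.pdf') as pdf:
    page1 = pdf.pages[0]
    # sort the lines from the top of the page to the bottom
    lines = sorted(page1.lines, key=lambda line: line['top'])
    first, second = lines[0], lines[1]
    # (x0, top, x1, bottom): the horizontal extent of the first line,
    # and the vertical span from the first line down to the second
    bounding_box = (first['x0'], first['bottom'], first['x1'], second['top'])
    crop_area = page1.crop(bounding_box)
    # guard against a crop region with no text in it
    crop_text = (crop_area.extract_text() or '').split('\n')
    for text in crop_text:
        print(text)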

I just started using these features of pdfplumber today, and so far everything is working great; I have not seen any issues yet. If you work with many pdf files to extract data from, and those documents have repeating lines and rectangles that separate information, you too may find pdfplumber useful in automating these tasks. Let me know your thoughts and experiences with text extraction from pdf documents in the comments.

Pdfplumber has great documentation. Feel free to visit the GitHub page: https://github.com/jsvine/pdfplumber


