Extracting data from documents with Python is not only fun but also saves a ton of time. Python provides tools for automating such repetitive tasks, along with many libraries that let us interact with documents programmatically. I have multiple scripts that do just that: extract data from hundreds of documents, clean it, and present it in a more useful format. All of this can be automated and done with the click of a button. The alternative would be spending hours scanning through documents manually. Over time, things change. The data we need changes, the structure of the documents we use changes, and the goals change. This may require revisiting and updating the scripts, which becomes a bit more challenging if it has been a while since we wrote them. This was the case for me again this week.
I had a project to revisit some data extraction scripts because the structure of the documents they process has changed over time. While everything still worked as expected, tweaking the extraction and processing steps could improve the output. Python has many libraries that deal with PDF documents. pdfplumber is my favorite, and I have used it many times. One feature I hadn't experimented with yet was visual debugging. It is a very simple process, and using it saves a lot of time when writing the actual data extraction code. Sometimes when you extract data from PDFs, the results don’t match what you see on the page. For example, tables might look scrambled or text could be out of order. Visual debugging with pdfplumber lets you see how your code interprets the document so you can fix mistakes quickly.
If you don't have pdfplumber installed yet, run pip install pdfplumber first. Extracting text from a PDF document takes only a few lines of code, as shown below.
import pdfplumber

with pdfplumber.open("example.pdf") as pdf:
    first_page = pdf.pages[0]
    print(first_page.extract_text())
The code above gets all the text on the page. However, we may want text only from specific locations on the page. For this we can use the .crop(bounding_box, relative=False, strict=True) method. Calling it on a page returns a version of that page that includes only the items within the bounding box, which we provide as (x0, top, x1, bottom) coordinates. I just create a helper function like the one below to crop the areas I need. All we need to do is figure out our bounding box coordinates.
def get_rect_text(page, bounding_box):
    text = page.crop(bounding_box).extract_text().split('\n')
    return text
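One thing that can trip up a helper like this is a bounding box that reaches past the edge of the page: with strict=True (the default), .crop() rejects it with an error. Below is a minimal sketch of one way to guard against that. The names clamp_bbox and get_rect_lines are hypothetical and not part of pdfplumber; the sketch only assumes that the page exposes .bbox, .crop() and .extract_text() the way pdfplumber pages do.

```python
def clamp_bbox(bbox, page_bbox):
    """Shrink bbox so it lies inside page_bbox.

    Both boxes use the (x0, top, x1, bottom) order that .crop() expects.
    """
    x0, top, x1, bottom = bbox
    px0, ptop, px1, pbottom = page_bbox
    clamped = (max(x0, px0), max(top, ptop), min(x1, px1), min(bottom, pbottom))
    # A box that ends up with no area does not overlap the page at all.
    if clamped[0] >= clamped[2] or clamped[1] >= clamped[3]:
        raise ValueError(f"bounding box {bbox} does not overlap page {page_bbox}")
    return clamped


def get_rect_lines(page, bounding_box):
    """Return the text inside bounding_box as a list of lines."""
    safe_box = clamp_bbox(bounding_box, page.bbox)
    # extract_text() may give back nothing for an empty region.
    text = page.crop(safe_box).extract_text() or ""
    return text.split("\n") if text else []
```

For example, get_rect_lines(first_page, (60, 0, 700, 100)) on a page that is 612 points wide would crop (60, 0, 612, 100) instead of failing.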
We could guess approximately where x0, top, x1, and bottom are and play with the numbers until we get what we need. But this can introduce errors later, and trying different numbers is a very boring process. Alternatively, we can use the visual debugging features pdfplumber provides to see where things are. The simplest way is to draw horizontal and vertical lines, kind of creating a grid, which makes figuring out these numbers super simple. Plugging in these numbers, we can crop any area we need and repeat the same process for all the pages and documents as needed.
def pdf_draw_lines(filename):
    with pdfplumber.open(filename) as pdf:
        count = 1
        for page in pdf.pages:
            # Render the page as an image we can draw on.
            page_img = page.to_image(resolution=250)
            # Vertical guide lines at x = 60, 63, 110 and 113 (page coordinates).
            page_img.draw_line(((60,0), (60,800)), stroke='red', stroke_width=1)
            page_img.draw_line(((63,0), (63,800)), stroke='blue', stroke_width=1)
            page_img.draw_line(((110,0), (110,800)), stroke='red', stroke_width=1)
            page_img.draw_line(((113,0), (113,800)), stroke='blue', stroke_width=1)
            # Save one PNG per page.
            page_img.save(f'/location/doc{count}.png', format="PNG", quantize=True, colors=256, bits=8)
            count += 1
Above is a small function that draws lines on each page of a document and saves the pages locally as images. We can examine these images to get a better understanding of the document's structure and plan how to extract and use the data. Drawing horizontal and vertical lines is the simplest way to visually debug the documents. pdfplumber provides more interesting and powerful ways of accomplishing these tasks. Feel free to visit the pdfplumber documentation for more details.
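The hand-picked lines in the function above can be generalized into a full grid, so coordinates can be read straight off the saved image. This is only a sketch: grid_positions and draw_grid are hypothetical helper names, the 50-point spacing is an arbitrary choice, and the only pdfplumber calls it relies on are the to_image() and draw_line() calls already used above.

```python
def grid_positions(length, step):
    """Return evenly spaced positions 0, step, 2*step, ... up to length."""
    if step <= 0:
        raise ValueError("step must be positive")
    return list(range(0, int(length) + 1, step))


def draw_grid(page, step=50, resolution=150):
    """Return an image of the page with a grid drawn every `step` points."""
    img = page.to_image(resolution=resolution)
    # Vertical lines in red, one per x position.
    for x in grid_positions(page.width, step):
        img.draw_line(((x, 0), (x, page.height)), stroke="red", stroke_width=1)
    # Horizontal lines in blue, one per y position.
    for y in grid_positions(page.height, step):
        img.draw_line(((0, y), (page.width, y)), stroke="blue", stroke_width=1)
    return img
```

The returned image can then be saved the same way as before, for example draw_grid(pdf.pages[0]).save("grid.png"). Because the lines fall on multiples of the step, the third red line from the left sits at x = 100 when the step is 50.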
This didn't work for me right away. Initially I got errors complaining that I didn't have the image-related dependencies on my machine. Fixing this took more than just a pip install. The error message suggested what to install, and the installation took a while to complete. In the end everything worked except for the .show() method. I didn't need it, since I could just save the images and view them afterwards.
pdfplumber works great with other Python libraries, like pandas, for handling data. For example, if you extract a table from a PDF, you can turn it into a pandas DataFrame to clean or analyze the data more easily. Visual debugging with pdfplumber helps confirm the extraction is correct before you move on to the next steps.
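As a rough illustration of that hand-off, the sketch below turns an extracted table into a DataFrame. It assumes the table arrives as a list of rows (the shape page.extract_table() returns), that the first row is the header, and that empty cells come back as None. The function names table_to_records and page_table_to_dataframe are hypothetical.

```python
def table_to_records(table):
    """Convert a table (list of rows) into a list of dicts keyed by the header row.

    Assumes the first row holds the column names. Empty cells (None) become "".
    """
    if not table:
        return []
    header = [(cell or "").strip() for cell in table[0]]
    records = []
    for row in table[1:]:
        cells = [(cell or "").strip() for cell in row]
        records.append(dict(zip(header, cells)))
    return records


def page_table_to_dataframe(page):
    """Extract the first table found on a page and return it as a DataFrame."""
    # Imported here so table_to_records can be used without pandas installed.
    import pandas as pd

    return pd.DataFrame(table_to_records(page.extract_table()))
```

Keeping the cleaning step separate from the pandas step makes it easy to inspect the cleaned rows on their own before any analysis.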
Pdfplumber is a simple yet powerful tool for working with PDFs. It’s especially useful for beginners because it gives you visual feedback, making it easier to see what’s happening and fix issues. Whether you’re working with text, tables, or images, pdfplumber helps make the process smoother and more reliable.
Yuhh, it really looks like a universal tool. Wish I had such a library while I was writing my graduation work at university years ago; it could have saved me a lot of nerves and time...
Even copying data from a PDF and pasting it into Word is a mess.
Thank you, this is quite informative.
I didn't know about pdfplumber. It seems pretty impressive in terms of time and efficiency. I've used pdfbinder before, which merges PDF files.
If PDFplumber saves time and produces more effective results when extracting data from PDFs then it's the way to go. Automating repetitive tasks sounds like a fine idea. Maybe I'll try PDFplumber when I have such PDF work to do. Thanks for this useful info. Have a great day.
Interesting. I will have to create a script to extract data from videos. Kinda pushing it off :(
Is this free? I only use a PDF converter (PDF to Word, Excel).
I know how to use PDFs. But this is too complicated, although it looks useful!
I only do Java, but Python seems like a very dynamic and pretty modern language. Even Stable Diffusion runs on Python, and you show this, which is totally different... very adaptive.
I have some religious manuscripts I have written over the years (and I'm still writing more). Maybe I should try them with this Python tool and see how it works for me. Thank you for sharing.
Working with PDF editing is always annoying, that's a good tool!
I don't know anything about coding, but I realized that this can be done using the Python coding language.
Wow
I never knew PDFs could be edited
I’ve tried it but didn’t work for me
Wow this is so amazing 👏
I started self learning Python towards what I want to do in school which is Artificial Intelligence, but I just feel really stuck right now. It's probably because I learned the wrong way lol 😅
But hopefully I'll get back on track.