Reading & Writing to PDFs
Motivation for the Post
PDF extraction is admittedly a tough engineering task. I know people who founded startups offering PDF extraction services and failed. Hey, it’s not their fault. Even, Amazon failed to perfect it through its product Textract.
There are some beautiful libraries out there trying to perfect the process as much as possible. However, I didn’t find a tool that could help me end-to-end in preserving the structure i.e headings, annotations and writing back to the PDF. I found an ensemble of tools each better at individual tasks but stitching them together is a tough task. After dedicating good amount of time, I was able to zero-down all my requirements to one tool albeit it requires some coding from you.
So, this post helps you understand how PDF works and what you can/cannot do with it. Note: I’ll be using PyMuPDF in this post.
Understanding PDF
What makes PDF parsing (Portable Document Format) so difficult is its way of storing the data. For instance, do you (90s kids especially) remember cutting celebrities’ photos from newspapers and pasting them on a page/chart? PDF is inspired from the same methodology. Every word/character is just pasted on a blank page w.r.t coordinate system. So unlike your HTML/XML format, you can’t simply find a table by looking at tags because there are no tags. This problem is also reflected & discussed in the recent efforts of Camelot and Tabula.
Coming back to the question of how storage is done, it is coordinates. Yes, we’ll get back to why that’s the case in just a minute. Also, note that PDF maintains two layers - Data Layer and Metadata layer for storing different types of data.
Let’s understand it better using this PDF.
For brevity, narrow down our understanding of Data Layer to Words and Metadata to Annotations/Highlights. Now, I have highlighted some portion of the PDF to obtain both layers with some code.
1. Reading
1.1 Getting words’ info
import fitz
pdf_filename = "sample-highlight.pdf"
page_number = 0
doc = fitz.open(f'{pdf_filename}') # Read the file as doc
page1 = doc[page_numer] # Get the page object
page1_info = page1.getText("words") # Get Text from a particular page as a list of words.
print(page1_info)
.getText("words")
gives word-level information i.e [rectangle/bounding-box of the word, word, paragraph #, line #, position in line]
[
(x1, y1, x2, y2, word, paragraph #, line #, word-position)
(72.02, 92.64, 90.91, 106.38, 'This', 1, 0, 0),
(93.69, 92.64, 100.91, 106.38, 'is', 1, 0, 1),
(103.69, 92.64, 109.25, 106.38, 'a', 1, 0, 2),
(112.03, 92.64, 135.36, 106.38, 'small', 1, 0, 3),
(138.14, 92.64, 201.50, 106.38, 'demonstration', 1, 0, 4),
(204.28, 92.64, 220.96, 106.38, '.pdf', 1, 0, 5),
(223.74, 92.64, 236.52, 106.38, 'file', 1, 0, 6),
(239.30, 92.64, 242.63, 106.38, '-', 1, 0, 7),
...
]
1.2. Getting Highlights
.annots()
gives information about each annotation i.e [Type of annotation - Highlight/Text, Rectangle of the highlight]
.
for annot in doc[0].annots():
print(annot)
print(annot.info)
print(annot.rect)
Note that this will only give you the rectangle coordinates of the annotation/highlight but not the word(s) inside it because of the storage structure.
'Highlight' annotation on page 0 of sample-highlight.pdf
{'content': '', 'name': '', 'title': 'mano', 'creationDate': '', 'modDate': "D:20200830183034+05'30", 'subject': '', 'id': ''}
Rect(135.17, 94.25, 204.82, 107.75)
Glimpse of how words & annotations are stored inside PDF.
To get the words inside highlighted, we need to map rectangle-coordinates of either sides. In our case, rectangle coordinates for
word "demonstration" is 138.14, 92.64, 201.50, 106.38
highlighted area is 135.17, 94.25, 204.82, 107.75
So, one can write a simple script to find which words coincide-with/lie-inside the annotation/highlight.
2. Writing to PDFs
Here comes the most important question, how do you write to PDFs? Again, its using coordinates. Let’s consider 2 scenarios here,
- Masking out a word
- Replacing the word
In our case, let’s use the word small
in the sentence,
This is a small demonstration .pdf file -
As shown in 1.1 section, we can easily get the coordinates of the word small
=> [112.03, 92.64, 135.36, 106.38]
So, step-1 i.e masking out
is done this way,
coords = (112.03999328613281, 92.64202880859375, \
135.3699951171875, 106.38202667236328)
annot1 = page1.addRedactAnnot(coords, text = " ")
annot1.setColors(stroke=fitz.utils.getColor('black'),
fill=fitz.utils.getColor('black'))
page1.apply_redactions()
doc.save('output.pdf')
Open the file output.pdf
to see the changes:
Step-2 i.e overwriting
can be done this way,
coords = (112.03999328613281, 92.64202880859375, \
135.3699951171875, 106.38202667236328)
page1.addFreetextAnnot(coords, 'XXX')
doc.save('output2.pdf')
That’s it from my side. Hope you find this post useful.
Thanks,
Murali Manohar.
Leave a comment