
Saturday, November 03, 2007

Extracting metadata from PDFs

I have been assigned to build a database from a set of PDF files. Typical fields are title, location, date, the actors mentioned in each PDF, etc.
Plan A:
The first idea that came to my mind was to build a lexical analyser. I would parse the PDF file word by word and check each token: if it is "Held", I should expect a location description right after it; if I find "Present", I should expect a list of actors' names.

To parse the text of the PDF files I used PDFBox. Surprisingly, it didn't work as I expected:
  1. The PDF parser can split a single word into multiple tokens, so my code would receive "hel" and then "d" rather than "held". To make this work I would have to build a state machine that keeps track of the token history, which I found to be an infeasible solution.
  2. Some of the documents are in French. Applying an English keyword analyser to a French document makes no sense: the trigger words are completely different, and a string like "le" (for those not familiar with French, it is the equivalent of "the") is far too frequent to be useful.
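
To make plan A concrete, here is a rough sketch of the keyword scan I had in mind, assuming PDFBox's PDDocument/PDFTextStripper text extraction (the exact package names vary between PDFBox releases) and assuming the extracted text can simply be split on whitespace:

import java.io.File;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

public class KeywordScan {
    public static void main(String[] args) throws Exception {
        // Extract the full text of the PDF with PDFBox.
        try (PDDocument doc = PDDocument.load(new File(args[0]))) {
            String[] words = new PDFTextStripper().getText(doc).split("\\s+");

            // Walk the text word by word and look for trigger keywords.
            for (int i = 0; i < words.length; i++) {
                if (words[i].equalsIgnoreCase("Held")) {
                    // Expect a location description right after "Held".
                    System.out.println("Possible location: " + window(words, i + 1, 6));
                } else if (words[i].equalsIgnoreCase("Present")) {
                    // Expect a list of actors' names right after "Present".
                    System.out.println("Possible actors: " + window(words, i + 1, 10));
                }
            }
        }
    }

    // Grab up to n words starting at position start (a crude context window).
    private static String window(String[] words, int start, int n) {
        StringBuilder sb = new StringBuilder();
        for (int i = start; i < words.length && i < start + n; i++) {
            sb.append(words[i]).append(' ');
        }
        return sb.toString().trim();
    }
}

This is exactly the kind of code that falls apart when "Held" arrives as "hel" + "d", or when the document is in French.
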
Plan B:
Take the positions of the text into account: using the coordinates of each piece of text, I can build a set of blocks, isolate the different logical blocks, and then separate their contents.
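
A minimal sketch of this idea, again with PDFBox: recent releases let you subclass PDFTextStripper and receive a TextPosition for every text fragment, so the fragments can be bucketed into rough bands by their Y coordinate (the 50-unit band size here is just an illustrative guess):

import java.io.File;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;
import org.apache.pdfbox.text.TextPosition;

// Groups extracted text fragments into rough blocks based on their vertical position
// (single page assumed; a real version would also key on the page number).
public class BlockStripper extends PDFTextStripper {
    // One entry per vertical band on the page, keyed by band index from top to bottom.
    private final Map<Integer, StringBuilder> blocks = new TreeMap<>();

    public BlockStripper() throws Exception {
        super();
    }

    @Override
    protected void writeString(String text, List<TextPosition> positions) {
        if (positions.isEmpty()) return;
        // Bucket the fragment by its Y coordinate; 50 units per band is an arbitrary choice.
        int band = Math.round(positions.get(0).getYDirAdj()) / 50;
        blocks.computeIfAbsent(band, k -> new StringBuilder()).append(text).append(' ');
    }

    public static void main(String[] args) throws Exception {
        try (PDDocument doc = PDDocument.load(new File(args[0]))) {
            BlockStripper stripper = new BlockStripper();
            stripper.getText(doc); // triggers writeString() for every fragment
            stripper.blocks.forEach((band, content) ->
                    System.out.println("Block " + band + ": " + content));
        }
    }
}

The grouping, of course, is only as good as the coordinates the OCR wrote into the PDF.
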
This didn't work :( The problem this time was with the OCR that generated these PDFs: it is not 100% accurate with text positioning, so the input to my code is buggy. As a consequence, the OCR merges two different columns, or splits one column into several.

Plan C:
Work on the TIFF files directly and ignore the buggy PDFs.
I read the original TIFF file and scan it pixel by pixel to split it into sub-TIFFs. This way I control the blocks myself instead of letting the OCR decide them for me.
After identifying the basic blocks in the TIFF file, I call an OCR SDK to turn the sub-TIFFs into text, which finally gives me the fields for the database.
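
A rough sketch of the splitting step, assuming the pages can be read through Java's ImageIO (TIFF support is built into Java 9+; older setups need an extra imaging plugin) and that blocks are separated by runs of nearly blank rows. The thresholds are illustrative guesses, and the OCR call itself depends on whichever SDK is used, so it is left out:

import java.awt.image.BufferedImage;
import java.io.File;
import java.util.ArrayList;
import java.util.List;
import javax.imageio.ImageIO;

public class TiffSplitter {
    public static void main(String[] args) throws Exception {
        // Read the scanned page (requires a TIFF-capable ImageIO).
        BufferedImage page = ImageIO.read(new File(args[0]));

        // Mark each row that is almost entirely white as a gap between blocks.
        boolean[] blankRow = new boolean[page.getHeight()];
        for (int y = 0; y < page.getHeight(); y++) {
            int dark = 0;
            for (int x = 0; x < page.getWidth(); x++) {
                int rgb = page.getRGB(x, y);
                int brightness = ((rgb >> 16) & 0xFF) + ((rgb >> 8) & 0xFF) + (rgb & 0xFF);
                if (brightness < 3 * 128) dark++;            // crude darkness test
            }
            blankRow[y] = dark < page.getWidth() / 100;      // under 1% dark pixels
        }

        // Cut the page into horizontal blocks separated by blank rows.
        List<BufferedImage> blocks = new ArrayList<>();
        int top = -1;
        for (int y = 0; y < blankRow.length; y++) {
            if (!blankRow[y] && top < 0) top = y;            // a block starts here
            boolean lastRow = (y == blankRow.length - 1);
            if (top >= 0 && (blankRow[y] || lastRow)) {
                int bottom = blankRow[y] ? y : y + 1;        // exclusive lower edge
                blocks.add(page.getSubimage(0, top, page.getWidth(), bottom - top));
                top = -1;                                    // the block ends
            }
        }

        // Save each block as its own TIFF; each file is then fed to the OCR SDK separately.
        for (int i = 0; i < blocks.size(); i++) {
            ImageIO.write(blocks.get(i), "TIFF", new File("block-" + i + ".tif"));
        }
    }
}

Splitting first and OCRing each block on its own means a bad column guess in one block can no longer corrupt the text of its neighbours.
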
