plan A:
The first idea that came to mind was to build a lexical analyser.
Parse the PDF file word by word: when I hit "Held", I should expect a location description right after it; when I hit "Present", I should expect a list of actors' names.
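The scan I had in mind is roughly the following (a minimal sketch, assuming the page text has already been extracted into a plain string; the stop conditions and field names are simplifications):

```java
import java.util.ArrayList;
import java.util.List;

public class MinutesScanner {

    // Very rough keyword-driven scan: after "Held" we collect a location,
    // after "Present" we collect a list of names. The idea that the next
    // keyword ends the previous field is an assumption for this sketch.
    public static void scan(String pageText) {
        String[] words = pageText.split("\\s+");
        StringBuilder location = new StringBuilder();
        List<String> actors = new ArrayList<>();

        String state = "NONE";
        for (String word : words) {
            if (word.equalsIgnoreCase("Held")) {
                state = "LOCATION";
            } else if (word.equalsIgnoreCase("Present")) {
                state = "ACTORS";
            } else if (state.equals("LOCATION")) {
                location.append(word).append(' ');
            } else if (state.equals("ACTORS")) {
                actors.add(word);
            }
        }

        System.out.println("Location: " + location.toString().trim());
        System.out.println("Actors:   " + actors);
    }
}
```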
To extract the text of the PDF files I used PDFBox. Surprisingly, it didn't work as I expected:
- The PDF parser can split a single word into multiple tokens, so my code would receive "hel" then "d" instead of "held". To make that work I would have to build a state machine that keeps track of the history of partial tokens, which I found infeasible.
- Some of these files are French documents. Applying a keyword analyser to a French document makes little sense: where the English documents say "Held on", the French ones just use "le", and "le" is an extremely frequent word (if you are not familiar with French, it is the equivalent of "the"), so it is useless as a marker.
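For reference, the extraction step itself is only a few lines with PDFBox (a sketch assuming PDFBox 2.x; the loading call changed in 3.x). The problem is not this code, it is the token stream that comes out of it:

```java
import java.io.File;
import java.io.IOException;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

public class PdfTextDump {
    public static void main(String[] args) throws IOException {
        // Load the OCR-generated PDF and dump its text layer.
        try (PDDocument document = PDDocument.load(new File(args[0]))) {
            PDFTextStripper stripper = new PDFTextStripper();
            System.out.println(stripper.getText(document));
        }
    }
}
```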
plan B:
This implies taking the positions of the text into account: using the coordinates I can build a set of blocks, and from that structure I can isolate the different logical blocks and then separate their contents.
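A sketch of that idea with PDFBox (still assuming 2.x): override PDFTextStripper.writeString to capture each text run with its coordinates, then group runs whose vertical gap is small into the same block. The gap threshold is an arbitrary assumption that would need tuning:

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.pdfbox.text.PDFTextStripper;
import org.apache.pdfbox.text.TextPosition;

public class BlockCollector extends PDFTextStripper {

    // One extracted text run with its position on the page.
    public static class Run {
        final String text;
        final float x, y;
        Run(String text, float x, float y) { this.text = text; this.x = x; this.y = y; }
    }

    public final List<Run> runs = new ArrayList<>();

    public BlockCollector() throws IOException {
        setSortByPosition(true); // ask PDFBox to emit text in reading order
    }

    @Override
    protected void writeString(String text, List<TextPosition> positions) throws IOException {
        TextPosition first = positions.get(0);
        runs.add(new Run(text, first.getXDirAdj(), first.getYDirAdj()));
        super.writeString(text, positions);
    }

    // Group consecutive runs into blocks whenever the vertical gap stays small.
    // The maxGap threshold is a guess and would differ per document layout.
    public List<List<Run>> blocks(float maxGap) {
        List<List<Run>> blocks = new ArrayList<>();
        List<Run> current = new ArrayList<>();
        Run previous = null;
        for (Run run : runs) {
            if (previous != null && Math.abs(run.y - previous.y) > maxGap) {
                blocks.add(current);
                current = new ArrayList<>();
            }
            current.add(run);
            previous = run;
        }
        if (!current.isEmpty()) blocks.add(current);
        return blocks;
    }
}
```

Calling getText(document) on this stripper drives writeString and fills runs; blocks(...) then yields the candidate logical blocks.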
This didn't work either :( The problem this time was the OCR that generated these PDFs: it is not 100% accurate with text positioning, so the input to my code is already buggy. As a consequence, the OCR merges two different columns, or splits one column into several.
plan C:
This means working on the TIFF files directly (ignoring the buggy PDFs).
I read the original TIFF file and scan it pixel by pixel to split it into sub-TIFFs. This way I control the block segmentation myself, without needing to let the OCR do it for me.
After identifying the basic blocks in the TIFF file, I call an OCR SDK to transform each sub-TIFF into text, which finally becomes the fields in the DB.
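A minimal sketch of that splitting step, assuming Java's ImageIO can read the TIFF (built in since Java 9, or via a TIFF plugin on older JDKs): it looks for near-blank horizontal bands and cuts the page into sub-images at those bands. The "almost no dark pixels" test is an assumption about these scans:

```java
import java.awt.image.BufferedImage;
import java.io.File;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import javax.imageio.ImageIO;

public class TiffSplitter {

    // Split a scanned page into horizontal bands separated by (near-)blank rows.
    public static List<BufferedImage> split(File tiff) throws IOException {
        BufferedImage page = ImageIO.read(tiff);
        int width = page.getWidth(), height = page.getHeight();

        // Mark rows that contain almost no dark pixels.
        boolean[] blankRow = new boolean[height];
        for (int y = 0; y < height; y++) {
            int dark = 0;
            for (int x = 0; x < width; x++) {
                int rgb = page.getRGB(x, y);
                int gray = ((rgb >> 16 & 0xFF) + (rgb >> 8 & 0xFF) + (rgb & 0xFF)) / 3;
                if (gray < 128) dark++;
            }
            blankRow[y] = dark < width / 200; // tunable guess for "blank"
        }

        // Cut the page wherever a run of non-blank rows ends.
        List<BufferedImage> blocks = new ArrayList<>();
        int start = -1;
        for (int y = 0; y < height; y++) {
            if (!blankRow[y] && start < 0) start = y;            // block begins
            if (start >= 0 && (blankRow[y] || y == height - 1)) { // block ends
                int end = blankRow[y] ? y : y + 1;
                blocks.add(page.getSubimage(0, start, width, end - start));
                start = -1;
            }
        }
        return blocks;
    }
}
```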
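The post doesn't name the OCR SDK, so purely as an illustration, here is how that last step could look with Tess4J (a Tesseract wrapper), reusing the TiffSplitter sketch above and collecting one candidate field per block. The datapath and language settings are assumptions:

```java
import java.awt.image.BufferedImage;
import java.io.File;
import java.util.ArrayList;
import java.util.List;

import net.sourceforge.tess4j.Tesseract;
import net.sourceforge.tess4j.TesseractException;

public class BlockOcr {
    public static void main(String[] args) throws Exception {
        Tesseract tesseract = new Tesseract();
        tesseract.setDatapath("/usr/share/tesseract-ocr/4.00/tessdata"); // assumed install path
        tesseract.setLanguage("eng+fra"); // the corpus mixes English and French documents

        List<String> fields = new ArrayList<>();
        for (BufferedImage block : TiffSplitter.split(new File(args[0]))) {
            try {
                fields.add(tesseract.doOCR(block).trim()); // one block -> one candidate DB field
            } catch (TesseractException e) {
                fields.add(""); // keep field positions aligned even when OCR fails on a block
            }
        }
        fields.forEach(System.out::println);
    }
}
```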