About Me

PhD Candidate at Purdue University, Computer Science.

Saturday, November 03, 2007

Extracting metadata from PDFs

I have been assigned to build a database from a set of PDF files. Typical fields are the title, location, date, and the actors mentioned in each PDF.
Plan A:
The first idea that came to mind was to write a lexical analyser.
I would parse the PDF text word by word: if the word is "Held", I expect a location description to follow; if it is "Present", I expect a list of actors' names.
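
A minimal sketch of the Plan A idea in Python. The trigger table, the function name extract_fields, the rule that a period ends a field, and the sample sentence are all assumptions made up for illustration; this is not the code from the project.

```python
import re

# Hypothetical trigger words and the field each one introduces.
TRIGGERS = {"held": "location", "present": "actors"}


def extract_fields(text):
    """Collect the words that follow each trigger word, up to the next period."""
    fields = {}
    current = None  # name of the field currently being filled, if any
    for token in re.findall(r"[A-Za-z']+|\.", text):
        if token == ".":
            current = None  # assumed rule: a period ends the current field
        elif token.lower() in TRIGGERS:
            current = TRIGGERS[token.lower()]
            fields[current] = []
        elif current is not None:
            fields[current].append(token)
    return {name: " ".join(words) for name, words in fields.items()}


sample = "Meeting held in Geneva. Present: Smith, Jones and Lee."
print(extract_fields(sample))
```

A scanner like this only works if every trigger word arrives as one intact token.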

To parse the text of the PDF files I used PDFBox. Surprisingly, it didn't work as I expected.
  1. The PDF parser can split a single word into multiple fragments, so my code would receive "hel" then "d" instead of "held". To make this work I would need a state machine that keeps track of the history, which I did not find feasible.
  2. Some of these files are French documents. Applying the same text analyser to a French document is rather pointless: in French, instead of "hold on", they write "le", and "le" is a very frequent string (if you are not familiar with French, it is equivalent to "the").
Plan B:
This plan takes the position of the text into account: from the positions I can build a set of blocks, and based on that structure I can isolate the different logical blocks and separate their contents.
This didn't work either :( The problem this time was the OCR that generated these PDFs. Its text positioning is not 100% accurate, so the input to my code is buggy: the OCR merges two different columns, or splits one column into several.
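
To make the position-based idea concrete, here is a small Python sketch that groups text fragments into columns by their x coordinate. The (x, y, text) tuple layout, the max_gap threshold, the assumption that y grows downward, and the sample fragments are all hypothetical; the actual project code is not shown in this post.

```python
def group_into_columns(words, max_gap=50):
    """Group (x, y, text) fragments into columns by their left x coordinate.

    Fragments are sorted by x; a new column starts whenever the gap to the
    previous fragment's x exceeds max_gap (an assumed threshold).
    """
    columns = []
    last_x = None
    for x, y, text in sorted(words):
        if last_x is None or x - last_x > max_gap:
            columns.append([])
        columns[-1].append((y, text))
        last_x = x
    # Read each column top to bottom, assuming y grows downward.
    return [[text for _, text in sorted(col)] for col in columns]


fragments = [(40, 10, "Date"), (300, 10, "12 May"), (42, 30, "Place"), (305, 30, "Cairo")]
print(group_into_columns(fragments))
```

The grouping depends entirely on the reported x positions, so inaccurate coordinates from the OCR lead directly to merged or split columns.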

Plan C:
Work on the TIFF files directly (ignoring the buggy PDFs).
I read the original TIFF file and scan it pixel by pixel to split it into sub-TIFFs. This way I control the blocks myself, without needing the OCR to do it for me.
After identifying the basic blocks in the TIFF file, I call an OCR SDK to turn the sub-TIFFs into text, which finally becomes the fields in the DB.
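
One common way to split a scanned page pixel by pixel is to cut it at fully blank rows and then at fully blank columns. The Python sketch below does that on a tiny binary image given as a list of rows (1 = ink). The post does not say which splitting method was actually used, so treat this as an illustration only; the function names and the sample page are made up, and loading a real TIFF would need an imaging library that is not shown here.

```python
def split_runs(profile):
    """Return (start, end) pairs for each run of non-zero entries; end is exclusive."""
    runs = []
    start = None
    for i, value in enumerate(profile):
        if value and start is None:
            start = i
        elif not value and start is not None:
            runs.append((start, i))
            start = None
    if start is not None:
        runs.append((start, len(profile)))
    return runs


def find_blocks(pixels):
    """Split a binary image into (top, bottom, left, right) boxes.

    Rows are cut at fully blank horizontal bands, then each band is cut at
    fully blank vertical gaps. Bottom and right are exclusive.
    """
    row_profile = [sum(row) for row in pixels]
    blocks = []
    for top, bottom in split_runs(row_profile):
        band = pixels[top:bottom]
        col_profile = [sum(col) for col in zip(*band)]
        for left, right in split_runs(col_profile):
            blocks.append((top, bottom, left, right))
    return blocks


page = [
    [1, 1, 0, 0, 1, 1],
    [1, 1, 0, 0, 1, 1],
    [0, 0, 0, 0, 0, 0],
    [1, 1, 1, 1, 1, 1],
]
print(find_blocks(page))
```

A real scan would probably need a noise threshold and a minimum gap width instead of requiring perfectly blank rows and columns.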

Friday, November 02, 2007

Joke of the year

While I was working on my Mac, I was concentrating hard and quite stressed...
I was trying to burn a CD. Well, it was my first time burning a CD on my MacBook...

I tried inserting CDs for 20 minutes (I tried 10 blank CDs), but nothing happened, so I decided that the blank CDs must be corrupted and that I should buy some new ones.
Luckily, I noticed that all that time I had been inserting the blank CDs into my desktop's DVD drive, not my MacBook... :D
I must have lost my mind somewhere...
grrrrrrrrrrrrr.

Saturday, October 27, 2007

Free Access To All Human Knowledge

As we are still organizing our local Egyptian team to host Wikimania in Alexandria in August 2008, I haven't yet decided which committee I will be responsible for. But I think I have ideas for many of the committees run by my teammates, and I hope to be a member of those committees as well as a core member of one committee.

Saturday, October 20, 2007

No ldd on Mac OS X

ldd is not available on Mac OS X. Instead, I used the otool -L command, which gives the same functionality as ldd: it lists the shared libraries a binary is linked against.
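
If the list of libraries is needed inside a script, the text printed by otool -L can be parsed. The Python sketch below works on a sample string; the function name parse_otool_output is hypothetical, and the library names and version numbers in the sample are made up to show the usual layout (first line names the file, each following line is an indented path followed by version info in parentheses).

```python
import re


def parse_otool_output(output):
    """Extract library paths from the text printed by `otool -L`."""
    libraries = []
    for line in output.splitlines()[1:]:  # first line names the inspected file
        match = re.match(r"\s+(\S.*?)\s+\(compatibility version", line)
        if match:
            libraries.append(match.group(1))
    return libraries


# Illustrative output only; the version numbers are invented for the example.
sample = (
    "/bin/ls:\n"
    "\t/usr/lib/libncurses.5.4.dylib (compatibility version 5.4.0, current version 5.4.0)\n"
    "\t/usr/lib/libSystem.B.dylib (compatibility version 1.0.0, current version 88.3.9)\n"
)
print(parse_otool_output(sample))
```

On a Mac, the input text could come from something like subprocess.run(["otool", "-L", path], capture_output=True, text=True).stdout.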

Monday, October 15, 2007

The Shipwreck

"We leave when we want to; but we arrive when God wills it."

The good thing about storms is that they free us from all worry. Against the raging elements there is nothing to be done. So we leave it to fate.
What we must understand is that the departure itself is due to God.

Sometimes your ship stops moving. It must have run aground on a sandbank or on reefs. Stop thinking that your ship can run before the storm for hundreds of kilometres without hitting anything.

Wednesday, October 10, 2007

Wikimania 2008 at ALEXANDRIA

"The Jury for Wikimania 2008 bids have met and are pleased to announce that Wikimania 2008 will be held in Alexandria, Egypt".

Yes, we did it, we won the bid :)
Statements about the jury's choice can be found here.

We have to meet now to plan the next steps. Hopefully it will be a good experience...

Wednesday, October 03, 2007

Friday, September 28, 2007

Wikimania 2008

For the second time I am on the bidding team for Wikimania, and I think we are close to winning it this year. The bidding team (except me) did a great job during the last few days. Africa should get its turn to host this event. South Africa has its own problems, and don't forget what is going on with the World Cup: they haven't yet started preparing for the event.

Tuesday, September 18, 2007

New Task - الباز أفندى


I have a new project as an assignment...
A huge amount of data for Dr Boutros Boutros-Ghali, which I should process and categorize, create metadata for, and then finally launch a site to browse the documents.

Whenever I look at the boxes on my desk I remember the old Arabic movies, where we used to see an employee handling tons of papers in front of him, a caricature of the Egyptian employee. There is also a well-known character in the movie "Ebn Hameedo", played by Tawfik El-De'n and known as "El Baz Afandy" :)) He was a skilled employee who pretended to know everything although he had never earned any certificates...

Actually I like the task... I find it has some kind of challenge... It tests how well you can handle these docs, how organized you can be, whether you know where a specific document fits, and how you can set up good procedures to organize the structure of such data.
Well, I guess it is a new kind of experience.

Friday, September 14, 2007

VISTA.....

As a new developer at the Bibliotheca Alexandrina, I was given a tour of VISTA (Virtual Immersive Science and Technology Applications). Bibliotheca Alexandrina is the first to provide researchers with such advanced visualization tools (in the Middle East, of course, I mean). The VISTA is also known as a CAVE (Cave Automatic Virtual Environment).

It displays 3D stereoscopic images generated by a PC cluster on four 10-ft × 10-ft vertical walls and the floor. The cluster is five workstations linked together. The four projectors used in the VISTA render 1400 × 1050 pixels each and have a very bright light output rated at 7200 ANSI lumens. This can be very good for simulating real phenomena. There are a lot of applications built for VISTA in aerodynamics, chemistry, medicine, etc.

One of the applications, which I find quite impressive, is a 3D chart visualizer. It gives you the nice feeling that you are running along the axes, or between the points of the plot.

Actually, I have no pictures of the VISTA... but I can give you an approximation:
imagine Gapminder in 3D. That's what you get when you use VISTA_3dCharts.