Re: PDF Parsing and Indexing - Mailing list pgsql-general

From Doug McNaught
Subject Re: PDF Parsing and Indexing
Msg-id m3bsnpjp2h.fsf@belphigor.mcnaught.org
In response to PDF Parsing and Indexing  ("Raymond" <support@bigriverinfotech.com>)
Responses Re: PDF Parsing and Indexing  (Mike Castle <dalgoda@ix.netcom.com>)
List pgsql-general
"Raymond" <support@bigriverinfotech.com> writes:

> I need to parse / index Adobe PDF content and store both the document and
> index in Postgres.
>
> Has anybody had experience in doing this?

I can give you some information that may get you started.

PDF is a (mostly) open standard.  The spec is available from Adobe's
website, and there are libraries out there (some free, some
commercial) to help you work with it.  That said, there are several
gotchas:

* It is possible to both compress and encrypt PDF content.  You need
  the proper data filters to handle documents of these types, and some
  may only be available commercially.

* PDF is a page description language like PostScript (except it does
  not include a Turing-complete programming language as well).  It
  provides for arbitrary placement of each glyph on the page.  So the
  word "this" might be encoded in the file as something like:

    moveto(100, 200)
    draw("t")
    moveto(105, 200)
    draw("h")
    moveto(112, 200)
    draw("i")
    moveto(115, 200)
    draw("s")

  You can see that it would be hard to index something like this in
  any kind of useful way, because the file records glyph positions
  rather than words.
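To make the glyph-placement problem concrete, here is a rough sketch of
how one might stitch positioned glyphs back into words before indexing.
It assumes some PDF library has already given you (x, y, character)
tuples. The function name glyphs_to_words and the max_gap and
line_tolerance thresholds are made up for illustration, and a real
document would need font metrics rather than a fixed gap.

```python
from typing import List, Tuple

# One positioned glyph: (x, y, character). Hypothetical input format.
Glyph = Tuple[float, float, str]


def glyphs_to_words(glyphs: List[Glyph],
                    max_gap: float = 10.0,
                    line_tolerance: float = 2.0) -> List[str]:
    """Group positioned glyphs into words using a crude distance heuristic.

    Assumptions: glyphs on the same line share roughly the same y, text
    runs left to right, and a jump of more than max_gap between glyph
    origins means a word break.
    """
    # Cluster glyphs into lines, top of the page first (PDF y grows upward).
    lines: List[List[Glyph]] = []
    for glyph in sorted(glyphs, key=lambda g: -g[1]):
        if lines and abs(lines[-1][0][1] - glyph[1]) <= line_tolerance:
            lines[-1].append(glyph)
        else:
            lines.append([glyph])

    words: List[str] = []
    for line in lines:
        line.sort(key=lambda g: g[0])  # left to right within the line
        prev_x, _, current = line[0]
        for x, _y, ch in line[1:]:
            if x - prev_x > max_gap:
                words.append(current)  # large jump: start a new word
                current = ch
            else:
                current += ch
            prev_x = x
        words.append(current)
    return words


# The glyphs from the example above, deliberately out of order.
sample = [(112, 200, "i"), (100, 200, "t"), (115, 200, "s"), (105, 200, "h")]
print(glyphs_to_words(sample))
```

The words that come out of something like this are what you would
actually feed to an index. The thresholds will need tuning per
document, since nothing in the file says where one word ends.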

PDF files are binary and can be arbitrarily large, so I would probably
store them in Postgres as large objects.
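
As a sketch of the large-object approach from a client program: the
function below copies a file into a new large object in chunks and
returns its OID, which you could then keep in an ordinary table next to
the extracted text. It assumes a psycopg2-style connection (one with
lobject() and commit() methods). The function name and chunk size are
invented for illustration, and other drivers expose large objects
differently.

```python
def store_pdf_as_large_object(conn, pdf_path, chunk_size=64 * 1024):
    """Copy the file at pdf_path into a new large object; return its OID.

    Assumes conn behaves like a psycopg2 connection. Large-object
    operations must run inside a transaction, which psycopg2 opens
    implicitly when autocommit is off (its default).
    """
    lobj = conn.lobject(0, "wb")  # OID 0 asks the server to assign one
    try:
        with open(pdf_path, "rb") as src:
            while True:
                chunk = src.read(chunk_size)
                if not chunk:
                    break
                lobj.write(chunk)  # stream, so big PDFs never sit in memory
        oid = lobj.oid
    finally:
        lobj.close()
    conn.commit()
    return oid
```

You would call it as store_pdf_as_large_object(conn, "report.pdf"),
where conn comes from psycopg2.connect(...) and the file name is a
placeholder.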

I recommend you download and at least skim the PDF spec (a 500-page
PDF, natch) to get an idea of what you're in for in the general case.

-Doug
--
The rain man gave me two cures; he said jump right in,
The first was Texas medicine--the second was just railroad gin,
And like a fool I mixed them, and it strangled up my mind,
Now people just get uglier, and I got no sense of time...          --Dylan
