Thread: PDF Parsing and Indexing

PDF Parsing and Indexing

From
"Raymond"
Date:
I need to parse / index Adobe PDF content and store both the document and
index in Postgres.

Has anybody had experience in doing this?

Raymond


Re: PDF Parsing and Indexing

From
Doug McNaught
Date:
"Raymond" <support@bigriverinfotech.com> writes:

> I need to parse / index Adobe PDF content and store both the document and
> index in Postgres.
>
> Has anybody had experience in doing this?

I can give you some information that may get you started.

PDF is a (mostly) open standard.  The spec is available from Adobe's
website, and there are libraries out there (some free, some
commercial) to help you work with it.  That said, there are several
gotchas:

* It is possible to both compress and encrypt PDF content.  You need
  the proper data filters to handle documents of these types, and some
  may only be available commercially.

* PDF is a page description language like PostScript (except it does
  not include a Turing-complete programming language as well).  It
  provides for arbitrary placement of each glyph on the page.  So the
  word "this" might be encoded in the file as something like:

moveto(100, 200)
draw("t")
moveto(105, 200)
draw("h")
moveto(112, 200)
draw("i")
moveto(115, 200)
draw("s")

  You can see that it would be hard to index something like this in any
  kind of useful way.
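
  To make the problem concrete, here is a minimal sketch of how text
  might be recovered from individually placed glyphs.  It is an
  illustration, not how any particular PDF library works.  The
  (x, y, character) tuples, the function name glyphs_to_words, and the
  max_gap threshold are all assumptions made for this example; a real
  extractor would also need font metrics, rotation, and encoding tables.

```python
from typing import List, Tuple

# Hypothetical representation of one placed glyph: (x, y, character).
Glyph = Tuple[float, float, str]


def glyphs_to_words(glyphs: List[Glyph], max_gap: float = 10.0) -> List[str]:
    """Group positioned glyphs into words by horizontal proximity.

    Assumes glyphs on the same text line share exactly the same y value,
    and that a horizontal jump larger than max_gap separates two words.
    Both assumptions are simplifications for illustration only.
    """
    words: List[str] = []
    current = ""
    prev = None  # (x, y) of the previously visited glyph

    # PDF y coordinates grow upward, so sort by descending y, then by x.
    for x, y, ch in sorted(glyphs, key=lambda g: (-g[1], g[0])):
        if prev is not None:
            prev_x, prev_y = prev
            new_line = y != prev_y
            big_gap = (x - prev_x) > max_gap
            if (new_line or big_gap) and current:
                words.append(current)
                current = ""
        current += ch
        prev = (x, y)

    if current:
        words.append(current)
    return words


# The glyph positions from the example above, deliberately out of order.
example = [(112, 200, "i"), (100, 200, "t"), (115, 200, "s"), (105, 200, "h")]
print(glyphs_to_words(example))
```

  The choice of max_gap is the fragile part: with no font metrics, a
  wide letter and a narrow word space can look the same, which is why
  output from some PDF generators is so hard to index.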

PDF files are binary and can be arbitrarily large, so I would probably
store them in Postgres as large objects.

I recommend you download and at least skim the PDF spec (a 500-page
PDF, natch) to get an idea of what you're in for in the general case.

-Doug
--
The rain man gave me two cures; he said jump right in,
The first was Texas medicine--the second was just railroad gin,
And like a fool I mixed them, and it strangled up my mind,
Now people just get uglier, and I got no sense of time...          --Dylan

Re: PDF Parsing and Indexing

From
Mike Castle
Date:
On Fri, Jun 15, 2001 at 07:33:42PM -0400, Doug McNaught wrote:
> "Raymond" <support@bigriverinfotech.com> writes:
> > Has anybody had experience in doing this?

Wonder if Google's solution to this is available.

>   provides for arbitrary placement of each glyph on the page.  So the
>   word "this" might be encoded in the file as something like:
>
> moveto(100, 200)
> draw("t")
> moveto(105, 200)
> draw("h")
> moveto(112, 200)
> draw("i")
> moveto(115, 200)
> draw("s")
>
>   You can see that it would be hard to index something like this in any
>   kind of useful way.

PDFs generated by MS utilities (Word, I think?) are notoriously bad for
this.  Big surprise.

mrc
--
     Mike Castle      dalgoda@ix.netcom.com      www.netcom.com/~dalgoda/
    We are all of us living in the shadow of Manhattan.  -- Watchmen
fatal ("You are in a maze of twisty compiler features, all different"); -- gcc

Re: PDF Parsing and Indexing

From
jdassen@cistron.nl (J.H.M. Dassen (Ray))
Date:
Mike Castle <dalgoda@ix.netcom.com> wrote:
> On Fri, Jun 15, 2001 at 07:33:42PM -0400, Doug McNaught wrote:
>> "Raymond" <support@bigriverinfotech.com> writes:
>> > Has anybody had experience in doing this?
>
> Wonder if Google's solution to this is available.

AOL. I suspect it is largely based on existing free software that deals with
PDF, like pstotext,
    http://www.research.compaq.com/SRC/virtualpaper/pstotext.html
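
Once a tool of that kind has produced plain text, the indexing half of
the original question becomes ordinary text indexing.  A minimal sketch
of an inverted index follows.  The function name build_index, the
integer document ids, and the tokenizing rule are assumptions made for
this example; nothing here reflects how pstotext or Google actually
work.

```python
import re
from collections import defaultdict
from typing import Dict, Set


def build_index(docs: Dict[int, str]) -> Dict[str, Set[int]]:
    """Map each lowercased word to the set of document ids containing it.

    docs maps a hypothetical document id to its already-extracted text.
    Tokenizing on runs of ASCII letters and digits is a simplification.
    """
    index: Dict[str, Set[int]] = defaultdict(set)
    for doc_id, text in docs.items():
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word].add(doc_id)
    return dict(index)


# Usage: two short extracted texts, keyed by made-up document ids.
idx = build_index({1: "PDF parsing", 2: "Indexing PDF files"})
print(sorted(idx["pdf"]))
```

Each (word, document id) pair could then be stored as a row in an
ordinary table alongside the document itself, though that layout is
only one possible choice.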

Ray
--
Do Microsoft's TCO calculations include TC of downtime?