Re: PDF Parsing and Indexing - Mailing list pgsql-general

From Mike Castle
Subject Re: PDF Parsing and Indexing
Date
Msg-id 20010615170202.I26165@thune.mrc-home.com
In response to Re: PDF Parsing and Indexing  (Doug McNaught <doug@wireboard.com>)
List pgsql-general
On Fri, Jun 15, 2001 at 07:33:42PM -0400, Doug McNaught wrote:
> "Raymond" <support@bigriverinfotech.com> writes:
> > Has anybody had experience in doing this?

Wonder if Google's solution to this is available.

>   provides for arbitrary placement of each glyph on the page.  So the
>   word "this" might be encoded in the file as something like:
>
> moveto(100, 200)
> draw("t")
> moveto(105, 200)
> draw("h")
> moveto(112, 200)
> draw("i")
> moveto(115, 200)
> draw("s")
>
>   You can see that it would be hard to index something like this in any
>   kind of useful way.

PDFs generated from MS utilities (Word, I think?) are notoriously bad for
this.  Big surprise.
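
For illustration, here is a rough sketch of how an indexer might stitch
glyphs like the ones quoted above back into words.  It assumes the glyphs
have already been extracted as (x, y, char) tuples; the function name
glyphs_to_words and the max_gap threshold are made up for this example and
do not come from any PDF library.  Real files are messier (kerning, rotated
text, custom font encodings), so treat this as a sketch, not a solution.

```python
from collections import defaultdict


def glyphs_to_words(glyphs, max_gap=10):
    """Group (x, y, char) glyphs into words.

    Glyphs sharing a baseline y are sorted by x; a new word starts
    whenever the horizontal step between neighbours exceeds max_gap.
    max_gap is an assumed threshold, not something from the PDF spec.
    """
    lines = defaultdict(list)
    for x, y, ch in glyphs:
        lines[y].append((x, ch))

    words = []
    # PDF y coordinates grow upward, so visit the highest baseline first.
    for y in sorted(lines, reverse=True):
        row = sorted(lines[y])
        current = row[0][1]
        for (prev_x, _), (x, ch) in zip(row, row[1:]):
            if x - prev_x > max_gap:
                words.append(current)
                current = ch
            else:
                current += ch
        words.append(current)
    return words


# The glyph positions from the quoted example.
glyphs = [(100, 200, "t"), (105, 200, "h"), (112, 200, "i"), (115, 200, "s")]
print(glyphs_to_words(glyphs))  # ['this']
```

Once the words are rebuilt, the resulting text can be indexed like any
other plain-text document.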

mrc
--
     Mike Castle      dalgoda@ix.netcom.com      www.netcom.com/~dalgoda/
    We are all of us living in the shadow of Manhattan.  -- Watchmen
fatal ("You are in a maze of twisty compiler features, all different"); -- gcc
