Re: Replacement for Oracle Text - Mailing list pgsql-general

From	Stephen Davies
Subject	Re: Replacement for Oracle Text
Date	February 20, 2016 06:32:53
Msg-id	56C80876.1080808@sdc.com.au
In response to	Re: Replacement for Oracle Text (Chris Travers <chris.travers@gmail.com>)
List	pgsql-general

On 20/02/16 16:21, Chris Travers wrote:
> A more general way would be to have a function which takes a pdf in and
> returns the text.  Mark it immutable.
>
> Then you can index the output of converting that text to a tsvector.
>
> You may want to pull everything into a tsvector column for ease of review, but
> functional indexes also make that less important
>
> On Sat, Feb 20, 2016 at 1:10 AM, Stephen Davies <sdavies@sdc.com.au
> <mailto:sdavies@sdc.com.au>> wrote:
>
>     On 20/02/16 00:24, Bruce Momjian wrote:
>
>         On Fri, Feb 19, 2016 at 02:49:16PM +0100, s d wrote:
>
>             On 19 February 2016 at 14:19, Bruce Momjian <bruce@momjian.us
>             <mailto:bruce@momjian.us>> wrote:
>                   >     Ah, no. That's not possible
>                   >
>                   >
>                   > ...not possible, Yet.
>                   >
>                   > PostgreSQL grows by adding the features people need and
>             its changing
>                   rapidly.
>
>                   I wonder if PLPerl could be used to extract the words from a PDF
>                   document and create a tsvector column from it.
>
>                I don't know about PLPerl(I'm pretty sure it could be used for
>             this purpose,
>             though.).  On the other hand I've written code for this in Python
>             which should
>             be easy to adapt for PLPython, if necessary.
>
>
>         Right, so you would write a PL/Perl or PL/Python trigger function that
>         would populate the tsvector column on every INSERT or UPDATE.
>
>     FWIW, I just use pdftotext in my CGI.
>
>     --
>     =============================================================================
>     Stephen Davies Consulting P/L                             Phone: 08-8177 1595
>     Adelaide, South Australia.                                Mobile:040 304 0583
>
>
>
>     --
>     Sent via pgsql-general mailing list (pgsql-general@postgresql.org
>     <mailto:pgsql-general@postgresql.org>)
>     To make changes to your subscription:
>     http://www.postgresql.org/mailpref/pgsql-general
>
>
>
>
> --
> Best Wishes,
> Chris Travers
>
> Efficito:  Hosted Accounting and ERP.  Robust and Flexible.  No vendor lock-in.
> http://www.efficito.com/learn_more

I reckon my approach is simpler and easier (given web-based data entry).
I get all the metadata plus the PDF BLOB in one HTTP request, extract the
text, and do the insert and all indexing, including the tsvector, in one PG
request. It also makes it easier to handle BLOB types other than PDF in the same
CGI script, as I just include the extracted text in the PG request.
There are readily callable text extraction utilities similar to pdftotext for
all the BLOB types that I see.
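
As a rough sketch of that flow in Python (not the actual CGI script): the
"documents" table, its column names, the 'english' text search configuration
and the psycopg2-style %s placeholders are all assumptions made for
illustration. Only the pdftotext entry comes from this thread; other BLOB
types would get their own entry in the table.

```python
import os
import subprocess
import tempfile

# Assumed mapping from MIME type to an extraction command. "{path}" is replaced
# with the name of a temporary file holding the BLOB. Add one entry per BLOB
# type, pointing at whatever extraction utility is installed for it.
EXTRACTORS = {
    "application/pdf": ["pdftotext", "{path}", "-"],  # "-" sends text to stdout
}


def extract_text(blob, mime_type, extractors=EXTRACTORS):
    """Run the external utility registered for mime_type and return its output."""
    if mime_type not in extractors:
        raise ValueError("no text extractor for " + mime_type)
    # The utilities take a file name, so write the BLOB to a temporary file.
    fd, path = tempfile.mkstemp()
    try:
        with os.fdopen(fd, "wb") as handle:
            handle.write(blob)
        cmd = [arg.replace("{path}", path) for arg in extractors[mime_type]]
        result = subprocess.run(cmd, stdout=subprocess.PIPE, check=True)
        return result.stdout.decode("utf-8", errors="replace")
    finally:
        os.remove(path)


def build_insert(title, blob, text):
    """Return one INSERT that stores the metadata, the BLOB and the tsvector."""
    # Hypothetical table and columns; to_tsvector() is evaluated by PostgreSQL,
    # so the row and its search vector are written in a single request.
    sql = (
        "INSERT INTO documents (title, body, body_text, body_tsv) "
        "VALUES (%s, %s, %s, to_tsvector('english', %s))"
    )
    return sql, (title, blob, text, text)
```

The statement and parameters from build_insert() would then be passed to the
database driver's execute call, e.g. cursor.execute(sql, params) with a
driver that uses %s placeholders.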

With a function, I would need either a separate function per BLOB type or an
extra BLOB-type parameter, with separate extraction logic for each type inside
the function.


--
=============================================================================
Stephen Davies Consulting P/L                             Phone: 08-8177 1595
Adelaide, South Australia.                                Mobile:040 304 0583
