Re: Importing text file into a TEXT field - Mailing list pgsql-general

From Bruno Lavoie
Subject Re: Importing text file into a TEXT field
Date
Msg-id 4918471C.1020703@gmail.com
Whole thread Raw
In response to Re: Importing text file into a TEXT field  (Sam Mason <sam@samason.me.uk>)
List pgsql-general
Hello Sam,

Thanks for your Python script! It's on my todo list to learn this scripting language.  I've done a mini script in Perl for my testing purposes and then test full text search!

Here is my table definition:
test1=# \d pdfsys.document;
                 Table "pdfsys.document"
    Column     |            Type             | Modifiers
---------------+-----------------------------+-----------
 document_id   | integer                     | not null
 original_file | character varying(4000)     | not null
 pdf_text      | text                        | not null
 pdf_ts        | tsvector                    |
 md5           | character varying(32)       |
 date_added    | timestamp without time zone |
Indexes:
    "documents_pkey" PRIMARY KEY, btree (document_id)


And my script, that get all .pdf documents from a directory and extract text, store it and update tsvector:
#!/usr/bin/perl
use strict;
use warnings;

use DBI;

my $dbname = '***********';
my $user   = '***********';
my $pass   = '***********';

my $dbh = DBI->connect("dbi:Pg:dbname=$dbname", $user, $pass, {AutoCommit => 0, RaiseError => 1, PrintError => 0});

opendir(DIR, "./files/");
my @files = grep(/\.pdf$/,readdir(DIR));
closedir(DIR);

undef $/;

my $sth = $dbh->prepare("INSERT INTO pdfsys.document(document_id, original_file, pdf_text) VALUES (nextval('pdfsys.document_id'), ?, ?)");
foreach my $file (@files) {
    print "*** $file ***\n";

    open (PS, 'pdftotext -layout -nopgbrk -enc UTF-8 ./files/'. $file .' - |');
    my $content = <PS>;
    close (PS);

    $sth->execute($file, $content);
}

$dbh->do("update pdfsys.document set pdf_ts = to_tsvector('english', pdf_text) where pdf_ts is null");

$dbh->commit();
$dbh->disconnect;



It works fine.... I need to develop this more sophisticated: different language per files, file versionning, etc...

Bruno Lavoie

Sam Mason a écrit :
On Fri, Nov 07, 2008 at 01:20:27PM -0500, Bruno Lavoie wrote: 
The intent is to use pdftotext and store the resulting text in datbase 
for full text search purposes... I'm trying to develop a mini content 
server where I'll put pdf documents to make it searchable.   
I've not tried to do this sort of thing before; but the FTS code (native
in PG 8.3, contrib modules before this version) sounds like what you
want to be using.  As far as getting the data in, you're going to have
to write a bit of code.  A quick hack suggests that you can get things
going in a small amount of Python code:
 import sys; import psycopg2; conn = psycopg2.connect(""); cur = conn.cursor(); cur.execute("INSERT INTO tbl (tsvec) SELECT to_tsvector(%s);",             [sys.stdin.read()]); conn.commit();

You can then do:
 pdftotext file.pdf - | python script.py

One performance issue with psycopg2 is that it always expands the SQL;
you may want to find something that uses PQexecParams() underneath so
you spend less time escaping everything and then having PG undo that
work.

 Sam
 

pgsql-general by date:

Previous
From: "Dmitry Teslenko"
Date:
Subject: LIKE, "=" and fixed-width character fields
Next
From: Bruno Lavoie
Date:
Subject: Re: Importing text file into a TEXT field