Thread: Importing text file into a TEXT field
Hello,

Is there a way to easily import a relatively huge text file into a table column? I'd like to use psql, and I've looked at the lo_* commands, but I can't figure out how to import my text file into my TEXT column. My last resort is to write a little script to load the text file into a variable and then insert it into the database.

Thanks
Bruno Lavoie
On Fri, Nov 07, 2008 at 11:15:43AM -0500, Bruno Lavoie wrote:
> Is there a way to easily import a relatively huge text file into a
> table column?

How big is "relatively huge"?

> I'd like to use psql, and I've looked at the lo_* commands, but I
> can't figure out how to import my text file into my TEXT column.

The "lo_*" commands are for working with large objects; these have somewhat unusual semantics compared to normal column data in PG. If you're routinely expecting files of more than, say, one MB then they're probably a good way to go, but it's a lot more work getting them going in the first place.

> My last resort is to write a little script to load the text file into
> a variable and then insert it into the database.

If you want to just get the data into a TEXT column as quickly as possible, I'd probably just write a little bit of code to perform the escaping that PG requires on the file. You can then simply do:

    COPY tbl (col) FROM '/path/to/escaped/file';

I'm not sure if this is really what you want, though! "Enormous" TEXT columns can be a bit fiddly to work with.

Sam
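For illustration, a minimal sketch of the escaping step Sam describes, assuming PostgreSQL's default text COPY format (each backslash must be doubled, and literal newlines, tabs, and carriage returns replaced by their \n, \t, and \r escapes, so the whole file becomes a single one-line column value); the script name is a hypothetical helper, not from the thread:

    # escape_for_copy.py -- hypothetical helper, not from the thread
    import sys

    data = sys.stdin.read()
    escaped = (data.replace('\\', '\\\\')  # backslashes first, so the
                                           # escapes added below survive
                   .replace('\n', '\\n')   # literal newline -> \n
                   .replace('\t', '\\t')   # literal tab (COPY's delimiter) -> \t
                   .replace('\r', '\\r'))  # literal carriage return -> \r
    sys.stdout.write(escaped + '\n')       # one line = one row for COPY

Something like `python escape_for_copy.py < raw.txt > escaped.txt` would then prepare the file for the COPY statement above. Note that COPY ... FROM 'file' reads from the server's filesystem; from psql, the client-side \copy variant does the same thing with a file on the client.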
Hello,

The intent is to use pdftotext and store the resulting text in the database for full text search purposes... I'm trying to develop a mini content server where I'll put PDF documents to make them searchable.

Generally, the PDFs run from 500 to 3000 pages, resulting in text from 500 KB to 2 MB...

I'm also looking at open source projects like Alfresco, if it can serve my purpose with ease... Has anyone used it? Comments are welcome.

Thanks
Bruno Lavoie
On Fri, Nov 07, 2008 at 01:20:27PM -0500, Bruno Lavoie wrote:
> The intent is to use pdftotext and store the resulting text in the
> database for full text search purposes... I'm trying to develop a mini
> content server where I'll put PDF documents to make them searchable.

I've not tried to do this sort of thing before, but the FTS code (native in PG 8.3, contrib modules before this version) sounds like what you want to be using.

As far as getting the data in, you're going to have to write a bit of code. A quick hack suggests that you can get things going in a small amount of Python code:

    import sys
    import psycopg2

    conn = psycopg2.connect("")
    cur = conn.cursor()
    cur.execute("INSERT INTO tbl (tsvec) SELECT to_tsvector(%s);",
                [sys.stdin.read()])
    conn.commit()

You can then do:

    pdftotext file.pdf - | python script.py

One performance issue with psycopg2 is that it always expands the SQL; you may want to find something that uses PQexecParams() underneath so you spend less time escaping everything and then having PG undo that work.

Sam
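For context, a minimal sketch of the kind of schema Sam's snippet assumes (the table tbl, its columns, and the index name are illustrative, not from the thread; GIN indexes over tsvector are built in from PG 8.3):

    import psycopg2

    conn = psycopg2.connect("")
    cur = conn.cursor()
    # illustrative schema: raw text plus a tsvector column for searching
    cur.execute("""CREATE TABLE tbl (
                       id    serial PRIMARY KEY,
                       body  text NOT NULL,
                       tsvec tsvector)""")
    # a GIN index keeps @@ searches fast once the table grows
    cur.execute("CREATE INDEX tbl_tsvec_idx ON tbl USING gin (tsvec)")
    conn.commit()
    # example search
    cur.execute("SELECT id FROM tbl WHERE tsvec @@ to_tsquery('english', %s)",
                ['content & server'])
    print(cur.fetchall())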
Bruno Lavoie, 07.11.2008 19:20:
> The intent is to use pdftotext and store the resulting text in the
> database for full text search purposes...

If you are not bound to "native" Postgres tools, you might want to take a look at my SQL Workbench/J (http://www.sql-workbench.net).

It can insert the contents of files (located on the client) into tables. You can do this using an extended SQL syntax:

    UPDATE pdf_table
       SET text_content = {$clobfile=c:/temp/converted.txt encoding=utf8}
     WHERE id = 42;

(Of course this statement cannot be run with psql.)

You could also bulk-upload several files at once using my flat-file import (http://www.sql-workbench.net/manual/command-import.html). Assuming the table has two columns (id, text_content), the flat file would look like this:

    id|text_content
    1|content_1.txt
    2|content_2.txt
    3|content_3.txt

and the import would store the content of the files, not the literal 'content_1.txt', in the text_content column. You can either insert or update the content, depending on your needs. You could even store the original PDF file if the table contains a bytea column for the blob data.

Contact me offline (contact information on my homepage) if you need help.

Regards
Thomas
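If you go the flat-file route, the control file itself is easy to generate; a minimal sketch (the ./text directory and the import.txt output name are assumptions for illustration, not part of the tool):

    import os

    # build the id|text_content control file described above from a
    # directory of already-converted .txt files (names are illustrative)
    files = sorted(f for f in os.listdir('./text') if f.endswith('.txt'))
    with open('import.txt', 'w') as out:
        out.write('id|text_content\n')
        for i, name in enumerate(files, start=1):
            out.write('%d|%s\n' % (i, name))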
Hello Sam,
Thanks for your Python script! It's on my todo list to learn this scripting language. I've done a mini script in Perl for my testing purposes, and then tested full text search!
Here is my table definition:
test1=# \d pdfsys.document
            Table "pdfsys.document"
    Column     |            Type             | Modifiers
---------------+-----------------------------+-----------
 document_id   | integer                     | not null
 original_file | character varying(4000)     | not null
 pdf_text      | text                        | not null
 pdf_ts        | tsvector                    |
 md5           | character varying(32)       |
 date_added    | timestamp without time zone |
Indexes:
    "documents_pkey" PRIMARY KEY, btree (document_id)
And my script, which gets all .pdf documents from a directory, extracts the text, stores it, and updates the tsvector:
#!/usr/bin/perl
use strict;
use warnings;
use DBI;

my $dbname = '***********';
my $user   = '***********';
my $pass   = '***********';

my $dbh = DBI->connect("dbi:Pg:dbname=$dbname", $user, $pass,
                       {AutoCommit => 0, RaiseError => 1, PrintError => 0});

opendir(my $dir, './files/') or die "can't open ./files/: $!";
my @files = grep { /\.pdf$/ } readdir($dir);
closedir($dir);

undef $/;    # slurp mode: read each pdftotext stream in one go

my $sth = $dbh->prepare(
    "INSERT INTO pdfsys.document (document_id, original_file, pdf_text)
     VALUES (nextval('pdfsys.document_id'), ?, ?)");

foreach my $file (@files) {
    print "*** $file ***\n";
    # list-form open avoids the shell, so filenames with spaces are safe
    open(my $ps, '-|', 'pdftotext', '-layout', '-nopgbrk', '-enc', 'UTF-8',
         "./files/$file", '-') or die "can't run pdftotext on $file: $!";
    my $content = <$ps>;
    close($ps);
    $sth->execute($file, $content);
}

$dbh->do("UPDATE pdfsys.document
             SET pdf_ts = to_tsvector('english', pdf_text)
           WHERE pdf_ts IS NULL");
$dbh->commit();
$dbh->disconnect;
It works fine... I need to make this more sophisticated: different language per file, file versioning, etc...
Bruno Lavoie
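On the "different language per file" point: one possible approach, sketched under the assumption of a hypothetical lang column on pdfsys.document holding a text search configuration name ('english', 'french', ...), is to cast it to regconfig when building the tsvector:

    import psycopg2

    conn = psycopg2.connect("dbname=test1")
    cur = conn.cursor()
    # hypothetical `lang` column names the text search configuration
    # to use for each row ('english', 'french', ...)
    cur.execute("""UPDATE pdfsys.document
                      SET pdf_ts = to_tsvector(lang::regconfig, pdf_text)
                    WHERE pdf_ts IS NULL""")
    conn.commit()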
Hello Thomas,

Nice tool!

Thanks for your help... For batch and automation purposes, I've done a small script... (check out my last post in the thread from a few minutes ago)

Bruno Lavoie
I don't want to receive any more emails on these topics in which I am not included, thanks.....