Thread: How to store text files in PostgreSQL?
Hello,

I am very new to the PostgreSQL database; I've used a little MySQL previously.

My task is to store a lot (10^5) of small (<10 MB) text files in the database,
with the ability to restore them back to the hard drive on demand. That means
I need two functions: first, grab a file from a directory, store it in the
database and delete it from the disk; second, recreate it back on the disk.

1/ Is it possible?
2/ Could you give me some quick tips on how to manage it from the start, so
   that I know what to look for in the manual?

Thank you for your time.
Dimitry
On Saturday 6. June 2009, DimitryASuplatov wrote:
> My task is to store a lot (10^5) of small (<10 MB) text files in the
> database with the ability to restore them back to the hard drive on
> demand.
>
> That means that I need two functions. First - grab file from the
> directory, store it in the database and delete from the disk; second -
> recreate it back to the disk.
>
> 1/ Is it possible?
> 2/ Could you give me some quick tips on how to manage it from the
>    start so that I knew what to look for in the manual?

About 15 lines of Perl code should do that.

-- 
Leif Biberg Kristensen | Registered Linux User #338009
Me And My Database: http://solumslekt.org/blog/
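Something along the lines of that "15 lines of Perl" could look like the
following - a rough, untested sketch using DBI/DBD::Pg, with a made-up table
docs(filename text, content text) standing in for whatever schema you end up
with:

#!/usr/bin/perl
# Rough sketch only (untested): store a file in the database and remove it
# from disk, then restore it on demand. The table docs(filename text,
# content text) is just a placeholder.
use strict;
use warnings;
use DBI;

my $dbh = DBI->connect('dbi:Pg:dbname=mypdb', '', '', { RaiseError => 1 });

sub store_file {
    my ($path) = @_;
    open my $fh, '<', $path or die "cannot open $path: $!";
    my $content = do { local $/; <$fh> };   # slurp the whole file
    close $fh;
    $dbh->do('INSERT INTO docs (filename, content) VALUES (?, ?)',
             undef, $path, $content);
    unlink $path or warn "could not delete $path: $!";
}

sub restore_file {
    my ($path) = @_;
    my ($content) = $dbh->selectrow_array(
        'SELECT content FROM docs WHERE filename = ?', undef, $path);
    open my $fh, '>', $path or die "cannot write $path: $!";
    print {$fh} $content;
    close $fh;
}

store_file('/tmp/example.txt');
restore_file('/tmp/example.txt');

Using placeholders (?) rather than interpolating the file contents into the
SQL string also sidesteps the quoting problems discussed further down the
thread.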
On 06/06/2009 11:41, DimitryASuplatov wrote:
> My task is to store a lot (10^5) of small (<10 MB) text files in the
> database with the ability to restore them back to the hard drive on
> demand.
>
> That means that I need two functions. First - grab file from the
> directory, store it in the database and delete from the disk; second -
> recreate it back to the disk.
>
> 1/ Is it possible?

Yes. You can do this in your favourite scripting language - PHP, Perl, etc.

> 2/ Could you give me some quick tips on how to manage it from the start
>    so that I knew what to look for in the manual?

You haven't given much detail on your needs, but something like this should
do you:

create table the_files (
  file_id serial,
  file_data text,
  ...[any other metadata you need]...
  constraint files_pk primary key (file_id)
);

The "serial" type is a pseudo-type which creates a sequence, which in turn
supplies incrementing integers, while "text" allows you to store strings of
arbitrary length - look them up in the docs under "Data types".

Really, though, I think your question is more about how to write the script
which reads the files from the filesystem and restores them there - this
depends on what language you use.

Ray.

------------------------------------------------------------------
Raymond O'Donnell, Director of Music, Galway Cathedral, Ireland
rod@iol.ie
Galway Cathedral Recitals: http://www.galwaycathedral.org/recitals
------------------------------------------------------------------
On 06/06/2009 14:37, DimitryASuplatov wrote:
> But then comes the problem because the only command I found to read in
> the file content is COPY but the following command would not work

Yes, this won't work here - COPY is intended for reading an entire table to
or from a disk file, not a single column. It's used for bulk
loading/exporting of data - pg_dump uses it, for example, when backing up a
database.

BTW, "index" is a reserved word, so you shouldn't use it as a column name.
If you *really* have to use it, you'll need to double-quote it every time,
which is a PITA.

> mypdb=# copy pdb (filecontent) from
> '/home/sda/Documents/Work/PVA_India/PGA_test_modeling/pdb/1gm9.pdb'
> where index=1;
> ERROR:  syntax error at or near "where"
> LINE 1: ...ts/Work/PVA_India/PGA_test_modeling/pdb/1gm9.pdb' where inde...
>
> QUESTION: what is the command to read the content of a plain text file
> into a SPECIFIED table entry?

I don't think you can, directly in the database. Instead, write a short
script which reads the file contents into a string and then issues an SQL
query to INSERT a new row or UPDATE an existing one. In PHP, for example
(untested):

// Get the file contents.
$filename = 'my_file';
$h = fopen($filename, 'r');
$lines = fread($h, filesize($filename));
fclose($h);

// Write the contents to the database.
$sql = 'insert into pdb("index", filename, filecontent) values ($1, $2, $3)';
$conn = pg_connect([..your connection string..]);
pg_query_params($conn, $sql, array(1, $filename, $lines));

HTH,

Ray.

------------------------------------------------------------------
Raymond O'Donnell, Director of Music, Galway Cathedral, Ireland
rod@iol.ie
Galway Cathedral Recitals: http://www.galwaycathedral.org/recitals
------------------------------------------------------------------
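The UPDATE case - filling in filecontent for a row that already exists,
which is what the question above asks for - is much the same. A rough,
untested Perl/DBI equivalent of the PHP sketch, using the pdb table from
earlier in the thread:

#!/usr/bin/perl
# Untested sketch: read a file and attach its contents to an existing row
# identified by "index" (double-quoted, since index is a reserved word).
use strict;
use warnings;
use DBI;

my $id   = 1;
my $path = '/home/sda/Documents/Work/PVA_India/PGA_test_modeling/pdb/1gm9.pdb';

open my $fh, '<', $path or die "cannot open $path: $!";
my $content = do { local $/; <$fh> };   # slurp the whole file
close $fh;

my $dbh = DBI->connect('dbi:Pg:dbname=mypdb', '', '', { RaiseError => 1 });
$dbh->do('UPDATE pdb SET filecontent = ? WHERE "index" = ?',
         undef, $content, $id);
$dbh->disconnect;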
Thank you very much. I've also worked out how to do this simply from bash:

./bin/psql mypdb <<EOF
insert into pdb values ('`cat /file/name`');
EOF

SDA

On Sat, 2009-06-06 at 16:32 +0100, Raymond O'Donnell wrote:
> I don't think you can, directly in the database. Instead, write a short
> script which reads the file contents into a string and then issues an
> SQL query to INSERT a new row or UPDATE an existing one.
Thank you for answering. I've read the manual and now I have a more advanced
question.

1/ I've created a table in the database:

mypdb=# create table pdb( index int, filename text, filecontent text );

2/ Then I want to read a file into it.

First I insert the metainfo:

mypdb=# insert into pdb (index, filename) values (1,
'/home/sda/Documents/Work/PVA_India/PGA_test_modeling/pdb/1gm9.pdb');
INSERT 0 1

But then comes the problem, because the only command I found to read in the
file content is COPY, and the following command would not work:

mypdb=# copy pdb (filecontent) from
'/home/sda/Documents/Work/PVA_India/PGA_test_modeling/pdb/1gm9.pdb'
where index=1;
ERROR:  syntax error at or near "where"
LINE 1: ...ts/Work/PVA_India/PGA_test_modeling/pdb/1gm9.pdb' where inde...

QUESTION: what is the command to read the content of a plain text file into
a SPECIFIED table entry?

Thank you for your time.
SDA
* DimitryASuplatov:

> I've also worked out how to do this simply from bash
>
> ./bin/psql mypdb <<EOF
> insert into pdb values ('`cat /file/name`');
> EOF

This doesn't work if the file contains embedded "'" characters (and
backslashes and NULs are also problematic). You will also get errors if the
file encoding does not match the database encoding.

You probably should use a BYTEA column and a little Perl script which uses
bind_param to specify a type of PG_BYTEA for the parameter.
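For example, roughly like this (untested; it assumes filecontent has been
changed to a BYTEA column as suggested):

#!/usr/bin/perl
# Untested sketch of the approach described above: read the file as raw
# bytes and bind the parameter as PG_BYTEA, so quotes, backslashes, NULs
# and encoding are no longer a concern.
use strict;
use warnings;
use DBI;
use DBD::Pg qw(:pg_types);

my $path = shift @ARGV or die "usage: $0 filename\n";

open my $fh, '<:raw', $path or die "cannot open $path: $!";
my $data = do { local $/; <$fh> };   # slurp the raw bytes
close $fh;

my $dbh = DBI->connect('dbi:Pg:dbname=mypdb', '', '', { RaiseError => 1 });
my $sth = $dbh->prepare(
    'INSERT INTO pdb (filename, filecontent) VALUES (?, ?)');
$sth->bind_param(1, $path);
$sth->bind_param(2, $data, { pg_type => PG_BYTEA });
$sth->execute;
$dbh->disconnect;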
> 1/ Is it possible?
> 2/ Could you give me some quick tips on how to manage it from the start
>    so that I knew what to look for in the manual?

Not sure how much you know about programming, but the easiest will probably
be to have a small application. Here is some code from the Npgsql library
documentation that shows how to do it in C#:

using System;
using System.Data;
using Npgsql;
using System.IO;

public class t
{
    public static void Main(String[] args)
    {
        NpgsqlConnection conn = new NpgsqlConnection(
            "server=localhost;user id=npgsql_tests;password=npgsql_tests");
        conn.Open();

        // Read the file into a byte array.
        FileStream fs = new FileStream(args[0], FileMode.Open, FileAccess.Read);
        BinaryReader br = new BinaryReader(new BufferedStream(fs));
        Byte[] bytes = br.ReadBytes((Int32)fs.Length);
        Console.WriteLine(fs.Length);
        br.Close();
        fs.Close();

        // Store it in a bytea column via a parameter.
        NpgsqlCommand command = new NpgsqlCommand(
            "insert into tableBytea(field_bytea) values(:bytesData)", conn);
        NpgsqlParameter param = new NpgsqlParameter(":bytesData", DbType.Binary);
        param.Value = bytes;
        command.Parameters.Add(param);
        command.ExecuteNonQuery();

        // Read it back and write it out to a new file on disk.
        command = new NpgsqlCommand(
            "select field_bytea from tableBytea " +
            "where field_serial = (select max(field_serial) from tableBytea);",
            conn);
        Byte[] result = (Byte[])command.ExecuteScalar();

        fs = new FileStream(args[0] + "database", FileMode.Create,
                            FileAccess.Write);
        BinaryWriter bw = new BinaryWriter(new BufferedStream(fs));
        bw.Write(result);
        bw.Flush();
        bw.Close();
        fs.Close();

        conn.Close();
    }
}

HTH,

Johan Nel
Pretoria, South Africa.
Hi,

On 6 June 09 at 12:41, DimitryASuplatov wrote:
> My task is to store a lot (10^5) of small (<10 MB) text files in the
> database with the ability to restore them back to the hard drive on
> demand.

The following article deals specifically with files containing XML, but it
goes as far as explaining exactly what to change to make it more general:

http://blog.rhodiumtoad.org.uk/2009/02/05/reading-xml-files-into-the-database/

Regards,
-- 
dim
DimitryASuplatov wrote:
> My task is to store a lot (10^5) of small (<10 MB) text files in the
> database with the ability to restore them back to the hard drive on
> demand.
>
> That means that I need two functions. First - grab file from the
> directory, store it in the database and delete from the disk; second -
> recreate it back to the disk....
>
> 2/ Could you give me some quick tips on how to manage it from the start
>    so that I knew what to look for in the manual?

Others have pointed out some of the scripting methods to use. Some other
things to consider:

1. Give some thought to character sets. It's likely that UTF8 on the backend
will be fine, but there exists the possibility that you will have to set the
appropriate encoding on the client side, depending on the nature of the
files you will be reading.

2. Similarly, think about language. I don't know if you need to do searches
based on the contents of your text fields, but you should be aware that
PostgreSQL has some very nice text-search functions. Take a moment to scan
those docs to see if there is anything that may be of use:
http://www.postgresql.org/docs/8.3/interactive/textsearch.html

3. Tell us more about the app. How does data flow in/out of the system? Are
records updated or static? Are they retained indefinitely or purged? If
purged, on what basis?

Let's say, for example, that you are putting the data in for archival
purposes and the records should be purged monthly after 5 years. In that
case, you should consider including some sort of time indicator in your data
and partitioning your table into month-sized tables. Partitioning uses
PostgreSQL's inheritance feature, allowing you to have a single parent table
with no data and multiple child tables that contain chunks of your data; so
every month you could create a new child table to take in the new files for
the month, and drop the table containing the data you want to purge. Or you
could have the table partitioned by users, projects or whatever is
appropriate for your app. Read up on inheritance and partitioning at
http://www.postgresql.org/docs/8.3/interactive/ddl-inherit.html and
http://www.postgresql.org/docs/8.3/interactive/ddl-partitioning.html.

If you are able to partition into static chunks and changing chunks, you may
be able to create a much more efficient backup configuration wherein you
only regularly back up the changing piece of data.

Cheers,
Steve
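To illustrate the monthly-partition idea (a hypothetical sketch only - the
parent table the_files, its stored_at timestamp column and the naming scheme
are all made up, not part of the advice above):

#!/usr/bin/perl
# Hypothetical sketch: create next month's child table for a parent table
# the_files(..., stored_at timestamp). Purging a month later is then just a
# DROP TABLE of the corresponding child.
use strict;
use warnings;
use DBI;

my ($year, $month) = (2009, 7);
my ($ny, $nm) = $month == 12 ? ($year + 1, 1) : ($year, $month + 1);

my $child = sprintf 'the_files_%04d_%02d', $year, $month;
my $lower = sprintf '%04d-%02d-01', $year, $month;
my $upper = sprintf '%04d-%02d-01', $ny, $nm;

my $dbh = DBI->connect('dbi:Pg:dbname=mypdb', '', '', { RaiseError => 1 });
$dbh->do(<<SQL);
CREATE TABLE $child (
    CHECK (stored_at >= DATE '$lower' AND stored_at < DATE '$upper')
) INHERITS (the_files)
SQL
$dbh->disconnect;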
DimitryASuplatov wrote:
> My task is to store a lot (10^5) of small (<10 MB) text files in the
> database with the ability to restore them back to the hard drive on
> demand.

I cannot but ask the community a related question here: can such a design,
that is, storing quite large objects of varying size in a PostgreSQL
database, be a good idea in the first place?

I used to believe that what an RDBMS was really good at was storing a huge
number of relations, each of a small and mostly uniform size if expressed in
bytes; but today people tend to put big things, e.g. email or files, in
relational databases because it's convenient for them. That's absolutely
normal, as the typical data objects we have to deal with keep growing in
size, but how well can databases stand the pressure? And can't it still be
better to store large things as plain files and put just their names in the
database? File systems were designed for that kind of job after all, unlike
an RDBMS.

Thanks!

Yar
Yaroslav Tykhiy wrote:
> Can such design, that is, storing quite large objects of varying size in
> a PostgreSQL database, be a good idea in the first place? ... And can't
> it still be better to store large things as plain files and put just
> their names in the database?

I'd have to vote yes, it's a good idea. It offers consistency that you can't
get with the file system.

We store assessor information in a database along with a photo of the house.
If the photos were on the file system, what's to prevent someone from
blowing away the folder? (Our network admin loves to find big folders taking
up lots of disk space and blow them away.) Sure, you say, they can "delete
from photos", but that's different (and because our network admin does not
know SQL). If they delete a file, I'll still have a photo record and know
they used to have a photo, and I know the path to the photo, but there is no
photo there. But with the photo in the database, if there is no photo, there
is no record either.

If you are looking for speed, yea, I'd say store it on the file system. But
we are looking for convenience and consistency.

-Andy
2009/6/6 DimitryASuplatov <genesup@gmail.com>:
> My task is to store a lot (10^5) of small ( <10 MB) text files in the
> database with the ability to restore them back to the hard drive on
> demand.
>
> That means that I need two functions. First - grab file from the
> directory, store it in the database and delete from the disk; second -
> recreate it back to the disk.
>
> 1/ Is it possible?
> 2/ Could you give me some quick tips on how to manage it from the start
>    so that I knew what to look for in the manual?

You could use the function pg_read_file(text, bigint, bigint), like:

insert into foo values (1, 2, 3 [...], pg_read_file(filename, 1, [?]));

(Note that pg_read_file is restricted to superusers and to files under the
data directory.) The problem is writing it back to disk; there is no
function for that AFAIK.

-- 
Emanuel Calvo Franco
ArPUG [www.arpug.com.ar] / AOSUG Member
www.emanuelcalvofranco.com.ar
This is a recurring debate, and there are pros and cons on both sides. It
usually comes down to whether you need transactional guarantees for these
large objects.

There are also practical concerns. Transferring these large objects over a
single database TCP connection limits the application performance a lot, and
the database is often more heavyweight than you really want to keep tied up
just to serve up images.

Also, it makes backups a pain, since it's a lot easier to back up a file
system than a database. But that gets back to whether you need transactional
guarantees. The reason it's a pain to back up a database is precisely
because it needs to make those guarantees.
If I had an admin roaming through my document server deleting document files
out from under my database, that's a problem I would solve very
quickly--with a completely non-technical "solution".

After all, what's to prevent such a person from deleting pgsql data files???

-- 
Scott Ribe
scott_ribe@killerbytes.com
http://www.killerbytes.com/
(303) 722-0567 voice
Scott Ribe wrote:
> If I had an admin roaming through my document server deleting document
> files out from under my database, that's a problem I would solve very
> quickly--with a completely non-technical "solution".
>
> After all, what's to prevent such a person from deleting pgsql data
> files???

Yea, true. I don't actually have a rogue admin cleaning up too much. My
point was that it's "less accessible" and "feels safer" when the photos are
in the database.

-Andy
On Friday 12 June 2009, Greg Stark <gsstark@mit.edu> wrote:
> Also, it makes backups a pain since it's a lot easier to back up a
> file system than a database. But that gets back to whether you need
> transactional guarantees. The reason it's a pain to back up a database
> is precisely because it needs to make those guarantees.

It's far easier to back up and restore a database than millions of small
files. Small files = random disk I/O. The real downside is the CPU time
involved in storing and retrieving the files. If it isn't a show stopper,
then putting them in the database makes all kinds of sense.

-- 
WARNING: Do not look into laser with remaining eye.
> It's far easier to back up and restore a database than millions of small
> files. Small files = random disk I/O. The real downside is the CPU time
> involved in storing and retrieving the files. If it isn't a show stopper,
> then putting them in the database makes all kinds of sense.

On the contrary, I think backup is one of the primary reasons to move files
*out* of the database. Decent incremental backup software greatly reduces
the I/O and time needed to back up files as compared to a pg_dump. (Of
course this assumes the managed files are long-lived.)

-- 
Scott Ribe
scott_ribe@killerbytes.com
http://www.killerbytes.com/
(303) 722-0567 voice
On Friday 12 June 2009, Scott Ribe <scott_ribe@killerbytes.com> wrote:
> On the contrary, I think backup is one of the primary reasons to move
> files *out* of the database. Decent incremental backup software greatly
> reduces the I/O and time needed to back up files as compared to a
> pg_dump. (Of course this assumes the managed files are long-lived.)

We'll have to just disagree on that. You still have to do level 0 backups
occasionally. Scanning a directory tree of millions of files to decide what
to back up for an incremental can take forever. And restoring millions of
small files can take days.

But I concede there are good arguments for the filesystem approach;
certainly it's not a one-size-fits-all problem. If your files are mostly
bigger than a few MB each, then the filesystem approach is probably better.
And of course big database tables get unwieldy too, for indexing and
vacuuming - I wouldn't necessarily put most files into the large object
interface, just the ones too big to want to fetch all in one piece.

-- 
WARNING: Do not look into laser with remaining eye.
On Fri, 2009-06-12 at 19:53 +1000, Yaroslav Tykhiy wrote:
> Can such design, that is, storing quite large objects of varying size in
> a PostgreSQL database, be a good idea in the first place? ... And can't
> it still be better to store large things as plain files and put just
> their names in the database? File systems were designed for such kind of
> job after all, unlike RDBMS.

It depends a great deal on what you need.

Using the DB allows you to access that data using the same tools, methods,
connections, and security credentials you use for your other data. It also
allows you to manage it in the same transactional environment, and verify
its consistency.

Using the file system can be faster and offers a wider variety of methods
for accessing and manipulating the data. It can be easier to back up
efficiently (differential/incremental backups etc), may take up less space,
and more.

To me, a nearly ideal option would be a file system that supported
transactional operations and two-phase commit. You could work with your data
objects normally in the FS (at least for read access), but you could _ALSO_
modify them in tandem with the DB:

- Begin DB trans
- Begin FS trans
... do your work ...
- prepare db trans for commit
- prepare fs trans for commit
- commit db trans
- commit fs trans

A transaction manager could hide those from you, or the DB could take care
of the FS 2PC as part of its own LOB and transaction management. In fact, I
think that's how Microsoft have done it with transactional NTFS integration
in MS SQL Server, which I must say sounds awfully nice.

-- 
Craig Ringer
On Fri, 2009-06-12 at 09:07 -0700, Alan Hodgson wrote:
> On Friday 12 June 2009, Scott Ribe <scott_ribe@killerbytes.com> wrote:
> > > It's far easier to back up and restore a database than millions of
> > > small files. Small files = random disk I/O.

That depends on how you're backing up. If you want to back a file system up
database-style, use the filesystem dump utility. FS dump tools have gone
somewhat out of fashion lately because of space-use concerns, the
inconvenience of partial restores, cross-hardware/version compatibility
issues, etc., but they're actually really rather similar to the result you
get when backing up a DB like Pg with an fs-level snapshot. If your dump
tool supports incrementals, you also get results rather akin to PITR WAL
logging.

Personally, there's no way I'd back up a filesystem with dump utilities. I
don't trust even dumpe2fs enough, the space requirements are prohibitive,
and restores are nightmarish. I have similar problems (minus the trust
issue) with backups of databases full of LOBs, though.

> > On the contrary, I think backup is one of the primary reasons to move
> > files *out* of the database. Decent incremental backup software greatly
> > reduces the I/O and time needed to back up files as compared to a
> > pg_dump. (Of course this assumes the managed files are long-lived.)
>
> We'll have to just disagree on that. You still have to do level 0 backups
> occasionally. Scanning a directory tree of millions of files to decide
> what to back up for an incremental can take forever.

In particular, if your incremental backup tool is smart enough to track
deleted files, the resource requirements can be astonishing. In addition to
looking for new/altered files, the tool needs to test whether any previously
backed-up file has since vanished - and it can't really even rely on
directory modification times to exclude very static data from checking.

I use Bacula at work, and backing up my Cyrus mail spools is a very
I/O-intensive and somewhat CPU-intensive operation even for incrementals,
since the backup server and file daemon are exchanging file lists all the
time, scanning the whole huge directory tree, etc. The advantage, though, is
that the resulting daily backups are only about 1 GB instead of 60-70 GB.
When you have to keep three full monthly backups plus daily incrementals for
audit/history purposes, that matters.

I can't see any way I could reasonably achieve a similar effect if I stored
my mail in an RDBMS. Not, at least, and still have acceptable performance in
the RDBMS.

-- 
Craig Ringer
On Jun 12, 2009, at 11:53 AM, Yaroslav Tykhiy wrote:
> Can such design, that is, storing quite large objects of varying size in
> a PostgreSQL database, be a good idea in the first place? ... And can't
> it still be better to store large things as plain files and put just
> their names in the database? File systems were designed for such kind of
> job after all, unlike RDBMS.

I've been thinking about this exact same problem. There's another drawback
to storing files in the database, BTW: they're not directly accessible from
the file system.

To illustrate, I was looking into storing images for a website in the
database. It's much easier if those images are available to the web server
directly, instead of having to go through a script that reads the image file
from the database and streams the bytes to the client.

What I came up with was to create a file-system layer that needs to go
through the database to be able to manipulate files. It's still a file
system, so the files are available, but the database gets to check its
constraints against those operations as well, and can throw an error that
prevents the file-system operation from being performed. Apparently
something like this shouldn't be too hard to implement using FuseFS.

Alban Hertroys

-- 
If you can't see the forest for the trees,
cut the trees and you'll see there is no forest.