Re: Storing files: 2.3TBytes, 17M file count - Mailing list pgsql-general

From
Subject Re: Storing files: 2.3TBytes, 17M file count
Date
Msg-id 20161128124700.9BE5112B@m0087798.ppops.net
Whole thread Raw
In response to Storing files: 2.3TBytes, 17M file count  (Thomas Güttler <guettliml@thomas-guettler.de>)
List pgsql-general
Thomas Güttler <guettliml@thomas-guettler.de> wrote:
> 
> Up to now, we don't store files in PostgreSQL.
> I was told, that you must not do this .... But this was 20 years ago.
> I have 2.3TBytes of files. File count is 17M
> Up to now we use rsync (via rsnapshot) to backup our data.
> But it takes longer and longer for rsync to detect the changes. Rsync checks many files. But daily only very few
filesreally change. More than 99.9% don't.
 
> Since we already store our structured data in postgres, I think about storing the files in PostgreSQL, too.
> What is the current state of the art?
> Is it feasible to store file in PostgreSQL?
> Are there already projects which use PostgreSQL as storage backend?
> I have the hope, that it would be easier to backup only the files which changed.

There are a lot of "it depends" in answering that. :)

The biggest question is "What kind of files are they?" Text or binary data?
If they're text, you could compress them and store them in
the DB (I did that successfully on 1 project with XML as that was the fastest
answer from our benchmarking of all combinations of in DB, on the file system,
compressed or not). Then again, our files were from 1-10KB each.

If I did the math right, your files average ~135GB each. That argues for leaving
them on the file system and storing the name. Have you considered adding data
for the filename and a timestamp, then having a scheduled task that selects
the files with a timestamp more recent than the last time the transfer ran
and rsync'ing (or whatever tool you want) only those? That's mostly application code
and not DB code, but I'd think that'd be the easiest and fastest all things
considered.

HTH,
Kevin

pgsql-general by date:

Previous
From: greigwise
Date:
Subject: Re: Query with large in clauses uses a lot of memory
Next
From: said assemlal
Date:
Subject: FOR UPDATE