Thread: Storing files: 2.3TBytes, 17M file count
Hi,

PostgreSQL is rock solid and one of the most reliable parts of our toolchain. Thank you.

Up to now we don't store files in PostgreSQL. I was told that you must not do this ... but that was 20 years ago.

I have 2.3 TB of files; the file count is 17M.

Up to now we use rsync (via rsnapshot) to back up our data. But it takes longer and longer for rsync to detect the changes: rsync checks many files, yet daily only very few files really change. More than 99.9% don't.

Since we already store our structured data in Postgres, I am thinking about storing the files in PostgreSQL, too.

What is the current state of the art?

Is it feasible to store files in PostgreSQL?

Are there already projects which use PostgreSQL as a storage backend?

My hope is that it would be easier to back up only the files which changed.

Regards,
Thomas Güttler

Related question at the rsnapshot mailing list:
https://sourceforge.net/p/rsnapshot/mailman/rsnapshot-discuss/thread/57A1A2F3.5090409@thomas-guettler.de/

--
Thomas Guettler http://www.thomas-guettler.de/
On 11/28/2016 06:28 AM, Thomas Güttler wrote:
> Up to now we use rsync (via rsnapshot) to backup our data.
>
> But it takes longer and longer for rsync to detect
> the changes. Rsync checks many files. But daily only
> very few files really change. More than 99.9% don't.

Are you rsyncing over all the files at one time? Or do you break it down into segments over the day?

> What is the current state of the art?

I don't know.

> Is it feasible to store files in PostgreSQL?

Yes, you can store a file in Postgres. Still, I am not sure that stuffing 17M files into Postgres is going to perform any better than dealing with them on the file system. In fact, in Postgres they would still be on the file system, just with an extra layer above them.

> Are there already projects which use PostgreSQL as a storage backend?

The closest I remember is Bacula:

http://blog.bacula.org/documentation/documentation/

It uses a hybrid solution where the files are stored on a file server and data about the files is stored in a database. Postgres is one of the database backends it can work with.

> My hope is that it would be easier to back up only the files which changed.

Backup to where and how? Are you thinking of using replication?

--
Adrian Klaver
adrian.klaver@aklaver.com
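To make Adrian's "yes, you can store a file in Postgres" concrete, here is a minimal sketch of the bytea route, assuming psycopg2; the connection string, table name, and file path are hypothetical placeholders, not anything prescribed in the thread:

```python
# Minimal sketch: store one file as a bytea row via psycopg2.
# Connection string, table layout and path are hypothetical.
import psycopg2

conn = psycopg2.connect("dbname=filestore")
with conn, conn.cursor() as cur:
    cur.execute("""
        CREATE TABLE IF NOT EXISTS files (
            path       text PRIMARY KEY,
            data       bytea NOT NULL,
            changed_at timestamptz NOT NULL DEFAULT now()
        )
    """)
    path = "/srv/files/report.pdf"
    with open(path, "rb") as f:
        cur.execute(
            """
            INSERT INTO files (path, data) VALUES (%s, %s)
            ON CONFLICT (path)
            DO UPDATE SET data = EXCLUDED.data, changed_at = now()
            """,
            (path, psycopg2.Binary(f.read())),
        )
# Leaving the `with conn` block commits the transaction.
```

The changed_at column is only there to hint at the "back up only what changed" idea discussed later in the thread.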
Thomas Güttler wrote:
> Up to now we use rsync (via rsnapshot) to backup our data.
>
> But it takes longer and longer for rsync to detect
> the changes. Rsync checks many files. But daily only
> very few files really change. More than 99.9% don't.

lsyncd+rsync has worked nicely for me on Linux in such cases, as opposed to rsync alone, which is indeed very slow with large trees. Check out https://github.com/axkibe/lsyncd

If you think of using Postgres large objects, be aware that they are stored in a single table (pg_largeobject), sliced as rows of 1/4 block in size each (typically 2048 bytes). 2.3 TB in a single database would mean more than 1.2 billion rows in that table, and as a system table it can't be partitioned or moved to another tablespace.

OTOH, with large objects, files can be stored and retrieved easily between client and server with efficient built-in functions at both ends. In particular, they don't need the binary<->text conversions or large memory allocations mentioned by Chris Travers upthread, which may happen when writing your own methods with bytea columns.

But for the amount of data you have, the monolithic pg_largeobject would likely be problematic.

Ideally there should be an extension implementing something like DATALINK (SQL99), with external storage. I wonder if an extension could provide custom WAL records replicating content changes to the external storage of a standby. That would be awesome.

Best regards,

--
Daniel Vérité
PostgreSQL-powered mailer: http://www.manitou-mail.org
Twitter: @DanielVerite
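A minimal sketch of the large-object route Daniel describes, assuming psycopg2 (its lobject wrapper streams data through the lo_* protocol, so no text escaping is involved); the connection string, the path, and keeping the OID in a metadata table of your own are all assumptions for illustration:

```python
# Sketch: store and retrieve a file as a PostgreSQL large object.
# The payload ends up sliced into pg_largeobject rows server-side.
import psycopg2

conn = psycopg2.connect("dbname=filestore")

with conn:
    # oid=0 asks the server to allocate a new large object.
    lob = conn.lobject(0, "wb")
    with open("/srv/files/report.pdf", "rb") as f:
        lob.write(f.read())
    oid = lob.oid            # keep this OID in your own metadata table
    lob.close()

with conn:
    # Read it back later by OID.
    lob = conn.lobject(oid, "rb")
    data = lob.read()
    lob.close()
```

As Daniel notes, this is convenient for individual files, but at 2.3 TB the single pg_largeobject table becomes the bottleneck.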
On Mon, 28 Nov 2016 15:28:28 +0100
Thomas Güttler <guettliml@thomas-guettler.de> wrote:
> Up to now we use rsync (via rsnapshot) to backup our data.
>
> But it takes longer and longer for rsync to detect
> the changes. Rsync checks many files. But daily only
> very few files really change. More than 99.9% don't.
>
> What is the current state of the art?
>
> Is it feasible to store files in PostgreSQL?

Yes and no. It adds another level of indirection, so it is slower than a pure filesystem solution.

By default rsync compares file size and modification time and skips files where neither has changed; a content checksum is only computed when those differ (or for every file with the -c option). Even so, it still has to stat every file in the tree to make that decision, which is what gets slow with 17M files.

> Are there already projects which use PostgreSQL as a storage backend?
>
> My hope is that it would be easier to back up only the files
> which changed.

Rsync already tries to back up only the files that changed. There are other tools like cpdup (I don't know whether it has been ported to Linux); it is similar to rsync. You could also use a p2p system: unlike ftp or rsync, those store a full tree hash (often a Tiger Tree Hash) of the file content to allow multi-peer copies.

---
Eduardo Morras <emorrasg@yahoo.es>
Thomas Güttler <guettliml@thomas-guettler.de> wrote:
> I have 2.3TBytes of files. File count is 17M
>
> But it takes longer and longer for rsync to detect the changes. Rsync checks many files. But daily only very few files really change. More than 99.9% don't.
>
> Is it feasible to store files in PostgreSQL?
>
> My hope is that it would be easier to back up only the files which changed.

There are a lot of "it depends" in answering that. :)

The biggest question is: what kind of files are they, text or binary data? If they're text, you could compress them and store them in the DB. I did that successfully on one project with XML, as that was the fastest option in our benchmarking of all combinations of in-DB vs. on the file system, compressed or not. Then again, our files were 1-10KB each. If I did the math right, your files average ~135KB each. That argues for leaving them on the file system and storing just the name.

Have you considered adding columns for the filename and a timestamp, then having a scheduled task that selects the files with a timestamp more recent than the last transfer run and rsync'ing (or whatever tool you want) only those? That's mostly application code rather than DB code, but I'd think it would be the easiest and fastest, all things considered.

HTH,
Kevin
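A sketch of the timestamp-driven approach Kevin suggests, under the assumption of a hypothetical file_catalog table (relpath, changed_at) maintained by the application, with rsync's --files-from doing the copying; paths, table names, and the destination host are illustrative:

```python
# Sketch: hand only recently changed paths to rsync instead of walking
# the whole 17M-file tree. file_catalog, paths and DEST are hypothetical.
# Note: --files-from paths are relative to the source directory.
import subprocess
import tempfile
import psycopg2

SRC = "/srv/files/"
DEST = "backuphost:/srv/backup/"

conn = psycopg2.connect("dbname=filestore")
with conn, conn.cursor() as cur:
    cur.execute(
        "SELECT relpath FROM file_catalog "
        "WHERE changed_at > now() - interval '1 day'"
    )
    changed = [row[0] for row in cur.fetchall()]

with tempfile.NamedTemporaryFile("w", suffix=".list", delete=False) as tmp:
    tmp.write("\n".join(changed) + "\n")
    listfile = tmp.name

# rsync now only stats and copies the listed files.
subprocess.run(
    ["rsync", "-a", "--files-from=" + listfile, SRC, DEST],
    check=True,
)
```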
From: Thomas Güttler Sent: Monday, November 28, 2016 6:28 AM
...I have 2.3TBytes of files. File count is 17M
Since we already store our structured data in postgres, I think about storing the files in PostgreSQL, too.
Is it feasible to store file in PostgreSQL?
-------
I am doing something similar, but in reverse. The legacy MySQL databases I’m converting into a modern Postgres data model have very large genomic strings stored in 3 separate columns. Out of the 25 TB of legacy data storage (in 800 dbs across 4 servers, about 22b rows), those 3 columns consume 90% of the total space, and they are only used for reference, never in searches or calculations. They range from 1k to several MB.
Since I am collapsing all 800 dbs into a single PG db, being very smart about storage was critical. Since we’re also migrating everything to AWS, we’re placing those 3 strings (per row) into a single JSON document and storing the document in S3 buckets, with the pointer to the file being the globally unique PK for the row…super simple. The app tier knows to fetch the data from the db and the large-string JSON from the S3 buckets. The retrieval time is surprisingly fast; this is all real-time web app stuff.
This is a model that could work for anyone dealing with large objects (text or binary). The nice part is that the original 25TB of data storage drops to 5TB, a much more manageable number, allowing for significant growth, which is on the horizon.
Mike Sofen (Synthetic Genomics USA)
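A minimal sketch of the pattern Mike describes (the database row keeps only a key, the large payload lives in S3 under that key), assuming boto3 with credentials already configured; the bucket name and UUID key scheme are illustrative, and the returned key is what the database row would store:

```python
# Sketch: payload lives in S3, the database stores only the key.
# Bucket name and key scheme are hypothetical.
import uuid
import boto3

s3 = boto3.client("s3")
BUCKET = "my-file-store"

def put_payload(local_path: str) -> str:
    """Upload a file and return the key to store as the row's pointer."""
    key = str(uuid.uuid4())
    s3.upload_file(local_path, BUCKET, key)
    return key

def get_payload(key: str, local_path: str) -> None:
    """Fetch the payload back using the key stored in the database row."""
    s3.download_file(BUCKET, key, local_path)
```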
On 28.11.2016 at 16:01, Adrian Klaver wrote:
> Are you rsyncing over all the files at one time?

Yes, we rsync every night.

> Or do you break it down into segments over the day?

No, up to now it is one rsync run.

> The closest I remember is Bacula:
>
> http://blog.bacula.org/documentation/documentation/
>
> It uses a hybrid solution where the files are stored on a file server and data about the files is stored in a database.
> Postgres is one of the database backends it can work with.

I had heard of Bacula, but I was not aware that it can use Postgres for the metadata.

> Backup to where and how?
> Are you thinking of using replication?

No, replication is not the current issue. Plain old backup is my current issue.

Backup where and how? ... That's what this question is about :-)

--
Thomas Guettler http://www.thomas-guettler.de/
We reached the limit of inotify. Was: Storing files: 2.3TBytes, 17M file count
On 28.11.2016 at 17:43, Daniel Verite wrote:
> lsyncd+rsync has worked nicely for me on Linux in such cases,
> as opposed to rsync alone, which is indeed very slow with large
> trees. Check out https://github.com/axkibe/lsyncd

We reached the limit where inotify does not work any more. At least that's what a teammate said. There are too many directories :-( AFAIK lsyncd uses inotify.

> If you think of using Postgres large objects, be aware that they
> are stored in a single table (pg_largeobject), sliced
> as rows of 1/4 block in size each (typically 2048 bytes).
> 2.3 TB in a single database would mean more than 1.2 billion
> rows in that table, and as a system table it can't be partitioned
> or moved to another tablespace.
>
> Ideally there should be an extension implementing something like
> DATALINK (SQL99), with external storage. I wonder if an extension
> could provide custom WAL records replicating content changes to the
> external storage of a standby. That would be awesome.

Thank you for the hints. There is some info about DATALINK here:

https://wiki.postgresql.org/wiki/DATALINK

Regards,
Thomas Güttler

--
Thomas Guettler http://www.thomas-guettler.de/
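For reference, the ceiling Thomas mentions is the kernel's per-user inotify watch budget (each watched directory consumes one watch); a quick way to inspect the current value, noting that raising it via sysctl fs.inotify.max_user_watches needs root and still costs kernel memory per watch:

```python
# Print the current inotify watch limit on a Linux host.
with open("/proc/sys/fs/inotify/max_user_watches") as f:
    print("inotify watch limit:", f.read().strip())
```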
On 29.11.2016 at 01:52, Mike Sofen wrote:
> From: Thomas Güttler Sent: Monday, November 28, 2016 6:28 AM
>
> ...I have 2.3TBytes of files. File count is 17M
>
> Is it feasible to store files in PostgreSQL?
>
> -------
>
> Since I am collapsing all 800 dbs into a single PG db, being very smart about storage was critical. Since we're also
> migrating everything to AWS, we're placing those 3 strings (per row) into a single JSON document and storing the
> document in S3 buckets, with the pointer to the file being the globally unique PK for the row…super simple. The app tier
> knows to fetch the data from the db and the large-string JSON from the S3 buckets. The retrieval time is surprisingly fast;
> this is all real-time web app stuff.
>
> This is a model that could work for anyone dealing with large objects (text or binary). The nice part is that the original
> 25TB of data storage drops to 5TB, a much more manageable number, allowing for significant growth, which is on the horizon.

Thank you Mike for your feedback.

Yes, I think I will drop my idea. Encoding binary (the file content) to text and decoding it to binary again makes no sense. I was not aware that this is needed.

I guess I will use some key-to-blob store like S3. AFAIK there are open source S3 implementations available.

Thank you all for your feedback!

Regards, Thomas

--
Thomas Guettler http://www.thomas-guettler.de/
Thomas Güttler <guettliml@thomas-guettler.de> writes:
> I have 2.3TBytes of files. File count is 17M
>
> Up to now we use rsync (via rsnapshot) to backup our data.

Isn't putting those files into your database going to make any sort of maintenance on your database cumbersome? How big is your database currently? Is it worth growing it a lot just to store files in it? If your database is, e.g., 100 GB, then growing it to 2.4 TB just to store some files is probably not desirable. I would at least avoid it if possible.

We currently have a similar situation with files stored on NFS and shared among a few servers that way, but we are about to decommission the NFS-based storage system, and in that connection I'm looking into using unison to sync files across servers. That could be used for backup purposes as well.

I saw you mentioned somewhere else that you had outgrown inotify. I tested unison with 10M files and it worked just fine. Of course not all in the same directory, because that would surely hurt, not due to inotify but due to how most file systems handle directories with lots of files.

--
Jacob
On 11/29/2016 01:50 AM, Thomas Güttler wrote:
> Yes, I think I will drop my idea. Encoding binary (the file content) to
> text and decoding it to binary again makes no sense. I was not aware that
> this is needed.
>
> I guess I will use some key-to-blob store like S3. AFAIK there are open
> source S3 implementations available.

Just be aware that doing deltas over file changes, as rsync does, while possible, is more convoluted and time/resource consuming with something like S3.

--
Adrian Klaver
adrian.klaver@aklaver.com
On 11/29/2016 01:15 AM, Thomas Güttler wrote:
>> Are you rsyncing over all the files at one time?
>
> Yes, we rsync every night.
>
>> Or do you break it down into segments over the day?
>
> No, up to now it is one rsync run.

Unless everything is in a single directory, it would seem you could break this down into smaller jobs that are spread over the day.

> Backup where and how? ... That's what this question is about :-)

--
Adrian Klaver
adrian.klaver@aklaver.com
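A sketch of the splitting Adrian suggests, assuming a hypothetical source root whose top-level directories can each be synced as an independent job (for example one job per cron slot spread over the day); the source root and destination are placeholders:

```python
# Sketch: one rsync job per top-level directory instead of one big run.
# SRC and DEST are hypothetical.
import os
import subprocess

SRC = "/srv/files"
DEST = "backuphost:/srv/backup"

for entry in sorted(os.listdir(SRC)):
    src_dir = os.path.join(SRC, entry)
    if not os.path.isdir(src_dir):
        continue
    # -a preserves metadata; each call only walks one subtree.
    subprocess.run(["rsync", "-a", src_dir, DEST + "/"], check=True)
```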