Thread: Storing files: 2.3TBytes, 17M file count
Hi,

PostgreSQL is rock solid and one of the most reliable parts of our toolchain. Thank you.

Up to now we don't store files in PostgreSQL. I was told that you must not do this ... but that was 20 years ago.

I have 2.3 TB of files; the file count is 17M.

Up to now we use rsync (via rsnapshot) to back up our data. But it takes longer and longer for rsync to detect the changes: rsync checks many files, yet daily only very few files really change. More than 99.9% don't.

Since we already store our structured data in Postgres, I am thinking about storing the files in PostgreSQL, too.

What is the current state of the art?

Is it feasible to store files in PostgreSQL?

Are there already projects which use PostgreSQL as a storage backend?

My hope is that it would be easier to back up only the files which changed.

Regards,
Thomas Güttler

Related question at the rsnapshot mailing list:
https://sourceforge.net/p/rsnapshot/mailman/rsnapshot-discuss/thread/57A1A2F3.5090409@thomas-guettler.de/

--
Thomas Guettler http://www.thomas-guettler.de/
On 11/28/2016 06:28 AM, Thomas Güttler wrote:
> Up to now we use rsync (via rsnapshot) to backup our data.
>
> But it takes longer and longer for rsync to detect
> the changes. Rsync checks many files. But daily only
> very few files really change. More than 99.9% don't.

Are you rsyncing over all the files at one time? Or do you break it down into segments over the day?

> What is the current state of the art?

I don't know.

> Is it feasible to store files in PostgreSQL?

Yes, you can store a file in Postgres. Still, I am not sure that stuffing 17M files into Postgres is going to perform any better than dealing with them on the file system. In fact, in Postgres they would still be on the file system, just with an extra layer above them.

> Are there already projects which use PostgreSQL as a storage backend?

The closest I remember is Bacula:

http://blog.bacula.org/documentation/documentation/

It uses a hybrid solution where the files are stored on a file server and data about the files is stored in a database. Postgres is one of the database backends it can work with.

> My hope is that it would be easier to back up only the files which changed.

Backup to where and how? Are you thinking of using replication?

--
Adrian Klaver
adrian.klaver@aklaver.com
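To make Adrian's "yes, you can store a file in Postgres" concrete, here is a minimal sketch of the bytea route, assuming psycopg2; the connection string, table name, and file path are hypothetical placeholders, not anything prescribed in the thread:

```python
# Minimal sketch: store one file as a bytea row via psycopg2.
# Connection string, table layout and path are hypothetical.
import psycopg2

conn = psycopg2.connect("dbname=filestore")
with conn, conn.cursor() as cur:
    cur.execute("""
        CREATE TABLE IF NOT EXISTS files (
            path       text PRIMARY KEY,
            data       bytea NOT NULL,
            changed_at timestamptz NOT NULL DEFAULT now()
        )
    """)
    path = "/srv/files/report.pdf"
    with open(path, "rb") as f:
        cur.execute(
            """
            INSERT INTO files (path, data) VALUES (%s, %s)
            ON CONFLICT (path)
            DO UPDATE SET data = EXCLUDED.data, changed_at = now()
            """,
            (path, psycopg2.Binary(f.read())),
        )
# Leaving the `with conn` block commits the transaction.
```

The changed_at column is only there to hint at the "back up only what changed" idea discussed later in the thread.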
Thomas Güttler wrote:
> Up to now we use rsync (via rsnapshot) to backup our data.
>
> But it takes longer and longer for rsync to detect
> the changes. Rsync checks many files. But daily only
> very few files really change. More than 99.9% don't.

lsyncd+rsync has worked nicely for me on Linux in such cases, as opposed to rsync alone, which is indeed very slow with large trees. Check out https://github.com/axkibe/lsyncd

If you think of using Postgres large objects, be aware that they are stored in a single table (pg_largeobject), sliced as rows of 1/4 block in size each (typically 2048 bytes). 2.3 TB in a single database would mean more than 1.2 billion rows in that table, and as a system table it can't be partitioned or moved to another tablespace.

OTOH, with large objects, files can be stored and retrieved easily between client and server with efficient built-in functions at both ends. In particular, they don't need the binary<->text conversions or large memory allocations mentioned by Chris Travers upthread, which may happen when writing your own methods with bytea columns.

But for the amount of data you have, the monolithic pg_largeobject would likely be problematic.

Ideally there should be an extension implementing something like DATALINK (SQL99), with external storage. I wonder if an extension could provide custom WAL records replicating content changes to the external storage of a standby. That would be awesome.

Best regards,

--
Daniel Vérité
PostgreSQL-powered mailer: http://www.manitou-mail.org
Twitter: @DanielVerite
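A minimal sketch of the large-object route Daniel describes, assuming psycopg2 (its lobject wrapper streams data through the lo_* protocol, so no text escaping is involved); the connection string, the path, and keeping the OID in a metadata table of your own are all assumptions for illustration:

```python
# Sketch: store and retrieve a file as a PostgreSQL large object.
# The payload ends up sliced into pg_largeobject rows server-side.
import psycopg2

conn = psycopg2.connect("dbname=filestore")

with conn:
    # oid=0 asks the server to allocate a new large object.
    lob = conn.lobject(0, "wb")
    with open("/srv/files/report.pdf", "rb") as f:
        lob.write(f.read())
    oid = lob.oid            # keep this OID in your own metadata table
    lob.close()

with conn:
    # Read it back later by OID.
    lob = conn.lobject(oid, "rb")
    data = lob.read()
    lob.close()
```

As Daniel notes, this is convenient for individual files, but at 2.3 TB the single pg_largeobject table becomes the bottleneck.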
On Mon, 28 Nov 2016 15:28:28 +0100
Thomas Güttler <guettliml@thomas-guettler.de> wrote:
> Up to now we use rsync (via rsnapshot) to backup our data.
>
> But it takes longer and longer for rsync to detect
> the changes. Rsync checks many files. But daily only
> very few files really change. More than 99.9% don't.
>
> What is the current state of the art?
>
> Is it feasible to store files in PostgreSQL?

Yes and no. It adds another level of indirection, so it is slower than a pure filesystem solution.

By default rsync compares file size and modification time and skips files where neither has changed; a content checksum is only computed when those differ (or for every file with the -c option). Even so, it still has to stat every file in the tree to make that decision, which is what gets slow with 17M files.

> Are there already projects which use PostgreSQL as a storage backend?
>
> My hope is that it would be easier to back up only the files
> which changed.

Rsync already tries to back up only the files that changed. There are other tools like cpdup (I don't know whether it has been ported to Linux); it is similar to rsync. You could also use a p2p system: unlike ftp or rsync, those store a full tree hash (often a Tiger Tree Hash) of the file content to allow multi-peer copies.

---
Eduardo Morras <emorrasg@yahoo.es>
Thomas Güttler <guettliml@thomas-guettler.de> wrote:
> I have 2.3TBytes of files. File count is 17M
>
> But it takes longer and longer for rsync to detect the changes. Rsync checks many files. But daily only very few files really change. More than 99.9% don't.
>
> Is it feasible to store files in PostgreSQL?
>
> My hope is that it would be easier to back up only the files which changed.

There are a lot of "it depends" in answering that. :)

The biggest question is: what kind of files are they, text or binary data? If they're text, you could compress them and store them in the DB. I did that successfully on one project with XML, as that was the fastest option in our benchmarking of all combinations of in-DB vs. on the file system, compressed or not. Then again, our files were 1-10KB each. If I did the math right, your files average ~135KB each. That argues for leaving them on the file system and storing just the name.

Have you considered adding columns for the filename and a timestamp, then having a scheduled task that selects the files with a timestamp more recent than the last transfer run and rsync'ing (or whatever tool you want) only those? That's mostly application code rather than DB code, but I'd think it would be the easiest and fastest, all things considered.

HTH,
Kevin
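A sketch of the timestamp-driven approach Kevin suggests, under the assumption of a hypothetical file_catalog table (relpath, changed_at) maintained by the application, with rsync's --files-from doing the copying; paths, table names, and the destination host are illustrative:

```python
# Sketch: hand only recently changed paths to rsync instead of walking
# the whole 17M-file tree. file_catalog, paths and DEST are hypothetical.
# Note: --files-from paths are relative to the source directory.
import subprocess
import tempfile
import psycopg2

SRC = "/srv/files/"
DEST = "backuphost:/srv/backup/"

conn = psycopg2.connect("dbname=filestore")
with conn, conn.cursor() as cur:
    cur.execute(
        "SELECT relpath FROM file_catalog "
        "WHERE changed_at > now() - interval '1 day'"
    )
    changed = [row[0] for row in cur.fetchall()]

with tempfile.NamedTemporaryFile("w", suffix=".list", delete=False) as tmp:
    tmp.write("\n".join(changed) + "\n")
    listfile = tmp.name

# rsync now only stats and copies the listed files.
subprocess.run(
    ["rsync", "-a", "--files-from=" + listfile, SRC, DEST],
    check=True,
)
```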
From: Thomas Güttler Sent: Monday, November 28, 2016 6:28 AM
...I have 2.3TBytes of files. File count is 17M
Since we already store our structured data in postgres, I think about storing the files in PostgreSQL, too.
Is it feasible to store file in PostgreSQL?
-------
I am doing something similar, but in reverse. The legacy MySQL databases I’m converting into a modern Postgres data model have very large genomic strings stored in 3 separate columns. Out of the 25 TB of legacy data storage (in 800 dbs across 4 servers, about 22b rows), those 3 columns consume 90% of the total space, and they are only used for reference, never in searches or calculations. They range from 1k to several MB.
Since I am collapsing all 800 dbs into a single PG db, being very smart about storage was critical. Since we’re also migrating everything to AWS, we’re placing those 3 strings (per row) into a single JSON document and storing the document in S3 buckets, with the pointer to the file being the globally unique PK for the row…super simple. The app tier knows to fetch the data from the db and the large-string JSON from the S3 buckets. The retrieval time is surprisingly fast; this is all real-time web app stuff.
This is a model that could work for anyone dealing with large objects (text or binary). The nice part is that the original 25TB of data storage drops to 5TB, a much more manageable number, allowing for significant growth, which is on the horizon.
Mike Sofen (Synthetic Genomics USA)
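A minimal sketch of the pattern Mike describes (the database row keeps only a key, the large payload lives in S3 under that key), assuming boto3 with credentials already configured; the bucket name and UUID key scheme are illustrative, and the returned key is what the database row would store:

```python
# Sketch: payload lives in S3, the database stores only the key.
# Bucket name and key scheme are hypothetical.
import uuid
import boto3

s3 = boto3.client("s3")
BUCKET = "my-file-store"

def put_payload(local_path: str) -> str:
    """Upload a file and return the key to store as the row's pointer."""
    key = str(uuid.uuid4())
    s3.upload_file(local_path, BUCKET, key)
    return key

def get_payload(key: str, local_path: str) -> None:
    """Fetch the payload back using the key stored in the database row."""
    s3.download_file(BUCKET, key, local_path)
```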
On 28.11.2016 at 16:01, Adrian Klaver wrote:
> Are you rsyncing over all the files at one time?

Yes, we rsync every night.

> Or do you break it down into segments over the day?

No, up to now it is one rsync run.

> The closest I remember is Bacula:
>
> http://blog.bacula.org/documentation/documentation/
>
> It uses a hybrid solution where the files are stored on a file server and data about the files is stored in a database.
> Postgres is one of the database backends it can work with.

I had heard of Bacula, but I was not aware that it can use Postgres for the metadata.

> Backup to where and how?
> Are you thinking of using replication?

No, replication is not the current issue. Plain old backup is my current issue.

Backup where and how? ... That's what this question is about :-)

--
Thomas Guettler http://www.thomas-guettler.de/
We reached the limit of inotify. Was: Storing files: 2.3TBytes, 17M file count
On 28.11.2016 at 17:43, Daniel Verite wrote:
> lsyncd+rsync has worked nicely for me on Linux in such cases,
> as opposed to rsync alone, which is indeed very slow with large
> trees. Check out https://github.com/axkibe/lsyncd

We reached the limit where inotify does not work any more. At least that's what a teammate said. There are too many directories :-( AFAIK lsyncd uses inotify.

> If you think of using Postgres large objects, be aware that they
> are stored in a single table (pg_largeobject), sliced
> as rows of 1/4 block in size each (typically 2048 bytes).
> 2.3 TB in a single database would mean more than 1.2 billion
> rows in that table, and as a system table it can't be partitioned
> or moved to another tablespace.
>
> Ideally there should be an extension implementing something like
> DATALINK (SQL99), with external storage. I wonder if an extension
> could provide custom WAL records replicating content changes to the
> external storage of a standby. That would be awesome.

Thank you for the hints. There is some info about DATALINK here:

https://wiki.postgresql.org/wiki/DATALINK

Regards,
Thomas Güttler

--
Thomas Guettler http://www.thomas-guettler.de/
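For reference, the ceiling Thomas mentions is the kernel's per-user inotify watch budget (each watched directory consumes one watch); a quick way to inspect the current value, noting that raising it via sysctl fs.inotify.max_user_watches needs root and still costs kernel memory per watch:

```python
# Print the current inotify watch limit on a Linux host.
with open("/proc/sys/fs/inotify/max_user_watches") as f:
    print("inotify watch limit:", f.read().strip())
```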
On 29.11.2016 at 01:52, Mike Sofen wrote:
> From: Thomas Güttler Sent: Monday, November 28, 2016 6:28 AM
>
> ...I have 2.3TBytes of files. File count is 17M
>
> Is it feasible to store files in PostgreSQL?
>
> -------
>
> Since I am collapsing all 800 dbs into a single PG db, being very smart about storage was critical. Since we're also
> migrating everything to AWS, we're placing those 3 strings (per row) into a single JSON document and storing the
> document in S3 buckets, with the pointer to the file being the globally unique PK for the row…super simple. The app tier
> knows to fetch the data from the db and the large-string JSON from the S3 buckets. The retrieval time is surprisingly fast;
> this is all real-time web app stuff.
>
> This is a model that could work for anyone dealing with large objects (text or binary). The nice part is that the original
> 25TB of data storage drops to 5TB, a much more manageable number, allowing for significant growth, which is on the horizon.

Thank you Mike for your feedback.

Yes, I think I will drop my idea. Encoding binary (the file content) to text and decoding it to binary again makes no sense. I was not aware that this is needed.

I guess I will use some key-to-blob store like S3. AFAIK there are open source S3 implementations available.

Thank you all for your feedback!

Regards, Thomas

--
Thomas Guettler http://www.thomas-guettler.de/
Thomas Güttler <guettliml@thomas-guettler.de> writes:
> I have 2.3TBytes of files. File count is 17M
>
> Up to now we use rsync (via rsnapshot) to backup our data.

Isn't putting those files into your database going to make any sort of maintenance on your database cumbersome? How big is your database currently? Is it worth growing it a lot just to store files in it? If your database is, e.g., 100 GB, then growing it to 2.4 TB just to store some files is probably not desirable. I would at least avoid it if possible.

We currently have a similar situation with files stored on NFS and shared among a few servers that way, but we are about to decommission the NFS-based storage system, and in that connection I'm looking into using unison to sync files across servers. That could be used for backup purposes as well.

I saw you mentioned somewhere else that you had outgrown inotify. I tested unison with 10M files and it worked just fine. Of course not all in the same directory, because that would surely hurt, not due to inotify but due to how most file systems handle directories with lots of files.

--
Jacob
On 11/29/2016 01:50 AM, Thomas Güttler wrote:
> Yes, I think I will drop my idea. Encoding binary (the file content) to
> text and decoding it to binary again makes no sense. I was not aware that
> this is needed.
>
> I guess I will use some key-to-blob store like S3. AFAIK there are open
> source S3 implementations available.

Just be aware that doing deltas over file changes, as rsync does, while possible, is more convoluted and time/resource consuming with something like S3.

--
Adrian Klaver
adrian.klaver@aklaver.com
On 11/29/2016 01:15 AM, Thomas Güttler wrote:
>> Are you rsyncing over all the files at one time?
>
> Yes, we rsync every night.
>
>> Or do you break it down into segments over the day?
>
> No, up to now it is one rsync run.

Unless everything is in a single directory, it would seem you could break this down into smaller jobs that are spread over the day.

> Backup where and how? ... That's what this question is about :-)

--
Adrian Klaver
adrian.klaver@aklaver.com
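A sketch of the splitting Adrian suggests, assuming a hypothetical source root whose top-level directories can each be synced as an independent job (for example one job per cron slot spread over the day); the source root and destination are placeholders:

```python
# Sketch: one rsync job per top-level directory instead of one big run.
# SRC and DEST are hypothetical.
import os
import subprocess

SRC = "/srv/files"
DEST = "backuphost:/srv/backup"

for entry in sorted(os.listdir(SRC)):
    src_dir = os.path.join(SRC, entry)
    if not os.path.isdir(src_dir):
        continue
    # -a preserves metadata; each call only walks one subtree.
    subprocess.run(["rsync", "-a", src_dir, DEST + "/"], check=True)
```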