Thread: incremental dumps
Hi,

I want to store copies of our data on a remote machine as a security measure. My first attempt was a full dump (which takes too long to upload) followed by diffs between the pg_dump files. This provides readable, searchable, versioned data: I could always apply the diffs on the remote machine and search the text file, without having an instance of Postgres running on that machine.

However, the diff files seem to be considerably larger than one would expect. One obvious part of the problem is the fact that diff shows old and new text, so e.g. changing the amount of stock for a product with a 1 kB description would generate at least 2 kB of text in the diff file.

What would be a better approach?

Best regards
Wolfgang
On Thu, Aug 1, 2013 at 10:59 AM, <hamann.w@t-online.de> wrote:
> However, the diff files seem to be considerably larger than one would expect.
> One obvious part of the problem is the fact that diff shows old and new text,
> so e.g. changing the amount of stock for a product with a 1kB description
> would generate at least 2kb of text in the diff file.
>
> What would be a better approach?

I suppose WAL archiving or PITR would be better, but assuming you want text files, I guess you need to change your database structure to either:

1) include a watermark on the data and dump only data changed since the
   last dump (to do manually);
2) partition your tables and back up specific tables/partitions
   depending on the timing.

Hope this helps.
Luca
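[Editor's note: Luca's first option, a watermark column, might be sketched as below. The table and column names are hypothetical, and the cut-off timestamp would be whatever the previous run recorded; `COPY (query) TO STDOUT` has been available since PostgreSQL 8.2 and can be driven from a script via `psql -c`.]

```sql
-- Hypothetical: products carries a last_modified timestamp (the "watermark").
-- Each nightly run dumps only rows touched since the previous run,
-- then records the new cut-off for next time.
COPY (
    SELECT *
    FROM products
    WHERE last_modified > '2013-08-01 00:00:00'  -- cut-off from the previous run
) TO STDOUT;
```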
> I suppose wal archiving or PITR would be better
+1, never re-invent the wheel, unless you really need to.
Bèrto
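[Editor's note: WAL archiving is mostly a configuration matter. A sketch, circa 2013 (host name and paths are made up); after seeding the remote side once with a base backup, only the 16 MB WAL segments travel over the link:]

```conf
# postgresql.conf -- ship each finished WAL segment off-site
wal_level = archive
archive_mode = on
archive_command = 'rsync -a %p backup.example.com:/wal_archive/%f'
```

The `%p` placeholder expands to the path of the segment to archive and `%f` to its file name; the command must return zero only when the segment is safely stored.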
--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general
==============================
If Pac-Man had affected us as kids, we'd all be running around in a darkened room munching pills and listening to repetitive music.
On 08/01/2013 02:59 AM, hamann.w@t-online.de wrote:
> However, the diff files seem to be considerably larger than one would expect.
> One obvious part of the problem is the fact that diff shows old and new text,

You could try using

    diff --suppress-common-lines -ed

which in my experience creates the smallest diffs (actually ed scripts). Apply them with

    patch -e

Martin
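[Editor's note: a minimal sketch of Martin's workflow, using two tiny stand-in files rather than real pg_dump output; all file names are made up. `-e` asks diff for an ed script, `-d` for a minimal one.]

```shell
# Two successive text dumps standing in for pg_dump output
printf 'id|name|stock\n1|widget|10\n2|gadget|5\n' > dump_old.txt
printf 'id|name|stock\n1|widget|9\n2|gadget|5\n'  > dump_new.txt

# Produce a minimal ed script describing old -> new
diff -ed dump_old.txt dump_new.txt > delta.ed

# On the remote side: rebuild the new dump from the old one plus the delta
cp dump_old.txt rebuilt.txt
patch -e rebuilt.txt delta.ed

cmp -s rebuilt.txt dump_new.txt && echo "rebuild matches"
```

Note that the ed script carries only the changed lines of the new file, not the old text as well, which is why it stays smaller than a normal unified diff.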
>> On 08/01/2013 02:59 AM, hamann.w@t-online.de wrote:
>> > However, the diff files seem to be considerably larger than one would expect.
>> > One obvious part of the problem is the fact that diff shows old and new text,
>>
>> You could try using
>>     diff --suppress-common-lines -ed
>> which in my experience creates the smallest diffs (actually ed scripts).
>> Apply them with
>>     patch -e
>>
>> Martin

Hi Martin,

thanks for the hint - this is probably one of the things to do. I have something else in mind, but at present I just suspect that this might happen: when I modify data and select _without an ordering_, I am pretty sure to get the data in a different sequence than before. So I wonder whether forcing the dump to honor a particular ordering (if that is at all possible) would also reduce the size of the dumps, or the time diff takes to produce them.

Regards
Wolfgang Hamann
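[Editor's note: pg_dump itself does not promise a stable row order, but a per-table text dump can be forced into one; table and key names below are hypothetical. With every run sorted the same way, unchanged rows line up and diff sees only real changes.]

```sql
-- Dump one table in primary-key order so successive dumps line up for diff.
-- From a script:  psql -c "COPY (...) TO STDOUT" mydb > products.txt
COPY (SELECT * FROM products ORDER BY id) TO STDOUT;
```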
On Fri, Aug 2, 2013 at 6:55 PM, <hamann.w@t-online.de> wrote:
> thanks for the hint - this is probably one of the things to do.
> I have something else in mind, but at present I just suspect that this might happen:
> when I modify data and select _without an ordering_, I am pretty sure to get the data
> in a different sequence than before. So I wonder whether forcing the dump to honor
> a particular ordering (if that is at all possible) would also reduce the size of dumps ... or the
> time diff takes to produce them

May I ask what the final purpose is? If it is to take a backup, I guess this is not the right way to go, while if it is keeping (and rebuilding) a history of the data, then a more specific approach (like logging) could give you fewer headaches.

Luca
Luca Ferrari wrote:
> May I ask what the final purpose is? If it is to take a backup, I guess
> this is not the right way to go, while if it is keeping (and rebuilding)
> a history of the data, then a more specific approach (like logging)
> could give you fewer headaches.

Hi Luca,

we recently decided to have off-site backups rather than burning piles of DVDs that are kept on-site. The backup server sits in a data center and is fed nightly via rsync; the link is not too fast.

One thought in favor of text files: if disaster really strikes (the regular machine catches fire), it is quite likely that a replacement would be installed with the latest versions of all software. Now, if I had binary files, I would probably have to install the old version of the software just to be able to do a regular dump, and then reload into the newer one.

With the planned setup, I would also be able to look up previous states of the database. (Sample scenario: when was the price of product xyz actually changed?) This is likely not too convenient, but loading successive dumps into a secondary installation of the database is definitely worse.

Regards
Wolfgang
On Thu, Aug 1, 2013 at 1:59 AM, <hamann.w@t-online.de> wrote:
> Hi,
> I want to store copies of our data on a remote machine as a security measure.

Can you describe what your security concerns are? Are you worried about long-lasting malicious tampering with the data that you need to be able to recover from? Simple loss of data from natural disaster?

> My first attempt was a full dump (which takes too long to upload)
> followed by diffs between the pgdump files.
> This provides readable / searchable versioned data (I could alway apply
> the diffs on the remote machine and search the text file, without having
> an instance of postgres running on that machine)

I think that optimizing with the intention of not using PostgreSQL is probably the wrong approach. You find it valuable to use PostgreSQL on your production server; why would you not also find it valuable to use it on the remote?

I like the ability to use pg_dump to get human-readable data, and I use it often. But I usually start with a binary image recovered to the point I want, and then take a "fresh" pg_dump out of that for inspection, rather than trying to save pg_dumps at every time that might be of interest.

> However, the diff files seem to be considerably larger than one would expect.
> One obvious part of the problem is the fact that diff shows old and new text,
> so e.g. changing the amount of stock for a product with a 1kB description
> would generate at least 2kb of text in the diff file.

Usually a 1 kB product description would not be in the same table as a current stock count. Anyway, there are diff tools that are not line-oriented which could compress well on this type of data, but if the rows are not in the same order between dumps, they do poorly. There is some more discussion of this here:

http://www.postgresql.org/message-id/CAMkU=1z0+=M-2g-N3+y=d-QJu-qcYTGXbw_h1E5g5Gr+rCny6A@mail.gmail.com

Cheers,
Jeff
On 8/1/13, hamann.w@t-online.de <hamann.w@t-online.de> wrote:
> Hi,
> I want to store copies of our data on a remote machine as a security
> measure.
>
> Wolfgang

2 questions:

1. How secure is the remote site?
2. How much data are we talking about?
--
Mike Nolan
>> 2 questions:
>>
>> 1. How secure is the remote site?
>> 2. How much data are we talking about?
>> --
>> Mike Nolan

Hi Mike,

currently the source uses some 20 GB in a database partition and about 700 GB in a general data partition. For the database, a diff -e grows to about 10% of the size of a full dump within a week. The remote site is a RAID box at a hosting center, with paid backup.

Regards
Wolfgang
On 8/10/13, hamann.w@t-online.de <hamann.w@t-online.de> wrote:
> currently the source uses some 20 GB in a database partition and about
> 700 GB in a general data partition. For the database, a diff -e grows
> to about 10% of the size of a full dump in a week
> The remote site is a raid box at a hosting center, with paid backup

It sounds like you have catastrophic failure covered, but what about data integrity and data security?

You may need to 'roll your own' solution, possibly using something like Slony. Having a timestamp field that indicates when the row was inserted or last updated may help.

A true incremental backup would IMHO be a very useful tool for database administrators, but there are a number of technical challenges involved, especially dealing with deleted records.
--
Mike Nolan
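[Editor's note: one way to maintain such a timestamp field without touching the application is a trigger; a sketch with hypothetical names, assuming PL/pgSQL is installed. `EXECUTE PROCEDURE` is the spelling current for the 9.x releases of the era.]

```sql
-- Add the watermark column and keep it current on every UPDATE.
ALTER TABLE products
    ADD COLUMN last_modified timestamptz NOT NULL DEFAULT now();

CREATE FUNCTION touch_last_modified() RETURNS trigger AS $$
BEGIN
    NEW.last_modified := now();   -- stamp every modified row
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER products_touch
    BEFORE UPDATE ON products
    FOR EACH ROW EXECUTE PROCEDURE touch_last_modified();
```

As Mike notes, deleted rows remain invisible to this scheme: a vanished row carries no timestamp to dump, so deletions need separate handling (e.g. a soft-delete flag or an audit log).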