Thread: incremental dumps
Hi,

I want to store copies of our data on a remote machine as a security measure. My first attempt was a full dump (which takes too long to upload) followed by diffs between the pg_dump files. This provides readable, searchable, versioned data: I could always apply the diffs on the remote machine and search the text file, without having an instance of Postgres running on that machine.

However, the diff files seem to be considerably larger than one would expect. One obvious part of the problem is the fact that diff shows old and new text, so e.g. changing the amount of stock for a product with a 1 kB description would generate at least 2 kB of text in the diff file.

What would be a better approach?

Best regards
Wolfgang
On Thu, Aug 1, 2013 at 10:59 AM, <hamann.w@t-online.de> wrote:
> However, the diff files seem to be considerably larger than one would expect.
> One obvious part of the problem is the fact that diff shows old and new text,
> so e.g. changing the amount of stock for a product with a 1kB description
> would generate at least 2kb of text in the diff file.
>
> What would be a better approach?

I suppose WAL archiving or PITR would be better, but assuming you want text files, I guess you need to change your database structure to either:

1) include a watermark on the data and dump only data changed since the
   last dump (to do manually);
2) partition your tables and back up specific tables/partitions
   depending on the timing.

Hope this helps.
Luca
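[Editor's note: Luca's first option, a watermark column, might be sketched as below. The table and column names are hypothetical, and the cut-off timestamp would be whatever the previous run recorded; `COPY (query) TO STDOUT` has been available since PostgreSQL 8.2 and can be driven from a script via `psql -c`.]

```sql
-- Hypothetical: products carries a last_modified timestamp (the "watermark").
-- Each nightly run dumps only rows touched since the previous run,
-- then records the new cut-off for next time.
COPY (
    SELECT *
    FROM products
    WHERE last_modified > '2013-08-01 00:00:00'  -- cut-off from the previous run
) TO STDOUT;
```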
> I suppose wal archiving or PITR would be better
+1, never re-invent the wheel, unless you really need to.
Bèrto
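[Editor's note: WAL archiving is mostly a configuration matter. A sketch, circa 2013 (host name and paths are made up); after seeding the remote side once with a base backup, only the 16 MB WAL segments travel over the link:]

```conf
# postgresql.conf -- ship each finished WAL segment off-site
wal_level = archive
archive_mode = on
archive_command = 'rsync -a %p backup.example.com:/wal_archive/%f'
```

The `%p` placeholder expands to the path of the segment to archive and `%f` to its file name; the command must return zero only when the segment is safely stored.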
--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general
==============================
If Pac-Man had affected us as kids, we'd all be running around in a darkened room munching pills and listening to repetitive music.
On 08/01/2013 02:59 AM, hamann.w@t-online.de wrote:
> However, the diff files seem to be considerably larger than one would expect.
> One obvious part of the problem is the fact that diff shows old and new text,

You could try using

    diff --suppress-common-lines -ed

which in my experience creates the smallest diffs (actually ed scripts). Apply them with

    patch -e

Martin
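[Editor's note: a minimal sketch of Martin's workflow, using two tiny stand-in files rather than real pg_dump output; all file names are made up. `-e` asks diff for an ed script, `-d` for a minimal one.]

```shell
# Two successive text dumps standing in for pg_dump output
printf 'id|name|stock\n1|widget|10\n2|gadget|5\n' > dump_old.txt
printf 'id|name|stock\n1|widget|9\n2|gadget|5\n'  > dump_new.txt

# Produce a minimal ed script describing old -> new
diff -ed dump_old.txt dump_new.txt > delta.ed

# On the remote side: rebuild the new dump from the old one plus the delta
cp dump_old.txt rebuilt.txt
patch -e rebuilt.txt delta.ed

cmp -s rebuilt.txt dump_new.txt && echo "rebuild matches"
```

Note that the ed script carries only the changed lines of the new file, not the old text as well, which is why it stays smaller than a normal unified diff.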
>> On 08/01/2013 02:59 AM, hamann.w@t-online.de wrote:
>> > However, the diff files seem to be considerably larger than one would expect.
>> > One obvious part of the problem is the fact that diff shows old and new text,
>>
>> You could try using
>>     diff --suppress-common-lines -ed
>> which in my experience creates the smallest diffs (actually ed scripts).
>> Apply them with
>>     patch -e
>>
>> Martin

Hi Martin,

thanks for the hint - this is probably one of the things to do. I have something else in mind, but at present I just suspect that this might happen: when I modify data and select _without an ordering_, I am pretty sure to get the data in a different sequence than before. So I wonder whether forcing the dump to honor a particular ordering (if that is at all possible) would also reduce the size of the dumps, or the time diff takes to produce them.

Regards
Wolfgang Hamann
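[Editor's note: pg_dump itself does not promise a stable row order, but a per-table text dump can be forced into one; table and key names below are hypothetical. With every run sorted the same way, unchanged rows line up and diff sees only real changes.]

```sql
-- Dump one table in primary-key order so successive dumps line up for diff.
-- From a script:  psql -c "COPY (...) TO STDOUT" mydb > products.txt
COPY (SELECT * FROM products ORDER BY id) TO STDOUT;
```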
On Fri, Aug 2, 2013 at 6:55 PM, <hamann.w@t-online.de> wrote:
> thanks for the hint - this is probably one of the things to do.
> I have something else in mind, but at present I just suspect that this might happen:
> when I modify data and select _without an ordering_, I am pretty sure to get the data
> in a different sequence than before. So I wonder whether forcing the dump to honor
> a particular ordering (if that is at all possible) would also reduce the size of dumps ... or the
> time diff takes to produce them

May I ask what the final purpose is? If it is to take a backup, I guess this is not the right way to go, while if it is keeping (and rebuilding) a history of the data, then a more specific approach (like logging) could give you fewer headaches.

Luca
Luca Ferrari wrote:
> May I ask what the final purpose is? If it is to take a backup, I guess
> this is not the right way to go, while if it is keeping (and rebuilding)
> a history of the data, then a more specific approach (like logging)
> could give you fewer headaches.

Hi Luca,

we recently decided to have off-site backups rather than burning piles of DVDs that are kept on-site. The backup server sits in a data center and is fed nightly via rsync; the link is not too fast.

One thought in favor of text files: if disaster really strikes (the regular machine catches fire), it is quite likely that a replacement would be installed with the latest versions of all software. Now, if I had binary files, I would probably have to install the old version of the software just to be able to do a regular dump, and then reload into the newer one.

With the planned setup, I would also be able to look up previous states of the database. (Sample scenario: when was the price of product xyz actually changed?) This is likely not too convenient, but loading successive dumps into a secondary installation of the database is definitely worse.

Regards
Wolfgang
On Thu, Aug 1, 2013 at 1:59 AM, <hamann.w@t-online.de> wrote:
> Hi,
> I want to store copies of our data on a remote machine as a security measure.

Can you describe what your security concerns are? Are you worried about long-lasting malicious tampering with the data that you need to be able to recover from? Simple loss of data from natural disaster?

> My first attempt was a full dump (which takes too long to upload)
> followed by diffs between the pgdump files.
> This provides readable / searchable versioned data (I could alway apply
> the diffs on the remote machine and search the text file, without having
> an instance of postgres running on that machine)

I think that optimizing with the intention of not using PostgreSQL is probably the wrong approach. You find it valuable to use PostgreSQL on your production server; why would you not also find it valuable to use it on the remote?

I like the ability to use pg_dump to get human-readable data, and I use it often. But I usually start with a binary image recovered to the point I want, and then take a "fresh" pg_dump out of that for inspection, rather than trying to save pg_dumps at every time that might be of interest.

> However, the diff files seem to be considerably larger than one would expect.
> One obvious part of the problem is the fact that diff shows old and new text,
> so e.g. changing the amount of stock for a product with a 1kB description
> would generate at least 2kb of text in the diff file.

Usually a 1 kB product description would not be in the same table as a current stock count. Anyway, there are diff tools that are not line-oriented which could compress well on this type of data, but if the rows are not in the same order between dumps, they do poorly. There is some more discussion of this here:

http://www.postgresql.org/message-id/CAMkU=1z0+=M-2g-N3+y=d-QJu-qcYTGXbw_h1E5g5Gr+rCny6A@mail.gmail.com

Cheers,
Jeff
On 8/1/13, hamann.w@t-online.de <hamann.w@t-online.de> wrote:
> Hi,
> I want to store copies of our data on a remote machine as a security
> measure.
>
> Wolfgang

2 questions:

1. How secure is the remote site?
2. How much data are we talking about?
--
Mike Nolan
>> 2 questions:
>>
>> 1. How secure is the remote site?
>> 2. How much data are we talking about?
>> --
>> Mike Nolan

Hi Mike,

currently the source uses some 20 GB in a database partition and about 700 GB in a general data partition. For the database, a diff -e grows to about 10% of the size of a full dump within a week. The remote site is a RAID box at a hosting center, with paid backup.

Regards
Wolfgang
On 8/10/13, hamann.w@t-online.de <hamann.w@t-online.de> wrote:
> currently the source uses some 20 GB in a database partition and about
> 700 GB in a general data partition. For the database, a diff -e grows
> to about 10% of the size of a full dump in a week
> The remote site is a raid box at a hosting center, with paid backup

It sounds like you have catastrophic failure covered, but what about data integrity and data security?

You may need to 'roll your own' solution, possibly using something like Slony. Having a timestamp field that indicates when the row was inserted or last updated may help.

A true incremental backup would IMHO be a very useful tool for database administrators, but there are a number of technical challenges involved, especially dealing with deleted records.
--
Mike Nolan
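[Editor's note: one way to maintain such a timestamp field without touching the application is a trigger; a sketch with hypothetical names, assuming PL/pgSQL is installed. `EXECUTE PROCEDURE` is the spelling current for the 9.x releases of the era.]

```sql
-- Add the watermark column and keep it current on every UPDATE.
ALTER TABLE products
    ADD COLUMN last_modified timestamptz NOT NULL DEFAULT now();

CREATE FUNCTION touch_last_modified() RETURNS trigger AS $$
BEGIN
    NEW.last_modified := now();   -- stamp every modified row
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER products_touch
    BEFORE UPDATE ON products
    FOR EACH ROW EXECUTE PROCEDURE touch_last_modified();
```

As Mike notes, deleted rows remain invisible to this scheme: a vanished row carries no timestamp to dump, so deletions need separate handling (e.g. a soft-delete flag or an audit log).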