Thread: pg_dump order of rows
Hello,

today I stumbled across an interesting question about the order in which rows are dumped when exporting a database with pg_dump. I know questions like this show up on this list from time to time, but I think this one is a bit more special.

First of all, I know that dumping a database is a somewhat nondeterministic process and it's hard to tell in which order objects are dumped. If my understanding is correct, when it comes to dumping the rows of a specific table, the rows will appear in the dump in on-disk order. This order changes whenever rows in that table are updated, and depends on how often a vacuum runs to release old row versions.

I "verified" this by experimentation: I created a new table with a known order of rows and dumped it. The dump was ordered in the same way the rows had been inserted. I then updated some of the rows, and those rows appeared at the end of the dump. Finally I vacuumed the database and updated some more rows; they appeared in the "spaces" the previously updated rows had left behind. Exactly what I expected :-)

Now for my question: we have a case where rows are inserted in order and updated rarely (about 10% of the rows, and often in order), and we are looking for a way to make the dump (i.e. the backup) more random at the database level, by influencing the on-disk order. The obvious way would be to update all rows in random order by software before dumping.
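For reference, a minimal sketch of the kind of experiment described above (table name, column names and row counts are made up; the exact behaviour can vary with page layout and HOT updates):

  CREATE TABLE t (id integer PRIMARY KEY, val text);
  INSERT INTO t SELECT i, 'row ' || i FROM generate_series(1, 1000) AS i;
  -- pg_dump -t t mydb    => rows appear in insertion order
  UPDATE t SET val = 'changed' WHERE id IN (10, 20, 30);
  -- pg_dump -t t mydb    => the updated rows now appear at the end
  VACUUM t;
  UPDATE t SET val = 'changed again' WHERE id IN (40, 50);
  -- pg_dump -t t mydb    => these rows reuse the space freed by the vacuum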
2010/12/11 jan <jan@dafuer.de>
> we are seeking for a solution to make the dump (i.e. the backup) more random

Can you please explain why? (Just curious, this seems a very strange requirement.)

You _could_ _try_ to use the PostgreSQL synchronised seqscans feature to achieve this, but I'm not sure if it will work.
(after a minute of googling ... no it will not work - http://archives.postgresql.org/pgsql-committers/2008-01/msg00358.php )
cheers,
Filip
> Can you please explain why? (Just curious, this seems a very strange requirement.)

I'm doing some research on election systems - an interesting field. On the one hand there is a requirement of reliability, so there is a need for stable and proven software and, for example, backups of the systems. On the other hand transparency is also very important, so backing up with a simple pg_dump is more comprehensible to users and administrators than using a dedicated tool.

Of course there is no unencrypted data in the database - but preserving the insertion order of a specific table in a dump file could theoretically leak some kind of information, hence the question.

I'm not aware of any database where I can influence the on-disk order directly, but there may be some hints I'm missing (as far as I know, even a VACUUM FULL will only compact the data file, not reorder it according to an index or anything like that).

Jan
On 2010-12-13 09:12, Jan Kesten wrote:
> I'm not aware of any database where I can influence the on-disk order
> directly, but there may be some hints I'm missing (as far as I know,
> even a VACUUM FULL will only compact the data file, not reorder it
> according to an index or anything like that).

CLUSTER should be able to do that for you. See http://www.postgresql.org/docs/9.0/static/sql-cluster.html.

/ Hampus
On Mon, Dec 13, 2010 at 4:26 AM, Hampus Wessman <hampus.wessman@gmail.com> wrote:
> CLUSTER should be able to do that for you. See
> http://www.postgresql.org/docs/9.0/static/sql-cluster.html.

That's the exact opposite of what he wished to accomplish.

The only thing I can suggest is to do random order of updates, perhaps while also setting the fill factor for the pages to something large so there's no room to re-use a page, causing more shuffling.

I suspect you'll have to do this for every table you've got, else you could leak some information about what rows were updated since the last import if the person has access to the file used to import the data originally. You'll also have to do this every time you dump the DB, I'd suspect...
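For what it's worth, a rough sketch of that idea (table, column and key names are hypothetical, and how well the result is actually scattered depends on existing free space and vacuum timing):

  -- fillfactor 100 leaves no reserved space on a page, so an updated row
  -- version usually has to move to another page
  ALTER TABLE votes SET (fillfactor = 100);

  -- touch every row once, in random order, so the new row versions are
  -- written out in a more or less random physical order
  DO $$
  DECLARE
      r record;
  BEGIN
      FOR r IN SELECT id FROM votes ORDER BY random() LOOP
          UPDATE votes SET val = val WHERE id = r.id;
      END LOOP;
  END
  $$;
  VACUUM votes;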
Not at all. He wanted a way to influence the on-disk order of the rows in a table, so he could make sure they were not left in insertion order. That should be possible with CLUSTER. From the documentation: "When a table is clustered, it is physically reordered based on the index information" (again, see http://www.postgresql.org/docs/9.0/static/sql-cluster.html for full details).

Now, if one were to give CLUSTER an index that puts the data back in insertion order, then it wouldn't be very useful (obviously). On the other hand, if there is such an index, then the insertion order of the rows can easily be restored no matter how they are reordered... (making the whole thing pointless in this case). So, what index to choose? He can either cluster on an existing index, if that would remove all trace of the insertion order (any field with unique values would work equally well), or he can add a field with random values, index that field and cluster on that index. CLUSTER is just the tool...

Unless I've missed something fundamental about the CLUSTER command, it should be perfectly possible to reorder the data on disk in any way you want with it, if you first create an appropriate index to cluster on. The downsides I see are (1 & 2 directly from the documentation):

1. When a table is being clustered, an ACCESS EXCLUSIVE lock is acquired on it.
2. You need free space on disk at least equal to the sum of the table size and the index sizes.
3. Rows inserted after CLUSTER finishes will not be "protected". You would need to run it before a dump and make sure nothing is written to the table between the clustering and the dumping, if you want to be really sure...

Any reason why this wouldn't work?

On 2010-12-13 14:53, Vick Khera wrote:
> On Mon, Dec 13, 2010 at 4:26 AM, Hampus Wessman
> <hampus.wessman@gmail.com> wrote:
>> CLUSTER should be able to do that for you. See
>> http://www.postgresql.org/docs/9.0/static/sql-cluster.html.
>
> That's the exact opposite of what he wished to accomplish.
>
> The only thing I can suggest is to do random order of updates, perhaps
> while also setting the fill factor for the pages to something large so
> there's no room to re-use a page, causing more shuffling.
>
> I suspect you'll have to do this for every table you've got, else you
> could leak some information about what rows were updated since the
> last import if the person has access to the file used to import the
> data originally. You'll also have to do this every time you dump the
> DB, I'd suspect...
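A minimal sketch of the add-a-random-field-and-cluster variant described above (table and column names are hypothetical):

  -- add a column of random values and physically reorder the table by it
  ALTER TABLE votes ADD COLUMN shuffle_key double precision;
  UPDATE votes SET shuffle_key = random();
  CREATE INDEX votes_shuffle_idx ON votes (shuffle_key);
  CLUSTER votes USING votes_shuffle_idx;
  ANALYZE votes;

Re-running the UPDATE and the CLUSTER before each dump would re-shuffle the table, at the cost of the exclusive lock and the extra disk space mentioned above.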
> I'm not aware of any database where I can influence the on-disk order directly

Hello,

You may want to export your main tables using COPY instead of pg_dump. Doing this, you could randomize the row order of your tables, e.g. with hashtext:

COPY (SELECT * FROM foo ORDER BY hashtext(some_col)) TO <file>;

Getting a consistent dump of all your data may be an issue though...

regards,

Marc Mamin
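If that consistency concern matters, one option is to run all the COPY statements in a single snapshot - a minimal sketch, assuming hypothetical tables foo and bar with text columns some_col and other_col (note that server-side COPY ... TO a file also requires superuser and an absolute path):

  BEGIN ISOLATION LEVEL REPEATABLE READ;  -- one snapshot for all the tables
  COPY (SELECT * FROM foo ORDER BY hashtext(some_col)) TO '/tmp/foo.copy';
  COPY (SELECT * FROM bar ORDER BY hashtext(other_col)) TO '/tmp/bar.copy';
  COMMIT;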