Thread: pg_dump order of rows
Hello,

today I stumbled across an interesting question about the order in which rows are dumped when exporting a database with pg_dump. I know questions like this show up on this list from time to time, but I think this one is a bit more special.

First of all, I know that dumping a database is a somewhat nondeterministic process and it's hard to tell in which order objects are dumped. If my understanding is correct, when it comes to dumping the rows of a specific table, the rows will appear in the dump in on-disk order. This order changes whenever rows in that table are updated, and depends on how often a vacuum runs to release old row versions.

I "verified" this by experimentation: I created a new table with a known order of rows and dumped it. The dump was ordered in the same way the rows had been inserted. I then updated some of the rows, and those rows appeared at the end of the dump. Finally I vacuumed the database and updated some more rows; they appeared in the "spaces" the previously updated rows had left behind. Exactly what I expected :-)

Now for my question: we have a case where rows are inserted in order and updated rarely (about 10% of the rows, and often in order), and we are looking for a way to make the dump (i.e. the backup) more random at the database level, by influencing the on-disk order. The obvious way would be to update all rows in random order by software before dumping.
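For reference, a minimal sketch of the kind of experiment described above (table name, column names and row counts are made up; the exact behaviour can vary with page layout and HOT updates):

  CREATE TABLE t (id integer PRIMARY KEY, val text);
  INSERT INTO t SELECT i, 'row ' || i FROM generate_series(1, 1000) AS i;
  -- pg_dump -t t mydb    => rows appear in insertion order
  UPDATE t SET val = 'changed' WHERE id IN (10, 20, 30);
  -- pg_dump -t t mydb    => the updated rows now appear at the end
  VACUUM t;
  UPDATE t SET val = 'changed again' WHERE id IN (40, 50);
  -- pg_dump -t t mydb    => these rows reuse the space freed by the vacuum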
2010/12/11 jan <jan@dafuer.de>
> we are seeking for a solution to make the dump (i.e. the backup) more random

Can you please explain why? (Just curious, this seems a very strange requirement.)

You _could_ _try_ to use the PostgreSQL synchronised seqscans feature to achieve this, but I'm not sure if it will work.
(after a minute of googling ... no it will not work - http://archives.postgresql.org/pgsql-committers/2008-01/msg00358.php )
cheers,
Filip
> Can you please explain why? (Just curious, this seems a very strange requirement.)

I'm doing some research on election systems - an interesting field. On the one hand there is a requirement of reliability, so there is a need for stable and proven software and, for example, backups of the systems. On the other hand transparency is also very important, so backing up with a simple pg_dump is more comprehensible to users and administrators than using a dedicated tool.

Of course there is no unencrypted data in the database - but preserving the insertion order of a specific table in a dump file could theoretically leak some kind of information, hence the question.

I'm not aware of any database where I can influence the on-disk order directly, but there may be some hints I'm missing (as far as I know, even a VACUUM FULL will only compact the data file, not reorder it according to an index or anything like that).

Jan
On 2010-12-13 09:12, Jan Kesten wrote:
> I'm not aware of any database where I can influence the on-disk order
> directly, but there may be some hints I'm missing (as far as I know,
> even a VACUUM FULL will only compact the data file, not reorder it
> according to an index or anything like that).

CLUSTER should be able to do that for you. See http://www.postgresql.org/docs/9.0/static/sql-cluster.html.

/ Hampus
On Mon, Dec 13, 2010 at 4:26 AM, Hampus Wessman <hampus.wessman@gmail.com> wrote:
> CLUSTER should be able to do that for you. See
> http://www.postgresql.org/docs/9.0/static/sql-cluster.html.

That's the exact opposite of what he wished to accomplish.

The only thing I can suggest is to do random order of updates, perhaps while also setting the fill factor for the pages to something large so there's no room to re-use a page, causing more shuffling.

I suspect you'll have to do this for every table you've got, else you could leak some information about what rows were updated since the last import if the person has access to the file used to import the data originally. You'll also have to do this every time you dump the DB, I'd suspect...
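For what it's worth, a rough sketch of that idea (table, column and key names are hypothetical, and how well the result is actually scattered depends on existing free space and vacuum timing):

  -- fillfactor 100 leaves no reserved space on a page, so an updated row
  -- version usually has to move to another page
  ALTER TABLE votes SET (fillfactor = 100);

  -- touch every row once, in random order, so the new row versions are
  -- written out in a more or less random physical order
  DO $$
  DECLARE
      r record;
  BEGIN
      FOR r IN SELECT id FROM votes ORDER BY random() LOOP
          UPDATE votes SET val = val WHERE id = r.id;
      END LOOP;
  END
  $$;
  VACUUM votes;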
Not at all. He wanted a way to influence the on-disk order of the rows in a table, so he could make sure they were not left in insertion order. That should be possible with CLUSTER. From the documentation: "When a table is clustered, it is physically reordered based on the index information" (again, see http://www.postgresql.org/docs/9.0/static/sql-cluster.html for full details).

Now, if one were to give CLUSTER an index that puts the data back in insertion order, then it wouldn't be very useful (obviously). On the other hand, if there is such an index, then the insertion order of the rows can easily be restored no matter how they are reordered... (making the whole thing pointless in this case). So, what index to choose? He can either cluster on an existing index, if that would remove all trace of the insertion order (any field with unique values would work equally well), or he can add a field with random values, index that field and cluster on that index. CLUSTER is just the tool...

Unless I've missed something fundamental about the CLUSTER command, it should be perfectly possible to reorder the data on disk in any way you want with it, if you first create an appropriate index to cluster on. The downsides I see are (1 & 2 directly from the documentation):

1. When a table is being clustered, an ACCESS EXCLUSIVE lock is acquired on it.
2. You need free space on disk at least equal to the sum of the table size and the index sizes.
3. Rows inserted after CLUSTER finishes will not be "protected". You would need to run it before a dump and make sure nothing is written to the table between the clustering and the dumping, if you want to be really sure...

Any reason why this wouldn't work?

On 2010-12-13 14:53, Vick Khera wrote:
> On Mon, Dec 13, 2010 at 4:26 AM, Hampus Wessman
> <hampus.wessman@gmail.com> wrote:
>> CLUSTER should be able to do that for you. See
>> http://www.postgresql.org/docs/9.0/static/sql-cluster.html.
>
> That's the exact opposite of what he wished to accomplish.
>
> The only thing I can suggest is to do random order of updates, perhaps
> while also setting the fill factor for the pages to something large so
> there's no room to re-use a page, causing more shuffling.
>
> I suspect you'll have to do this for every table you've got, else you
> could leak some information about what rows were updated since the
> last import if the person has access to the file used to import the
> data originally. You'll also have to do this every time you dump the
> DB, I'd suspect...
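A minimal sketch of the add-a-random-field-and-cluster variant described above (table and column names are hypothetical):

  -- add a column of random values and physically reorder the table by it
  ALTER TABLE votes ADD COLUMN shuffle_key double precision;
  UPDATE votes SET shuffle_key = random();
  CREATE INDEX votes_shuffle_idx ON votes (shuffle_key);
  CLUSTER votes USING votes_shuffle_idx;
  ANALYZE votes;

Re-running the UPDATE and the CLUSTER before each dump would re-shuffle the table, at the cost of the exclusive lock and the extra disk space mentioned above.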
> I'm not aware of any database where I can influence the on-disk order directly

Hello,

You may want to export your main tables using COPY instead of pg_dump. Doing this, you could randomize the row order of your tables, e.g. with hashtext:

COPY (SELECT * FROM foo ORDER BY hashtext(some_col)) TO <file>;

Getting a consistent dump of all your data may be an issue though...

regards,

Marc Mamin
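If that consistency concern matters, one option is to run all the COPY statements in a single snapshot - a minimal sketch, assuming hypothetical tables foo and bar with text columns some_col and other_col (note that server-side COPY ... TO a file also requires superuser and an absolute path):

  BEGIN ISOLATION LEVEL REPEATABLE READ;  -- one snapshot for all the tables
  COPY (SELECT * FROM foo ORDER BY hashtext(some_col)) TO '/tmp/foo.copy';
  COPY (SELECT * FROM bar ORDER BY hashtext(other_col)) TO '/tmp/bar.copy';
  COMMIT;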