Thread: Adding WHERE clause to pg_dump

Adding WHERE clause to pg_dump

From
Simon Riggs
Date:
Attached patch implements WHERE clauses for pg_dump.

This is useful for producing data samples of a database

e.g. pg_dump -w "ctid < '(1000,1)' or random() < 0.1"

and can also be used for taking incremental backups, if data columns
exist to make a partial dump sensible.

e.g. pg_dump -w "last_update_timestamp > ...."

Columns such as this are very common because of optimistic locking
techniques in many databases.

This is designed to be used in conjunction with the TOM utility, and the
forthcoming patch to implement stats hooks. Taken together these
features will allow the ability to take a cut-down database environment
for testing, yet with statistics matching the main production database.

It was easier to write it and then discuss, since I needed to check the
feasibility of the idea before presenting it.

--
 Simon Riggs           www.2ndQuadrant.com
 PostgreSQL Training, Services and Support

Attachment

Re: Adding WHERE clause to pg_dump

From
Tom Lane
Date:
Simon Riggs <simon@2ndquadrant.com> writes:
> Attached patch implements WHERE clauses for pg_dump.

I still have serious reservations about adding such an ugly,
non-orthogonal wart to pg_dump.  Why is it not appropriate to just
do a COPY (SELECT ...) TO STDOUT when you need this?
        regards, tom lane


Re: Adding WHERE clause to pg_dump

From
Simon Riggs
Date:
On Fri, 2008-07-25 at 13:31 -0400, Tom Lane wrote:
> Simon Riggs <simon@2ndquadrant.com> writes:
> > Attached patch implements WHERE clauses for pg_dump.
> 
> I still have serious reservations about adding such an ugly,
> non-orthogonal wart to pg_dump.  Why is it not appropriate to just
> do a COPY (SELECT ...) TO STDOUT when you need this?

So you can dump a coherent sample database in one command, not 207.

Every user of PostgreSQL wants a dev/test database. If the database is
large it isn't practical to take a complete copy. Nor is it practical to
hand-write a data sampling extraction program and if you do, its usually
imperfect in many ways.

Adding this feature gives a very fast capability to create sample
databases, or incremental backups for many cases.

-- Simon Riggs           www.2ndQuadrant.comPostgreSQL Training, Services and Support



Re: Adding WHERE clause to pg_dump

From
"Joshua D. Drake"
Date:
On Fri, 2008-07-25 at 19:33 +0100, Simon Riggs wrote:
> On Fri, 2008-07-25 at 13:31 -0400, Tom Lane wrote:
> > Simon Riggs <simon@2ndquadrant.com> writes:
> > > Attached patch implements WHERE clauses for pg_dump.
> > 
> > I still have serious reservations about adding such an ugly,
> > non-orthogonal wart to pg_dump.  Why is it not appropriate to just
> > do a COPY (SELECT ...) TO STDOUT when you need this?
> 
> So you can dump a coherent sample database in one command, not 207.
> 
> Every user of PostgreSQL wants a dev/test database. If the database is
> large it isn't practical to take a complete copy. Nor is it practical to
> hand-write a data sampling extraction program and if you do, its usually
> imperfect in many ways.
> 
> Adding this feature gives a very fast capability to create sample
> databases, or incremental backups for many cases.

Not sure I buy this argument. I am all for usability and I would be the
first to shout about the general ridiculousness of pg_dump/all/restore
but in this case I think Tom is right. This feature could easily be done
in a script without harassing pg_dump.

Sincerely,

Joshua D. Drake




-- 
The PostgreSQL Company since 1997: http://www.commandprompt.com/ 
PostgreSQL Community Conference: http://www.postgresqlconference.org/
United States PostgreSQL Association: http://www.postgresql.us/
Donate to the PostgreSQL Project: http://www.postgresql.org/about/donate





Re: Adding WHERE clause to pg_dump

From
Simon Riggs
Date:
On Fri, 2008-07-25 at 11:46 -0700, Joshua D. Drake wrote:
> On Fri, 2008-07-25 at 19:33 +0100, Simon Riggs wrote:
> > On Fri, 2008-07-25 at 13:31 -0400, Tom Lane wrote:
> > > Simon Riggs <simon@2ndquadrant.com> writes:
> > > > Attached patch implements WHERE clauses for pg_dump.
> > > 
> > > I still have serious reservations about adding such an ugly,
> > > non-orthogonal wart to pg_dump.  Why is it not appropriate to just
> > > do a COPY (SELECT ...) TO STDOUT when you need this?
> > 
> > So you can dump a coherent sample database in one command, not 207.
> > 
> > Every user of PostgreSQL wants a dev/test database. If the database is
> > large it isn't practical to take a complete copy. Nor is it practical to
> > hand-write a data sampling extraction program and if you do, its usually
> > imperfect in many ways.
> > 
> > Adding this feature gives a very fast capability to create sample
> > databases, or incremental backups for many cases.
> 
> Not sure I buy this argument. I am all for usability and I would be the
> first to shout about the general ridiculousness of pg_dump/all/restore
> but in this case I think Tom is right. This feature could easily be done
> in a script without harassing pg_dump.

You can do it, yes. But it takes a lot longer. If the time to implement
was similar, then I would immediately agree "feature available already".

pg_dump is not "harassed" by this. What is lost by adding this feature?

-- Simon Riggs           www.2ndQuadrant.comPostgreSQL Training, Services and Support



Re: Adding WHERE clause to pg_dump

From
"Joshua D. Drake"
Date:
On Fri, 2008-07-25 at 20:26 +0100, Simon Riggs wrote:
> On Fri, 2008-07-25 at 11:46 -0700, Joshua D. Drake wrote:
> > On Fri, 2008-07-25 at 19:33 +0100, Simon Riggs wrote:

> > > Adding this feature gives a very fast capability to create sample
> > > databases, or incremental backups for many cases.
> > 
> > Not sure I buy this argument. I am all for usability and I would be the
> > first to shout about the general ridiculousness of pg_dump/all/restore
> > but in this case I think Tom is right. This feature could easily be done
> > in a script without harassing pg_dump.
> 
> You can do it, yes. But it takes a lot longer. If the time to implement
> was similar, then I would immediately agree "feature available already".
> 
> pg_dump is not "harassed" by this. What is lost by adding this feature?

Gained. Code complexity. Right now pg_dump does, copy. You are
introducing a whole other level of complexity by adding WHERE clause
capability. Secondly I don't think it would actually add anything but
complexity to the user.

How do we deal with this?

pg_dump -w "last_update_timestamp < ..." -t 'table*'

What I see is a recipe for inconsistent, un-restorable backups without a
user realizing what they have done. The only way to deal with the above
is:

1. Wildcards aren't allowed if you have -w
2. You dump everything, if the WHERE clause isn't relevant you just dump
the whole table

I don't like either.

I do see utility if you know what you are doing but I think it makes
more sense to have it outside of pg_dump.

Sincerely,

Joshua D. Drake



-- 
The PostgreSQL Company since 1997: http://www.commandprompt.com/ 
PostgreSQL Community Conference: http://www.postgresqlconference.org/
United States PostgreSQL Association: http://www.postgresql.us/
Donate to the PostgreSQL Project: http://www.postgresql.org/about/donate





Re: Adding WHERE clause to pg_dump

From
daveg
Date:
On Fri, Jul 25, 2008 at 08:26:35PM +0100, Simon Riggs wrote:
> 
> On Fri, 2008-07-25 at 11:46 -0700, Joshua D. Drake wrote:
> > On Fri, 2008-07-25 at 19:33 +0100, Simon Riggs wrote:
> > > On Fri, 2008-07-25 at 13:31 -0400, Tom Lane wrote:
> > > > Simon Riggs <simon@2ndquadrant.com> writes:
> > > > > Attached patch implements WHERE clauses for pg_dump.
> > > > 
> > > > I still have serious reservations about adding such an ugly,
> > > > non-orthogonal wart to pg_dump.  Why is it not appropriate to just
> > > > do a COPY (SELECT ...) TO STDOUT when you need this?
> > > 
> > > So you can dump a coherent sample database in one command, not 207.
> > > 
> > > Every user of PostgreSQL wants a dev/test database. If the database is
> > > large it isn't practical to take a complete copy. Nor is it practical to
> > > hand-write a data sampling extraction program and if you do, its usually
> > > imperfect in many ways.
> > > 
> > > Adding this feature gives a very fast capability to create sample
> > > databases, or incremental backups for many cases.
> > 
> > Not sure I buy this argument. I am all for usability and I would be the
> > first to shout about the general ridiculousness of pg_dump/all/restore
> > but in this case I think Tom is right. This feature could easily be done
> > in a script without harassing pg_dump.
> 
> You can do it, yes. But it takes a lot longer. If the time to implement
> was similar, then I would immediately agree "feature available already".
> 
> pg_dump is not "harassed" by this. What is lost by adding this feature?

This was discussed at the beginning of June on patches, Dave Durham submitted
a patch to add where clauses via a -w option and then in response to feedback
to add it to each each table of -t. See discussion here:
 http://archives.postgresql.org/pgsql-patches/2008-06/msg00001.php

and final patch here:
 http://archives.postgresql.org/pgsql-patches/2008-06/msg00026.php.

We now have two patches on this topic from different submitters with
different use cases supplied as justification. I have yet another use case
not mentioned by either of the submitters and will probably hand patch
pg_dump locally to do so.

I don't think at this point we should wave this off under the impression
that no one really wants or needs it as obviously some people want it enough
to code it. The other objections seem to be based on the themes:
- code complexity.
    Davy's patch is quite simple. I have looked at Simon's yet.
- we need an ETL tool so this should be preempted by that.- pg_dump should be made into a library so this can be done
separately.
    We don't generally allow imaginary futures to prevent us from adding    useful functionality on other topics.
- This can be done with a script.
    Not really. The script would pretty much have to contain most of    pg_dump. That's more than a script.
- users could make partial dumps and be confused and lose data.
     Yes, but they can already do that with -n, -t, and the new pre-data     and post-data switches. This is one more
casewhere the default is     a full dump but you one can specificly request less.
 

I think that once COPY sprouted a WHERE clause it becomes almost inevitable
that pg_dump will take advantage of them. How many patches on this topic do
we want to ignore?

As you may have guessed by this point:
+1

-dg 

-- 
David Gould       daveg@sonic.net      510 536 1443    510 282 0869
If simplicity worked, the world would be overrun with insects.


Re: Adding WHERE clause to pg_dump

From
"Joshua D. Drake"
Date:
On Fri, 2008-07-25 at 14:11 -0700, daveg wrote:
> On Fri, Jul 25, 2008 at 08:26:35PM +0100, Simon Riggs wrote:

>  - This can be done with a script.
> 
>      Not really. The script would pretty much have to contain most of
>      pg_dump. That's more than a script.
> 

Yes really. :) The only thing pg_dump is buying you here is easy of
schema pull. In a situation like this you would pull a pg_dump -s then
only restore data that you want based on a single transaction snapshot
of the objects you are going to query.

>  - users could make partial dumps and be confused and lose data.
> 
>       Yes, but they can already do that with -n, -t, and the new pre-data
>       and post-data switches. This is one more case where the default is
>       a full dump but you one can specificly request less.

No they actually can't. You are guaranteed that regardless of a -n or -t
flag that the data you receive is consistent. You can't guarantee that
with -w because you could pull different data based on an arbitrary
conditional that can not apply to all objects.

Joshua D. Drake

-- 
The PostgreSQL Company since 1997: http://www.commandprompt.com/ 
PostgreSQL Community Conference: http://www.postgresqlconference.org/
United States PostgreSQL Association: http://www.postgresql.us/
Donate to the PostgreSQL Project: http://www.postgresql.org/about/donate





Re: Adding WHERE clause to pg_dump

From
Simon Riggs
Date:
On Fri, 2008-07-25 at 14:11 -0700, daveg wrote:
> On Fri, Jul 25, 2008 at 08:26:35PM +0100, Simon Riggs wrote:
> > 
> > On Fri, 2008-07-25 at 11:46 -0700, Joshua D. Drake wrote:
> > > On Fri, 2008-07-25 at 19:33 +0100, Simon Riggs wrote:
> > > > On Fri, 2008-07-25 at 13:31 -0400, Tom Lane wrote:
> > > > > Simon Riggs <simon@2ndquadrant.com> writes:
> > > > > > Attached patch implements WHERE clauses for pg_dump.
> > > > > 
> > > > > I still have serious reservations about adding such an ugly,
> > > > > non-orthogonal wart to pg_dump.  Why is it not appropriate to just
> > > > > do a COPY (SELECT ...) TO STDOUT when you need this?
> > > > 
> > > > So you can dump a coherent sample database in one command, not 207.
> > > > 
> > > > Every user of PostgreSQL wants a dev/test database. If the database is
> > > > large it isn't practical to take a complete copy. Nor is it practical to
> > > > hand-write a data sampling extraction program and if you do, its usually
> > > > imperfect in many ways.
> > > > 
> > > > Adding this feature gives a very fast capability to create sample
> > > > databases, or incremental backups for many cases.
> > > 
> > > Not sure I buy this argument. I am all for usability and I would be the
> > > first to shout about the general ridiculousness of pg_dump/all/restore
> > > but in this case I think Tom is right. This feature could easily be done
> > > in a script without harassing pg_dump.
> > 
> > You can do it, yes. But it takes a lot longer. If the time to implement
> > was similar, then I would immediately agree "feature available already".
> > 
> > pg_dump is not "harassed" by this. What is lost by adding this feature?
> 
> This was discussed at the beginning of June on patches, Dave Durham submitted
> a patch to add where clauses via a -w option and then in response to feedback
> to add it to each each table of -t. See discussion here:
> 
>   http://archives.postgresql.org/pgsql-patches/2008-06/msg00001.php
> 
> and final patch here:
> 
>   http://archives.postgresql.org/pgsql-patches/2008-06/msg00026.php.
> 
> We now have two patches on this topic from different submitters with
> different use cases supplied as justification. 

Well, that is truly bizarre.

I had no idea about the existence of the other patch. I guess I must
have been busy that week.

This was designed a while back in conjunction with other related
thoughts. I still want an easy way to create a data sample for creating
dev databases from large production systems.

I defer and apologise to the previous submitter, since he got there
first, and apologise again for the noise.

(Cheeky code review: Davy's patch fails if used with -o option, plus I
think it outputs the wrong text into the dump file, AFAICS).

-- Simon Riggs           www.2ndQuadrant.comPostgreSQL Training, Services and Support



Re: Adding WHERE clause to pg_dump

From
Simon Riggs
Date:
On Fri, 2008-07-25 at 14:29 -0700, Joshua D. Drake wrote:
> 
> >  - users could make partial dumps and be confused and lose data.
> > 
> >       Yes, but they can already do that with -n, -t, and the new
> pre-data
> >       and post-data switches. This is one more case where the
> default is
> >       a full dump but you one can specificly request less.
> 
> No they actually can't. You are guaranteed that regardless of a -n or
> -t
> flag that the data you receive is consistent. You can't guarantee that
> with -w because you could pull different data based on an arbitrary
> conditional that can not apply to all objects.

But are you guaranteed that you have all tables in FK relationships? No.
(But I like that capability also - its useful).

-- Simon Riggs           www.2ndQuadrant.comPostgreSQL Training, Services and Support



Re: Adding WHERE clause to pg_dump

From
Davy Durham
Date:
Simon Riggs wrote:

>
>Well, that is truly bizarre.
>
>I had no idea about the existence of the other patch. I guess I must
>have been busy that week.
>
>This was designed a while back in conjunction with other related
>thoughts. I still want an easy way to create a data sample for creating
>dev databases from large production systems.
>
>I defer and apologise to the previous submitter, since he got there
>first, and apologise again for the noise.
>
>(Cheeky code review: Davy's patch fails if used with -o option, plus I
>think it outputs the wrong text into the dump file, AFAICS).
>
>  
>
Are you using my patch at 
http://archives.postgresql.org/pgsql-patches/2008-06/msg00026.php ?  
I'll be glad to fix it.


Re: Adding WHERE clause to pg_dump

From
Gregory Stark
Date:
"Joshua D. Drake" <jd@commandprompt.com> writes:

> How do we deal with this?
>
> pg_dump -w "last_update_timestamp < ..." -t 'table*'
>
> What I see is a recipe for inconsistent, un-restorable backups without a
> user realizing what they have done. The only way to deal with the above
> is:
>
> 1. Wildcards aren't allowed if you have -w
> 2. You dump everything, if the WHERE clause isn't relevant you just dump
> the whole table

There's always 
 3. Apply the WHERE clause to all tables and if there's a table missing    columns referenced in the where clause then
failwith the appropriate    error.
 

Which seems like the right option to me. The tricky bit would be how to deal
with cases where you want a different where clause for different tables. But
even if it doesn't handle all cases that doesn't mean a partial solution is
unreasonable.

--  Gregory Stark EnterpriseDB          http://www.enterprisedb.com Ask me about EnterpriseDB's On-Demand Production
Tuning


Re: Adding WHERE clause to pg_dump

From
daveg
Date:
On Fri, Jul 25, 2008 at 11:17:20PM +0100, Gregory Stark wrote:
> "Joshua D. Drake" <jd@commandprompt.com> writes:
> 
> > How do we deal with this?
> >
> > pg_dump -w "last_update_timestamp < ..." -t 'table*'
> >
> > What I see is a recipe for inconsistent, un-restorable backups without a
> > user realizing what they have done. The only way to deal with the above
> > is:
> >
> > 1. Wildcards aren't allowed if you have -w
> > 2. You dump everything, if the WHERE clause isn't relevant you just dump
> > the whole table
> 
> There's always 
> 
>   3. Apply the WHERE clause to all tables and if there's a table missing
>      columns referenced in the where clause then fail with the appropriate
>      error.
> 
> Which seems like the right option to me. The tricky bit would be how to deal
> with cases where you want a different where clause for different tables. But
> even if it doesn't handle all cases that doesn't mean a partial solution is
> unreasonable.

Actually, Davy's patch does deal with the case "where you want a different
where clause for different tables".

-dg


-- 
David Gould       daveg@sonic.net      510 536 1443    510 282 0869
If simplicity worked, the world would be overrun with insects.


Re: Adding WHERE clause to pg_dump

From
Simon Riggs
Date:
On Fri, 2008-07-25 at 12:38 -0700, Joshua D. Drake wrote:

> Gained. Code complexity. 

Hardly, patch is very small. I would recognise that as a factor
otherwise.

> What I see is a recipe for inconsistent, un-restorable backups without a
> user realizing what they have done.

I agree on the backup side, but then who would extract just a portion of
their data for backup? It would be no backup at all. 

If you did use this as part of an incremental backup scheme, then they
would have to test it (just like any backup method). Incremental backups
rarely have self-consistency except as part of a greater whole.

As a dev tool it makes sense.

-- Simon Riggs           www.2ndQuadrant.comPostgreSQL Training, Services and Support



Re: Adding WHERE clause to pg_dump

From
Andrew Dunstan
Date:

Simon Riggs wrote:
> On Fri, 2008-07-25 at 12:38 -0700, Joshua D. Drake wrote:
>
>   
>> Gained. Code complexity. 
>>     
>
> Hardly, patch is very small. I would recognise that as a factor
> otherwise.
>
>   
>> What I see is a recipe for inconsistent, un-restorable backups without a
>> user realizing what they have done.
>>     
>
> I agree on the backup side, but then who would extract just a portion of
> their data for backup? It would be no backup at all. 
>
> If you did use this as part of an incremental backup scheme, then they
> would have to test it (just like any backup method). Incremental backups
> rarely have self-consistency except as part of a greater whole.
>
> As a dev tool it makes sense.
>
>   


I think we have yet another case for moving the core bits of pg_dump 
into a library that can then be used by lots of clients. Until we do 
that we're going to get continual pressure to add extra cases to pg_dump 
unrelated to its principal functionality.


cheers

andrew



Re: Adding WHERE clause to pg_dump

From
Simon Riggs
Date:
On Sat, 2008-07-26 at 07:47 -0400, Andrew Dunstan wrote:
> 
> Simon Riggs wrote:

> > As a dev tool it makes sense.
> >

> I think we have yet another case for moving the core bits of pg_dump 
> into a library that can then be used by lots of clients. Until we do 
> that we're going to get continual pressure to add extra cases to pg_dump 
> unrelated to its principal functionality.

That's a good idea and I support that.

I'm slightly suprised at the "principal functionality" bit. In a world
where PITR exists the role and importance of pg_dump has waned
considerably. What *is* its principal function? Does it have just one?

One man's dev system is another man's data warehouse, or another man's
backup. The meaning of a dump is defined by the user making the data
dump, not the tool used.

Is this one option sufficient to make us invent pg_make_dev_database?
(With all pg_dump options, plus -w). If that's what we need, fine by me.
I'm always interested in the capability not the structure/naming.

-- Simon Riggs           www.2ndQuadrant.comPostgreSQL Training, Services and Support



Re: Adding WHERE clause to pg_dump

From
Andrew Dunstan
Date:

Simon Riggs wrote:
> In a world
> where PITR exists the role and importance of pg_dump has waned
> considerably. What *is* its principal function? Does it have just one?
>
>
>   

I think that's probably a rather narrow perspective.

PITR doesn't work across versions or architectures or OSes. And if 
you're using it for failover, then using it for standalone backups as 
well means you will need a custom archive_command which can be a bit 
tricky to get right. And a custom dump is almost always far smaller than 
a PITR dump, even when it's compressed.

I suspect that the vast majority of our users are still using pg_dump to 
make normal backups, and that it works quite happily for them. It's 
really only when databases get pretty large that this becomes 
unmanageable. I think using pg_dump for backups and PITR for failover is 
a good combination for a great many users.

So, IMNSHO, making a full database backup is still pg_dump's principal 
function.

cheers

andrew




Re: Adding WHERE clause to pg_dump

From
Simon Riggs
Date:
On Sat, 2008-07-26 at 09:08 -0400, Andrew Dunstan wrote:

> So, IMNSHO, making a full database backup is still pg_dump's principal 
> function.

Making copies for development databases is also a common use case, and
if not more common than backups, at least not far behind. This was my
stated use case.

>From my perspective, this should be fairly simple
* do we agree the use case is a problem we care about?
* do we agree the proposal would help that use case?
* whats the best way to package that feature?

If we wish to protect pg_dump's role, then lets have another utility or
some packaging that can be used for its other hidden roles. That sounds
like we might all agree on that. pg_dev_dump? How should it look?

-- Simon Riggs           www.2ndQuadrant.comPostgreSQL Training, Services and Support



Re: Adding WHERE clause to pg_dump

From
Andrew Dunstan
Date:

Simon Riggs wrote:
> If we wish to protect pg_dump's role, then lets have another utility or
> some packaging that can be used for its other hidden roles. That sounds
> like we might all agree on that. pg_dev_dump? How should it look?
>
>   

Actually, if we libraryise pg_dump and add some corresponding \ commands 
to psql, then this would possibly be unnecessary .

cheers

andrew