Thread: Adding WHERE clause to pg_dump
Attached patch implements WHERE clauses for pg_dump.

This is useful for producing data samples of a database, e.g. pg_dump -w "ctid < '(1000,1)' or random() < 0.1", and can also be used for taking incremental backups, if data columns exist to make a partial dump sensible, e.g. pg_dump -w "last_update_timestamp > ....". Columns like this are very common because of the optimistic locking techniques used in many databases.

This is designed to be used in conjunction with the TOM utility, and the forthcoming patch to implement stats hooks. Taken together these features will make it possible to take a cut-down database environment for testing, yet with statistics matching the main production database.

It was easier to write it and then discuss, since I needed to check the feasibility of the idea before presenting it.

--
Simon Riggs
www.2ndQuadrant.com
PostgreSQL Training, Services and Support
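Assuming the proposed -w option simply filters each table's data rows and leaves the schema part of the dump untouched (database names below are invented), the dev-database use case becomes a two-liner:

createdb devdb
pg_dump -w "ctid < '(1000,1)' or random() < 0.1" proddb | psql devdb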
Simon Riggs <simon@2ndquadrant.com> writes:
> Attached patch implements WHERE clauses for pg_dump.

I still have serious reservations about adding such an ugly, non-orthogonal wart to pg_dump. Why is it not appropriate to just do a COPY (SELECT ...) TO STDOUT when you need this?

			regards, tom lane
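For a single table, the COPY form Tom mentions has been available since 8.2 and needs nothing new; for example, pulling a 10% sample of one (invented) table from the shell:

psql -d proddb -c "COPY (SELECT * FROM orders WHERE random() < 0.1) TO STDOUT" > orders.dat

The data can be reloaded later with an ordinary COPY orders FROM STDIN.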
On Fri, 2008-07-25 at 13:31 -0400, Tom Lane wrote:
> Simon Riggs <simon@2ndquadrant.com> writes:
> > Attached patch implements WHERE clauses for pg_dump.
>
> I still have serious reservations about adding such an ugly,
> non-orthogonal wart to pg_dump. Why is it not appropriate to just
> do a COPY (SELECT ...) TO STDOUT when you need this?

So you can dump a coherent sample database in one command, not 207.

Every user of PostgreSQL wants a dev/test database. If the database is large it isn't practical to take a complete copy. Nor is it practical to hand-write a data sampling extraction program, and if you do, it's usually imperfect in many ways.

Adding this feature gives a very fast capability to create sample databases, or incremental backups for many cases.

--
Simon Riggs
www.2ndQuadrant.com
PostgreSQL Training, Services and Support
On Fri, 2008-07-25 at 19:33 +0100, Simon Riggs wrote:
> On Fri, 2008-07-25 at 13:31 -0400, Tom Lane wrote:
> > Simon Riggs <simon@2ndquadrant.com> writes:
> > > Attached patch implements WHERE clauses for pg_dump.
> >
> > I still have serious reservations about adding such an ugly,
> > non-orthogonal wart to pg_dump. Why is it not appropriate to just
> > do a COPY (SELECT ...) TO STDOUT when you need this?
>
> So you can dump a coherent sample database in one command, not 207.
>
> Every user of PostgreSQL wants a dev/test database. If the database is
> large it isn't practical to take a complete copy. Nor is it practical to
> hand-write a data sampling extraction program, and if you do, it's usually
> imperfect in many ways.
>
> Adding this feature gives a very fast capability to create sample
> databases, or incremental backups for many cases.

Not sure I buy this argument. I am all for usability and I would be the first to shout about the general ridiculousness of pg_dump/all/restore, but in this case I think Tom is right. This feature could easily be done in a script without harassing pg_dump.

Sincerely,

Joshua D. Drake

--
The PostgreSQL Company since 1997: http://www.commandprompt.com/
PostgreSQL Community Conference: http://www.postgresqlconference.org/
United States PostgreSQL Association: http://www.postgresql.us/
Donate to the PostgreSQL Project: http://www.postgresql.org/about/donate
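A rough sketch of the kind of script being suggested, for the "changed rows only" case — dump the schema with pg_dump, then loop over every table that has the filter column. The column name and cutoff are invented, and quoting/error handling are only sketched:

#!/bin/sh
# Extract rows changed since a cutoff from every table that has a
# last_update_timestamp column; the schema still comes from pg_dump.
DB=proddb
CUTOFF="2008-07-01"

pg_dump -s "$DB" > schema.sql

psql -At -d "$DB" -c "
SELECT quote_ident(n.nspname) || '.' || quote_ident(c.relname)
  FROM pg_class c
  JOIN pg_namespace n ON n.oid = c.relnamespace
  JOIN pg_attribute a ON a.attrelid = c.oid
 WHERE c.relkind = 'r'
   AND n.nspname NOT IN ('pg_catalog', 'information_schema')
   AND a.attname = 'last_update_timestamp'
   AND NOT a.attisdropped" |
while read tab; do
    psql -d "$DB" -c "COPY (SELECT * FROM $tab WHERE last_update_timestamp > '$CUTOFF') TO STDOUT" > "$tab.dat"
done

Whether this counts as "easy" is the crux of the disagreement: the loop itself is short, but it quietly re-implements table discovery and quoting that pg_dump already does.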
On Fri, 2008-07-25 at 11:46 -0700, Joshua D. Drake wrote:
> On Fri, 2008-07-25 at 19:33 +0100, Simon Riggs wrote:
> > On Fri, 2008-07-25 at 13:31 -0400, Tom Lane wrote:
> > > Simon Riggs <simon@2ndquadrant.com> writes:
> > > > Attached patch implements WHERE clauses for pg_dump.
> > >
> > > I still have serious reservations about adding such an ugly,
> > > non-orthogonal wart to pg_dump. Why is it not appropriate to just
> > > do a COPY (SELECT ...) TO STDOUT when you need this?
> >
> > So you can dump a coherent sample database in one command, not 207.
> >
> > Every user of PostgreSQL wants a dev/test database. If the database is
> > large it isn't practical to take a complete copy. Nor is it practical to
> > hand-write a data sampling extraction program, and if you do, it's usually
> > imperfect in many ways.
> >
> > Adding this feature gives a very fast capability to create sample
> > databases, or incremental backups for many cases.
>
> Not sure I buy this argument. I am all for usability and I would be the
> first to shout about the general ridiculousness of pg_dump/all/restore,
> but in this case I think Tom is right. This feature could easily be done
> in a script without harassing pg_dump.

You can do it, yes. But it takes a lot longer. If the time to implement was similar, then I would immediately agree "feature available already".

pg_dump is not "harassed" by this. What is lost by adding this feature?

--
Simon Riggs
www.2ndQuadrant.com
PostgreSQL Training, Services and Support
On Fri, 2008-07-25 at 20:26 +0100, Simon Riggs wrote:
> On Fri, 2008-07-25 at 11:46 -0700, Joshua D. Drake wrote:
> > On Fri, 2008-07-25 at 19:33 +0100, Simon Riggs wrote:
> > > Adding this feature gives a very fast capability to create sample
> > > databases, or incremental backups for many cases.
> >
> > Not sure I buy this argument. I am all for usability and I would be the
> > first to shout about the general ridiculousness of pg_dump/all/restore,
> > but in this case I think Tom is right. This feature could easily be done
> > in a script without harassing pg_dump.
>
> You can do it, yes. But it takes a lot longer. If the time to implement
> was similar, then I would immediately agree "feature available already".
>
> pg_dump is not "harassed" by this. What is lost by adding this feature?

Gained. Code complexity. Right now pg_dump does a COPY. You are introducing a whole other level of complexity by adding WHERE clause capability.

Secondly, I don't think it would actually add anything but complexity for the user. How do we deal with this?

pg_dump -w "last_update_timestamp < ..." -t 'table*'

What I see is a recipe for inconsistent, un-restorable backups without a user realizing what they have done. The only way to deal with the above is:

1. Wildcards aren't allowed if you have -w
2. You dump everything; if the WHERE clause isn't relevant you just dump the whole table

I don't like either. I do see utility if you know what you are doing, but I think it makes more sense to have it outside of pg_dump.

Sincerely,

Joshua D. Drake

--
The PostgreSQL Company since 1997: http://www.commandprompt.com/
PostgreSQL Community Conference: http://www.postgresqlconference.org/
United States PostgreSQL Association: http://www.postgresql.us/
Donate to the PostgreSQL Project: http://www.postgresql.org/about/donate
On Fri, Jul 25, 2008 at 08:26:35PM +0100, Simon Riggs wrote:
> On Fri, 2008-07-25 at 11:46 -0700, Joshua D. Drake wrote:
> > On Fri, 2008-07-25 at 19:33 +0100, Simon Riggs wrote:
> > > On Fri, 2008-07-25 at 13:31 -0400, Tom Lane wrote:
> > > > Simon Riggs <simon@2ndquadrant.com> writes:
> > > > > Attached patch implements WHERE clauses for pg_dump.
> > > >
> > > > I still have serious reservations about adding such an ugly,
> > > > non-orthogonal wart to pg_dump. Why is it not appropriate to just
> > > > do a COPY (SELECT ...) TO STDOUT when you need this?
> > >
> > > So you can dump a coherent sample database in one command, not 207.
> > >
> > > Every user of PostgreSQL wants a dev/test database. If the database is
> > > large it isn't practical to take a complete copy. Nor is it practical to
> > > hand-write a data sampling extraction program, and if you do, it's usually
> > > imperfect in many ways.
> > >
> > > Adding this feature gives a very fast capability to create sample
> > > databases, or incremental backups for many cases.
> >
> > Not sure I buy this argument. I am all for usability and I would be the
> > first to shout about the general ridiculousness of pg_dump/all/restore,
> > but in this case I think Tom is right. This feature could easily be done
> > in a script without harassing pg_dump.
>
> You can do it, yes. But it takes a lot longer. If the time to implement
> was similar, then I would immediately agree "feature available already".
>
> pg_dump is not "harassed" by this. What is lost by adding this feature?

This was discussed at the beginning of June on the -patches list: Davy Durham submitted a patch to add WHERE clauses via a -w option, and then, in response to feedback, to attach one to each table given with -t. See the discussion here:

http://archives.postgresql.org/pgsql-patches/2008-06/msg00001.php

and the final patch here:

http://archives.postgresql.org/pgsql-patches/2008-06/msg00026.php

We now have two patches on this topic from different submitters, with different use cases supplied as justification. I have yet another use case not mentioned by either of the submitters and will probably hand-patch pg_dump locally to do so. I don't think at this point we should wave this off under the impression that no one really wants or needs it, as obviously some people want it enough to code it.

The other objections seem to be based on these themes:

- Code complexity. Davy's patch is quite simple. I haven't looked at Simon's yet.

- We need an ETL tool, so this should be preempted by that.

- pg_dump should be made into a library so this can be done separately. We don't generally allow imaginary futures to prevent us from adding useful functionality on other topics.

- This can be done with a script. Not really. The script would pretty much have to contain most of pg_dump. That's more than a script.

- Users could make partial dumps and be confused and lose data. Yes, but they can already do that with -n, -t, and the new pre-data and post-data switches. This is one more case where the default is a full dump but one can specifically request less.

I think that once COPY sprouted the ability to dump an arbitrary SELECT, it became almost inevitable that pg_dump would take advantage of it. How many patches on this topic do we want to ignore?

As you may have guessed by this point: +1

-dg

--
David Gould       daveg@sonic.net      510 536 1443    510 282 0869
If simplicity worked, the world would be overrun with insects.
On Fri, 2008-07-25 at 14:11 -0700, daveg wrote:
> On Fri, Jul 25, 2008 at 08:26:35PM +0100, Simon Riggs wrote:
>
> - This can be done with a script.
>
> Not really. The script would pretty much have to contain most of
> pg_dump. That's more than a script.

Yes really. :) The only thing pg_dump is buying you here is ease of schema pull. In a situation like this you would pull a pg_dump -s, then pull only the data you want based on a single transaction snapshot of the objects you are going to query.

> - users could make partial dumps and be confused and lose data.
>
> Yes, but they can already do that with -n, -t, and the new pre-data
> and post-data switches. This is one more case where the default is
> a full dump but one can specifically request less.

No they actually can't. You are guaranteed, regardless of a -n or -t flag, that the data you receive is consistent. You can't guarantee that with -w, because you could pull different data based on an arbitrary conditional that cannot apply to all objects.

Joshua D. Drake

--
The PostgreSQL Company since 1997: http://www.commandprompt.com/
PostgreSQL Community Conference: http://www.postgresqlconference.org/
United States PostgreSQL Association: http://www.postgresql.us/
Donate to the PostgreSQL Project: http://www.postgresql.org/about/donate
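The single-snapshot workflow Joshua describes can be written with just psql (8.2 or later), since every \copy issued inside one serializable transaction sees the same MVCC snapshot. Table names and predicates here are invented:

pg_dump -s proddb > schema.sql

psql -d proddb <<'EOF'
BEGIN ISOLATION LEVEL SERIALIZABLE;  -- in 8.x this gives one snapshot for the whole transaction
\copy (SELECT * FROM customers WHERE last_update_timestamp > '2008-07-01') to 'customers.dat'
\copy (SELECT * FROM orders WHERE last_update_timestamp > '2008-07-01') to 'orders.dat'
COMMIT;
EOF

Each extract then sees the same snapshot, which is the consistency property a plain pg_dump run gives for free.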
On Fri, 2008-07-25 at 14:11 -0700, daveg wrote:
> On Fri, Jul 25, 2008 at 08:26:35PM +0100, Simon Riggs wrote:
> >
> > On Fri, 2008-07-25 at 11:46 -0700, Joshua D. Drake wrote:
> > > On Fri, 2008-07-25 at 19:33 +0100, Simon Riggs wrote:
> > > > On Fri, 2008-07-25 at 13:31 -0400, Tom Lane wrote:
> > > > > Simon Riggs <simon@2ndquadrant.com> writes:
> > > > > > Attached patch implements WHERE clauses for pg_dump.
> > > > >
> > > > > I still have serious reservations about adding such an ugly,
> > > > > non-orthogonal wart to pg_dump. Why is it not appropriate to just
> > > > > do a COPY (SELECT ...) TO STDOUT when you need this?
> > > >
> > > > So you can dump a coherent sample database in one command, not 207.
> > > >
> > > > Every user of PostgreSQL wants a dev/test database. If the database is
> > > > large it isn't practical to take a complete copy. Nor is it practical to
> > > > hand-write a data sampling extraction program, and if you do, it's usually
> > > > imperfect in many ways.
> > > >
> > > > Adding this feature gives a very fast capability to create sample
> > > > databases, or incremental backups for many cases.
> > >
> > > Not sure I buy this argument. I am all for usability and I would be the
> > > first to shout about the general ridiculousness of pg_dump/all/restore,
> > > but in this case I think Tom is right. This feature could easily be done
> > > in a script without harassing pg_dump.
> >
> > You can do it, yes. But it takes a lot longer. If the time to implement
> > was similar, then I would immediately agree "feature available already".
> >
> > pg_dump is not "harassed" by this. What is lost by adding this feature?
>
> This was discussed at the beginning of June on the -patches list: Davy Durham
> submitted a patch to add WHERE clauses via a -w option, and then, in response
> to feedback, to attach one to each table given with -t. See the discussion here:
>
> http://archives.postgresql.org/pgsql-patches/2008-06/msg00001.php
>
> and the final patch here:
>
> http://archives.postgresql.org/pgsql-patches/2008-06/msg00026.php
>
> We now have two patches on this topic from different submitters, with
> different use cases supplied as justification.

Well, that is truly bizarre.

I had no idea about the existence of the other patch. I guess I must have been busy that week.

This was designed a while back in conjunction with other related thoughts. I still want an easy way to create a data sample for creating dev databases from large production systems.

I defer and apologise to the previous submitter, since he got there first, and apologise again for the noise.

(Cheeky code review: Davy's patch fails if used with the -o option, plus I think it outputs the wrong text into the dump file, AFAICS.)

--
Simon Riggs
www.2ndQuadrant.com
PostgreSQL Training, Services and Support
On Fri, 2008-07-25 at 14:29 -0700, Joshua D. Drake wrote:
> > - users could make partial dumps and be confused and lose data.
> >
> > Yes, but they can already do that with -n, -t, and the new pre-data
> > and post-data switches. This is one more case where the default is
> > a full dump but one can specifically request less.
>
> No they actually can't. You are guaranteed, regardless of a -n or -t
> flag, that the data you receive is consistent. You can't guarantee that
> with -w, because you could pull different data based on an arbitrary
> conditional that cannot apply to all objects.

But are you guaranteed that you have all tables in FK relationships? No. (But I like that capability also - it's useful.)

--
Simon Riggs
www.2ndQuadrant.com
PostgreSQL Training, Services and Support
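The FK gap is easy to demonstrate with stock pg_dump. Given a hypothetical orders table with a foreign key to customers:

pg_dump -t orders proddb | psql devdb

loads the orders rows, but recreating the foreign key fails with an error along the lines of 'relation "customers" does not exist', because -t does not follow FK relationships to pull in the referenced table.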
Simon Riggs wrote:
>
> Well, that is truly bizarre.
>
> I had no idea about the existence of the other patch. I guess I must
> have been busy that week.
>
> This was designed a while back in conjunction with other related
> thoughts. I still want an easy way to create a data sample for creating
> dev databases from large production systems.
>
> I defer and apologise to the previous submitter, since he got there
> first, and apologise again for the noise.
>
> (Cheeky code review: Davy's patch fails if used with the -o option, plus I
> think it outputs the wrong text into the dump file, AFAICS.)

Are you using my patch at http://archives.postgresql.org/pgsql-patches/2008-06/msg00026.php ? I'll be glad to fix it.
"Joshua D. Drake" <jd@commandprompt.com> writes: > How do we deal with this? > > pg_dump -w "last_update_timestamp < ..." -t 'table*' > > What I see is a recipe for inconsistent, un-restorable backups without a > user realizing what they have done. The only way to deal with the above > is: > > 1. Wildcards aren't allowed if you have -w > 2. You dump everything, if the WHERE clause isn't relevant you just dump > the whole table There's always 3. Apply the WHERE clause to all tables and if there's a table missing columns referenced in the where clause then failwith the appropriate error. Which seems like the right option to me. The tricky bit would be how to deal with cases where you want a different where clause for different tables. But even if it doesn't handle all cases that doesn't mean a partial solution is unreasonable. -- Gregory Stark EnterpriseDB http://www.enterprisedb.com Ask me about EnterpriseDB's On-Demand Production Tuning
On Fri, Jul 25, 2008 at 11:17:20PM +0100, Gregory Stark wrote:
> "Joshua D. Drake" <jd@commandprompt.com> writes:
>
> > How do we deal with this?
> >
> > pg_dump -w "last_update_timestamp < ..." -t 'table*'
> >
> > What I see is a recipe for inconsistent, un-restorable backups without a
> > user realizing what they have done. The only way to deal with the above
> > is:
> >
> > 1. Wildcards aren't allowed if you have -w
> > 2. You dump everything; if the WHERE clause isn't relevant you just dump
> > the whole table
>
> There's always
>
> 3. Apply the WHERE clause to all tables and if there's a table missing
> columns referenced in the where clause then fail with the appropriate
> error.
>
> Which seems like the right option to me. The tricky bit would be how to deal
> with cases where you want a different where clause for different tables. But
> even if it doesn't handle all cases that doesn't mean a partial solution is
> unreasonable.

Actually, Davy's patch does deal with the case "where you want a different where clause for different tables".

-dg

--
David Gould       daveg@sonic.net      510 536 1443    510 282 0869
If simplicity worked, the world would be overrun with insects.
On Fri, 2008-07-25 at 12:38 -0700, Joshua D. Drake wrote:

> Gained. Code complexity.

Hardly, the patch is very small. I would recognise that as a factor otherwise.

> What I see is a recipe for inconsistent, un-restorable backups without a
> user realizing what they have done.

I agree on the backup side, but then who would extract just a portion of their data for backup? It would be no backup at all.

If you did use this as part of an incremental backup scheme, then you would have to test it (just like any backup method). Incremental backups rarely have self-consistency except as part of a greater whole.

As a dev tool it makes sense.

--
Simon Riggs
www.2ndQuadrant.com
PostgreSQL Training, Services and Support
Simon Riggs wrote:
> On Fri, 2008-07-25 at 12:38 -0700, Joshua D. Drake wrote:
>
>> Gained. Code complexity.
>
> Hardly, the patch is very small. I would recognise that as a factor
> otherwise.
>
>> What I see is a recipe for inconsistent, un-restorable backups without a
>> user realizing what they have done.
>
> I agree on the backup side, but then who would extract just a portion of
> their data for backup? It would be no backup at all.
>
> If you did use this as part of an incremental backup scheme, then you
> would have to test it (just like any backup method). Incremental backups
> rarely have self-consistency except as part of a greater whole.
>
> As a dev tool it makes sense.

I think we have yet another case for moving the core bits of pg_dump into a library that can then be used by lots of clients. Until we do that we're going to get continual pressure to add extra cases to pg_dump unrelated to its principal functionality.

cheers

andrew
On Sat, 2008-07-26 at 07:47 -0400, Andrew Dunstan wrote:
>
> Simon Riggs wrote:
> > As a dev tool it makes sense.
>
> I think we have yet another case for moving the core bits of pg_dump
> into a library that can then be used by lots of clients. Until we do
> that we're going to get continual pressure to add extra cases to pg_dump
> unrelated to its principal functionality.

That's a good idea and I support that.

I'm slightly surprised at the "principal functionality" bit. In a world where PITR exists, the role and importance of pg_dump has waned considerably. What *is* its principal function? Does it have just one?

One man's dev system is another man's data warehouse, or another man's backup. The meaning of a dump is defined by the user making the data dump, not the tool used.

Is this one option sufficient to make us invent pg_make_dev_database? (With all pg_dump options, plus -w.) If that's what we need, fine by me. I'm always interested in the capability, not the structure/naming.

--
Simon Riggs
www.2ndQuadrant.com
PostgreSQL Training, Services and Support
Simon Riggs wrote:
> In a world where PITR exists, the role and importance of pg_dump has waned
> considerably. What *is* its principal function? Does it have just one?

I think that's probably a rather narrow perspective.

PITR doesn't work across versions or architectures or OSes. And if you're using it for failover, then using it for standalone backups as well means you will need a custom archive_command, which can be a bit tricky to get right. And a custom dump is almost always far smaller than a PITR dump, even when it's compressed.

I suspect that the vast majority of our users are still using pg_dump to make normal backups, and that it works quite happily for them. It's really only when databases get pretty large that this becomes unmanageable. I think using pg_dump for backups and PITR for failover is a good combination for a great many users.

So, IMNSHO, making a full database backup is still pg_dump's principal function.

cheers

andrew
On Sat, 2008-07-26 at 09:08 -0400, Andrew Dunstan wrote:

> So, IMNSHO, making a full database backup is still pg_dump's principal
> function.

Making copies for development databases is also a common use case, and if not more common than backups, it is at least not far behind. This was my stated use case.

From my perspective, this should be fairly simple:

* do we agree the use case is a problem we care about?
* do we agree the proposal would help that use case?
* what's the best way to package that feature?

If we wish to protect pg_dump's role, then let's have another utility or some packaging that can be used for its other hidden roles. That sounds like we might all agree on that. pg_dev_dump? How should it look?

--
Simon Riggs
www.2ndQuadrant.com
PostgreSQL Training, Services and Support
Simon Riggs wrote:
> If we wish to protect pg_dump's role, then let's have another utility or
> some packaging that can be used for its other hidden roles. That sounds
> like we might all agree on that. pg_dev_dump? How should it look?

Actually, if we libraryise pg_dump and add some corresponding \ commands to psql, then this would possibly be unnecessary.

cheers

andrew