Thread: Feature: give pg_dump a WHERE clause expression
Greetings, I have developed a fairly simple patch to the pg_dump utility, against the version 8.3.1 source code. It adds a new parameter, -w/--where=EXPR, which lets you specify an expression to be used in a WHERE clause when the data is dumped. I have implemented and tested that it works when generating either COPY statements (the default) or INSERT statements (-d and -D); these two modes of operation have two different sections of code that select the data to be dumped.

Though this change could arguably be removed, when a -w/--where expression is specified it is also noted in the comments of the dump output, so anyone viewing the dump can see that it is not necessarily all of the data. When -w/--where is not specified, the dump output is exactly as if this patch had not been applied.

I've also updated the pg_dump.sgml file to describe the new flag. The code changes should conform to the existing code style within pg_dump. The patch should be applied from the root of the source tree with the -p1 option to the patch command.

Please give any feedback if the patch needs improvement. Thanks for a great DB!
Davy Durham <pubaddr5@davyandbeth.com> writes: > I have added a new parameter, -w/--where=EXPR > This lets you specify an expression that will be used in a WHERE > clause when the data is dumped. This seems pretty poorly thought out. It can hardly work in a dump of more than one table, which means that there's not any real reason to use pg_dump at all. Just do a COPY (SELECT ...) TO somefile. regards, tom lane
On Sun, 2008-06-01 at 10:43 -0400, Tom Lane wrote: > Davy Durham <pubaddr5@davyandbeth.com> writes: > > I have added a new parameter, -w/--where=EXPR > > This lets you specify an expression that will be used in a WHERE > > clause when the data is dumped. > > This seems pretty poorly thought out. It can hardly work in a dump > of more than one table, which means that there's not any real reason > to use pg_dump at all. Just do a COPY (SELECT ...) TO somefile. > > regards, tom lane >

Well, my primary reason for writing the patch was to have a standard SQL file using INSERT statements, in order to load some of a table's data into a database other than PostgreSQL which does not support the COPY statement. I'll admit that the single where clause would often not be applicable across all tables in a database, but when pg_dump is told specific tables to dump (a nice existing feature of pg_dump for doing something specialized other than a simple entire-database backup), then it can be useful.

My particular case is that I have several tables that are simple event logs. Each table has a timestamp column. I'm periodically bringing these tables into sync on another database, and I only want to pull rows newer than the last sync. So a where clause of 'ts > $last_sync' works for me. However, I'm sure there are other uses too.

== Thinking Further ==

Beyond serving my own needs, I'm trying to generically extend the general idea that pg_dump already supports:
1) pg_dump can be made to dump an entire database
2) pg_dump can be made to dump only requested tables
3) [my addition] pg_dump can be made to dump only requested rows from requested tables

However, it's not so generic in that the one where clause applies to all tables. So, if this patch is not acceptable as-is, what would you feel about this: I could enhance the -t/--table=NAME option to accept more than a simple NAME.
Rather, it could accept something in the form: --table=<table_name>:<where-clause expression> For example, pg_dump --table='foo:col1 > 10 AND f2 < 14'

Currently, the user can specify -t/--table multiple times to have more than one table dumped, or can pass a pattern to a single -t option to request multiple tables. This way, a user could specify a WHERE clause per table he has requested to dump. Granted, the WHERE clause may then apply to multiple tables if a pattern was used, but that may very well be desirable to the user.

Unless you disagree, this is a more generic solution (than what my patch contains) to allowing the user of pg_dump to further refine what they wish to dump. Thoughts? Thanks for the feedback.
Davy Durham <pubaddr5@davyandbeth.com> writes: > So, if this patch is not acceptable as-is, what would you feel about > this: > I could enhance the -t/--table=NAME option to accept more than a > simple NAME. Rather it could accept something in the form: > --table=<table_name>:<where-clause expression> Well, that would at least address the complaint that it doesn't scale to multiple tables, but the whole thing still seems like a frammish that will never see enough use to justify maintaining it. (BTW, what will you do with a table whose name contains a colon?) regards, tom lane
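The splitting ambiguity Tom raises can be sketched in a few lines of shell. The --table=<name>:<where> syntax here is the proposed (never merged) form, shown only as a parsing exercise: a naive split on the first ':' misparses any table name that itself contains a colon.

```shell
# Hypothetical sketch of splitting the proposed --table=<name>:<where> form.
split_table_arg() {
  local arg=$1
  local name=${arg%%:*}    # text before the first ':'
  local where=${arg#*:}    # text after the first ':'
  printf 'table=%s where=%s\n' "$name" "$where"
}

split_table_arg 'foo:col1 > 10 AND f2 < 14'   # works: table=foo
split_table_arg '"my:table":ts > now()'       # misparses: name stops at the first ':'
```

A quoted identifier such as "my:table" would need either colon escaping or a parser aware of double-quoting, which is the maintenance cost Tom is alluding to.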
Tom Lane wrote: > Davy Durham <pubaddr5@davyandbeth.com> writes: > >> So, if this patch is not acceptable as-is, what would you feel about >> this: >> I could enhance the -t/--table=NAME option to accept more than a >> simple NAME. Rather it could accept something in the form: >> > > >> --table=<table_name>:<where-clause expression> >> > > Well, that would at least address the complaint that it doesn't scale > to multiple tables, but the whole thing still seems like a frammish > that will never see enough use to justify maintaining it. > > (BTW, what will you do with a table whose name contains a colon?) > > > ISTM this would be better off waiting until we turn large parts of pg_dump into a library, as has been often discussed, at which point it should be relatively simple to write a custom client to do what the OP wants. I agree that it does not at all belong in pg_dump. cheers andrew
On Sun, Jun 01, 2008 at 04:13:34PM -0400, Andrew Dunstan wrote: > > > Tom Lane wrote: > >Davy Durham <pubaddr5@davyandbeth.com> writes: > > > >>So, if this patch is not acceptable as-is, what would you feel about > >>this: > >> I could enhance the -t/--table=NAME option to accept more than a > >> simple NAME. Rather it could accept something in the form: > >> > > > > > >> --table=<table_name>:<where-clause expression> > >> > > > >Well, that would at least address the complaint that it doesn't scale > >to multiple tables, but the whole thing still seems like a frammish > >that will never see enough use to justify maintaining it. > > > >(BTW, what will you do with a table whose name contains a colon?) > > > > > > > > ISTM this would be better off waiting until we turn large parts of > pg_dump into a library, as has been often discussed, at which point it > should be relatively simple to write a custom client to do what the OP > wants. I agree that it does not at all belong in pg_dump. I can't imagine many of my clients ever writing another C program or even being willing to pay me to do so. While modularizing pg_dump is a fine idea, I don't think it addresses the same set of use cases and users as this proposal. -dg -- David Gould daveg@sonic.net 510 536 1443 510 282 0869 If simplicity worked, the world would be overrun with insects.
On Sun, 2008-06-01 at 15:47 -0400, Tom Lane wrote: > Davy Durham <pubaddr5@davyandbeth.com> writes: > > So, if this patch is not acceptable as-is, what would you feel about > > this: > > I could enhance the -t/--table=NAME option to accept more than a > > simple NAME. Rather it could accept something in the form: > > > --table=<table_name>:<where-clause expression> > > Well, that would at least address the complaint that it doesn't scale > to multiple tables, but the whole thing still seems like a frammish > that will never see enough use to justify maintaining it. >

The code is not all that much to maintain as it is, and making it an addition to an existing parameter instead of a new one may not increase the code size by much at all. BTW, I looked it up, and mysqldump supports an option like mine, but it too is global for all tables and cannot be specified per table, AFAICT.

> (BTW, what will you do with a table whose name contains a colon?) >

I thought about that, but didn't know whether table names could contain a colon; I see now that it is possible by enclosing the table name in double-quotes. I suppose the user could escape the colon, as I believe they may already have to do if a table name contains '*' or '?'. Is there another separator character that would be more appropriate?

Another option I just thought of is to leave the -w/--where flag in place, but have it apply to all subsequent -t/--table flags. So you could do something like: pg_dump -w 'expr1' -t tab1 -t tab2 -w 'expr2' -t tab3 so that expr1 filters tab1 and tab2, and expr2 filters tab3. This would be a simple modification to the existing patch: track the where clause per table rather than in a single global variable. However, the order in which arguments are given to pg_dump then becomes significant, which you may object to. But then again, if the order of the tables in the dump file matters, the -t/--table flags already determine the order in which the tables are dumped.
(By pointing this out, I mean argument order may already matter in some circumstances.) This also avoids the ':' syntax problem you mentioned above. > regards, tom lane >
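The order-dependent semantics Davy proposes can be illustrated with a small client-side argument-parsing sketch. This is a hypothetical demo of the proposed behavior (each -w sets the filter for every -t that follows it), not anything pg_dump actually does:

```shell
# Hypothetical demo: each -w applies to the -t flags that follow it.
parse_dump_args() {
  local where=''
  while [ $# -gt 0 ]; do
    case $1 in
      -w) where=$2; shift 2 ;;
      -t) printf '%s: %s\n' "$2" "${where:-<no filter>}"; shift 2 ;;
      *)  shift ;;
    esac
  done
}

parse_dump_args -w 'expr1' -t tab1 -t tab2 -w 'expr2' -t tab3
```

Running this prints tab1 and tab2 paired with expr1, and tab3 with expr2, which is exactly the pairing the proposal describes without needing any ':' separator.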
daveg wrote: >> ISTM this would be better off waiting until we turn large parts of >> pg_dump into a library, as has been often discussed, at which point it >> should be relatively simple to write a custom client to do what the OP >> wants. I agree that it does not at all belong in pg_dump. >> > > I can't imagine many of my clients ever writing another C program or even > being willing to pay me to do so. While modularizing pg_dump is a fine idea, > I don't think it addresses the same set of use cases and users as this > proposal. > > > It's not clear to me that your use case is very compelling. Does your foreign database not support import via CSV or XML? Postgres can now produce both of these for any arbitrary query. cheers andrew
On Sun, Jun 01, 2008 at 04:40:15PM -0400, Andrew Dunstan wrote: > daveg wrote: > > > >I can't imagine many of my clients ever writing another C program or even > >being willing to pay me to do so. While modularizing pg_dump is a fine > >idea, > >I don't think it addresses the same set of use cases and users as this > >proposal. > > It's not clear to me that your use case is very compelling. Does your > foreign database not support import via CSV or XML? Postgres can now > produce both of these for any arbitrary query. The foreign database in question is postgresql. The feature that the proposed patch enables is to create pg_dump custom format archives for multiple tables with a predicate. No amount of csv or xml will do that. Contrived example: pg_dump -Fc --table="*._stats:where ts >= now()::date" -f todays_stats.pgd If I have not been successful in explaining this clearly, please reply privately to avoid cluttering the list. If you simply disagree about the usefulness of the feature, I'm fine with that. -dg -- David Gould daveg@sonic.net 510 536 1443 510 282 0869 If simplicity worked, the world would be overrun with insects.
* daveg (daveg@sonic.net) wrote: > The feature that the proposed patch enables is to create pg_dump custom > format archives for multiple tables with a predicate. No amount of csv or > xml will do that. Contrived example: Uh, pg_dump's custom format really isn't particularly special, to be honest. Is there some reason you're interested in using it over, as was suggested, COPY/CSV/etc format? You could, of course, gzip the output of COPY (as pg_dump does) if you're concerned about space.. Thanks, Stephen
* Davy Durham (pubaddr5@davyandbeth.com) wrote: > The code is not all that much to maintain as it is, and making it an > addition to an existing parameter instead of a new one may not increase > the code size by much more at all. I'm generally in favor of 'creature-comfort' kind of additions, but I have to agree with Tom on this one. I don't see much of a use case, and it's trivial to create a shell-script to do what you're looking for with psql and COPY statements. Thanks, Stephen
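The shell-script route Stephen mentions can be sketched as follows: emit one \copy (SELECT ...) per table with the predicate inlined, then pipe the generated script to psql. Table names, the predicate, and the output paths here are all placeholders.

```shell
# Sketch (placeholder names): generate a psql script of filtered \copy commands.
copy_script() {
  local where=$1; shift
  local t
  for t in "$@"; do
    printf "\\\\copy (SELECT * FROM %s WHERE %s) TO '%s.copy'\n" "$t" "$where" "$t"
  done
}

copy_script 'ts > now()::date' tab1 tab2
# usage: copy_script 'ts > now()::date' tab1 tab2 | psql mydb
```

This covers the data-extraction half of the request in a few lines, though it produces COPY-format files rather than INSERT statements.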
On Sun, 2008-06-01 at 20:53 -0400, Stephen Frost wrote: > * Davy Durham (pubaddr5@davyandbeth.com) wrote: > > The code is not all that much to maintain as it is, and making it an > > addition to an existing parameter instead of a new one may not increase > > the code size by much more at all. > > I'm generally in favor of 'creature-comfort' kind of additions, but I > have to agree with Tom on this one. I don't see much of a use case, and > it's trivial to create a shell-script to do what you're looking for with > psql and COPY statements. > To reiterate, it is not possible to use the COPY command to create standard SQL INSERT statements that can be conveniently loaded by another db. No?
* Davy Durham (pubaddr5@davyandbeth.com) wrote: > To reiterate, it is not possible to use the COPY command to create > standard SQL INSERT statements that can be conveniently loaded by > another db. No? Erm, thankfully, PostgreSQL (what you're loading the data into?) can take more than just SQL INSERT statements. Of course, on the other hand, you *could* use COPY to create SQL INSERT statements through an appropriately crafted query. PG makes that reasonably straight-forward, actually. Stephen
On Sun, 2008-06-01 at 22:02 -0400, Stephen Frost wrote: > * Davy Durham (pubaddr5@davyandbeth.com) wrote: > > To reiterate, it is not possible to use the COPY command to create > > standard SQL INSERT statements that can be conveniently loaded by > > another db. No? > > Erm, thankfully, PostgreSQL (what you're loading the data into?) can > take more than just SQL INSERT statements. > No, the database I'm loading into is not PostgreSQL, but that's okay. I think that was someone else. > Of course, on the other hand, you *could* use COPY to create SQL INSERT > statements through an appropriately crafted query. PG makes that > reasonably straight-forward, actually. > > Stephen
* Davy Durham (pubaddr5@davyandbeth.com) wrote: > No, the database I'm loading into is not PostgreSQL, but that's okay. I > think that was someone else. This gets back to the other question then- what database is it and does it support CSV or XML imports? > > Of course, on the other hand, you *could* use COPY to create SQL INSERT > > statements through an appropriately crafted query. PG makes that > > reasonably straight-forward, actually. This is an option too, of course.. select 'insert into XYZ VALUES (' || coalesce(quote_literal(col1),'NULL') || ',' || coalesce(quote_literal(col2),'NULL') || ');'; or so. I suppose it might be interesting to consider an 'insert-format' output for COPY, which pg_dump could possibly be refactored to use when requested. It'd be nice if it was easier to have COPY support more formats but right now it's kind of a pain. Stephen
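The quote_literal trick Stephen shows can also be done client-side. Here is a minimal sketch that turns tab-separated values (psql -At style, with \N for NULL) into an INSERT statement. It is deliberately simplified: every non-NULL value is quoted as text, and only the \N escape is handled.

```shell
# Simplified sketch: build an INSERT from field values (\N means NULL).
to_insert() {
  local table=$1; shift
  local vals='' v
  for v in "$@"; do
    if [ "$v" = '\N' ]; then
      v=NULL
    else
      v="'$(printf '%s' "$v" | sed "s/'/''/g")'"   # double embedded quotes
    fi
    vals=${vals:+$vals,}$v
  done
  printf 'INSERT INTO %s VALUES (%s);\n' "$table" "$vals"
}

to_insert xyz 42 "O'Brien" '\N'
```

A real converter would also need to undo COPY's other backslash escapes and cast non-text columns, which is part of why an 'insert-format' output built into COPY would be attractive.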
Davy Durham <pubaddr5@davyandbeth.com> writes: > To reiterate, it is not possible to use the COPY command to create > standard SQL INSERT statements that can be conveniently loaded by > another db. No? Fair point, but the question here is about how useful this incremental feature really is compared to its incremental maintenance cost. It is *possible* to achieve what you want without any pg_dump changes: create a fresh table via CREATE TABLE tmp_table AS SELECT * FROM my_table WHERE ... and then pg_dump that. So the issue is how often does this problem come up and is it worth maintaining more code to make it a bit easier to deal with? My thought is probably not, and that seems to be the majority opinion so far. regards, tom lane
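Tom's workaround can be scripted in three steps. Shown here as a dry run (the commands are echoed, not executed) with placeholder database and table names:

```shell
# Dry-run sketch of the CREATE TABLE AS workaround (placeholder names):
# materialize the filtered rows, dump just that table, then drop it.
# Remove the echoes to run against a real server.
workaround_cmds() {
  local db=$1 table=$2 where=$3
  echo "psql $db -c \"CREATE TABLE tmp_table AS SELECT * FROM $table WHERE $where\""
  echo "pg_dump --table=tmp_table -f filtered.sql $db"
  echo "psql $db -c \"DROP TABLE tmp_table\""
}

workaround_cmds mydb my_table 'ts > now()::date'
```

The cost is an extra copy of the filtered rows on the server and the loss of the original table's DDL in the dump, which is the "bit easier" convenience the patch was trying to buy.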
On Sun, 2008-06-01 at 22:19 -0400, Stephen Frost wrote: > > or so. I suppose it might be interesting to consider an 'insert-format' > output for COPY, which pg_dump could possibly be refactored to use when > requested. It'd be nice if it was easier to have COPY support more > formats but right now it's kind of a pain.

That also sounds reasonable to me. It seems like a worthy format for COPY to output, and sure, pg_dump could then be simplified a bit by using it. Beyond that, the easiest approach would probably be for pg_dump to let the user pass the very keyword (or full COPY statement parameters) that COPY would use as the output format, rather than implementing flags for everything COPY already supports; perhaps keep the existing -d/-D flags for backwards compatibility. This makes both pg_dump and COPY more flexible for use cases like mine.
Davy Durham <pubaddr5@davyandbeth.com> writes: > On Sun, 2008-06-01 at 22:19 -0400, Stephen Frost wrote: >> or so. I suppose it might be interesting to consider an 'insert-format' >> output for COPY, which pg_dump could possibly be refactored to use when >> requested. > And sure.. pg_dump could then be simplified a bit by using this. Not really. pg_dump has to support old server versions, so you won't get any meaningful simplification by moving functionality out of pg_dump into the backend; at least not before you've forgotten having made the change :-( regards, tom lane
On Sun, Jun 01, 2008 at 08:50:13PM -0400, Stephen Frost wrote: > * daveg (daveg@sonic.net) wrote: > > The feature that the proposed patch enables is to create pg_dump custom > > format archives for multiple tables with a predicate. No amount of csv or > > xml will do that. Contrived example: > > Uh, pg_dump's custom format really isn't particularly special, to be > honest. Is there some reason you're interested in using it over, as was > suggested, COPY/CSV/etc format? You could, of course, gzip the output > of COPY (as pg_dump does) if you're concerned about space..

You really underrate the power and utility of the pg_dump/pg_restore combination. These are very useful tools. I sense that they are sometimes considered ugly stepchildren, in need of comprehensive reformation, but that glorious future should not stand in the way of buttering today's bread.

Use case: I have a client with 50+ databases on 40+ hosts with more coming online all the time; they have a couple dozen tables (the set varies over time) on each db that need to have summaries of daily activity posted to a central database. These also need to be backed up for offline archiving. The archives are sometimes selectively re-loaded to diagnose or verify events in the past.

Of course something can always be written to do most of this, or Skytools or even Slony might be useful for the posting part, but the proposed pg_dump feature pretty much handles the whole thing in a very nice way that will require about 10 lines of shell script and will work unchanged even as tables and databases are added and dropped from the set. The operations staff at the site are familiar with pg_dump and selective restores using the list options -l/-L etc, so it would be very easy for them to use. It is also an advantage that the pg_dump archives contain a complete set of related data in one container that can be loaded, or selectively loaded, in one command instead of picking through thousands of csv or copy files.
pg_dump also handles issues like permissions, ownership, indexes, and constraints, which a csv file does not. The argument that one could just use copy and tar and gzip applies to the whole existence of pg_dump itself: pg_dump is just a more convenient way to copy out tables and as such does not NEED to exist at all. However, it is convenient that it does exist and has convenient features to selectively handle schemas and tables. What is being suggested is that pg_dump take advantage of a new feature added to postgresql (copy with a where clause) to provide even more convenience and power than it already has.

If the patch is intrusive, or ugly, or interferes with future plans, or is likely to confuse people, then certainly it should not go in. But it seems trivial in terms of code and complexity and does no other harm. Merely because one does not have a use case at present does not mean that use cases don't exist. The patch submitter (Davy) seems to have one, and I (dg) have at least two separate client sites that could use this immediately. Oddly, or perhaps predictably, Davy's use case is quite different from mine, and has no appeal to me at all, but the one feature combined with the existing abilities of pg_dump covers both.

I'll stop now; this is not that big a deal, but it seems like a useful idea and worth more consideration than it has received. -dg -- David Gould daveg@sonic.net 510 536 1443 510 282 0869 If simplicity worked, the world would be overrun with insects.
On Sun, Jun 01, 2008 at 02:08:47PM -0500, Davy Durham wrote: > On Sun, 2008-06-01 at 10:43 -0400, Tom Lane wrote: > > Davy Durham <pubaddr5@davyandbeth.com> writes: > > > I have added a new parameter, -w/--where=EXPR This lets you > > > specify an expression that will be used in a WHERE clause when > > > the data is dumped. > > > > This seems pretty poorly thought out. It can hardly work in a > > dump of more than one table, which means that there's not any real > > reason to use pg_dump at all. Just do a COPY (SELECT ...) TO > > somefile. > > Well, my primary reason for writing the patch was to have a standard > SQL file using INSERT statements in order to load the some of a > table's data into a database other than postgresql which does not > support the COPY statement. Have you tried DBI-Link <http://pgfoundry.org/projects/dbi-link/>? Cheers, David. -- David Fetter <david@fetter.org> http://fetter.org/ Phone: +1 415 235 3778 AIM: dfetter666 Yahoo!: dfetter Skype: davidfetter XMPP: david.fetter@gmail.com Remember to vote! Consider donating to Postgres: http://www.postgresql.org/about/donate
"daveg" <daveg@sonic.net> writes: > The argument that one could just use copy and tar and gzip applies to the > whole existance of pg_dump itself. pg_dump is just a more convenient way to > copy out tables and as such does not NEED to exist at all. However it is > convenient to that is does exist and does have convenient features to > selectively handle schemas and tables. That's not really true, pg_dump does a *lot* more work than just group table data together. Its real value comes in reconstructing DDL for all the objects and understanding the dependencies between them. That's functionality that isn't available elsewhere and can't be reproduced without just reimplementing pg_dump. -- Gregory Stark EnterpriseDB http://www.enterprisedb.com Ask me about EnterpriseDB's RemoteDBA services!
Gregory Stark <stark@enterprisedb.com> writes: > "daveg" <daveg@sonic.net> writes: >> The argument that one could just use copy and tar and gzip applies to the >> whole existance of pg_dump itself. > That's not really true, pg_dump does a *lot* more work than just group table > data together. Its real value comes in reconstructing DDL for all the objects > and understanding the dependencies between them. That's functionality that > isn't available elsewhere and can't be reproduced without just reimplementing > pg_dump. Indeed. The other altar that pg_dump spends a lot of time worshipping at is cross-version compatibility. It's got quite enough on its plate without people trying to convert it into an ETL tool. Which is not to say that we don't need an ETL tool, just that it should be separate from pg_dump. regards, tom lane