Re: Add --include-table-data-where option to pg_dump, to export onlya subset of table data - Mailing list pgsql-hackers

From Carter Thaxton
Subject Re: Add --include-table-data-where option to pg_dump, to export onlya subset of table data
Date
Msg-id CAGiT_HMrELHpU2YVM=o5VOv0Q6VyD2=NcVQ6xSrPy56ose_-Uw@mail.gmail.com
Whole thread Raw
In response to Re: Add --include-table-data-where option to pg_dump, to export onlya subset of table data  (Euler Taveira <euler@timbira.com.br>)
List pgsql-hackers

How would you handle foreign keys? It seems easier to produce a dump
that won't restore.

This proposal will not attempt to be smart about foreign keys or anything like that.  I don't believe that would even be expected.

> We have the existing options:
>   --include-table=table    (and its -t synonym)
>   --exclude-table=table
>   --exclude-table-data=table
>
> I propose a new option:
>   --include-table-data-where=table:filter_clause
>
I remembered an old thread [1]. At that time pg_dump was not so
decoupled from the backend. We are far from being decoupled in a way
that someone can write his own pg_dump using only calls from a
library. I'm not sure pg_dump is the right place to add another ETL
parameter. We already have too much parameters that could break a
restore (flexibility is always welcome but too much is not so good).

In general, I agree with your sentiment that we don't want too much flexibility in this tool.  However, this just seems like a very obvious missing feature to me.  I was frankly surprised that pg_dump didn't already have it.

I've designed this feature so that it behaves like a more flexible version between --exclude-table-data and --include-table.  Instead of dumping the schema and zero rows, or the schema and all of the rows, it dumps the schema and some specific rows.

Providing "--include-table-data-where=table:false" behaves exactly like --exclude-table-data, and "--include-table-data-where=table:true" behaves exactly like --include-table.
It does no more or less to prevent a restore.  Given that --exclude-table-data already exists, this seems to introduce no new issues with restore.

 
>   pg_dump --include-table-data-where=largetable:"created_at >= '2018-05-01'"
> database_name
>
How would you check that that expression is correct?
 
The patch as already provided produces an error message and appropriate exit code during the dump process, presenting the invalid SQL that is produced as part of the WHERE clause.
I could see some value in refactoring it to provide error messages earlier in the process, but it's actually not bad as is.


Every parameter could quote its value. It means that your parameter have to escape the
quote in '2018-05-01'.

I don't understand.  The double quotes in my example are bash shell quotes.  There is no special quote parsing in this patch.  The single quotes are part of the WHERE clause.
Note that pg_dump already uses getopt_long, so it's not required to use the = symbol to separate option from its associated value.  So, it would also be fine to call as follows:

  pg_dump --include-table-data-where "largetable:created_at >= '2018-05-01'" database_name


Another problem is that your spec does not show
us how you would handle tables like Foo.Bar or "foo:bar" (colon have
to be escaped)?

Using a dot to separate the schema works just fine.  My proposal uses the same mechanism as --include-table, --exclude-table, and --exclude-table-data.  In fact, it even supports wildcards in those patterns.

Your point about a colon in the table name is interesting.  In all my years of working with PostgreSQL and other databases, I've never encountered a table name that contained a colon.  Perhaps an escape character, like \: could work.  Is there another separator character you would suggest, which is illegal in table names, but also intuitive as a separator?  Maybe a comma?

 
> The filter_clause is used as the contents of a WHERE clause when querying
> the data to generate the COPY statement produced by pg_dump.
>
You are forgetting about --inserts parameter. Could I use
--include-table-data-where and --inserts?

Yes, the --inserts parameter works just fine.  Perhaps I should have said "the COPY statement or INSERT statements".

pgsql-hackers by date:

Previous
From: Thomas Munro
Date:
Subject: Re: [sqlsmith] Unpinning error in parallel worker
Next
From: Carter Thaxton
Date:
Subject: Re: Add --include-table-data-where option to pg_dump, to export onlya subset of table data