Thread: proposal: possibility to read dumped table's name from file
Hi
one of my customers has to specify dumped tables name by name. After years of increasing database size and table count, he has a problem with the command-line length limit. He needs to read the list of tables from a file (or from stdin).
I wrote a simple PoC patch.
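Roughly, the core of the idea looks like this (a simplified sketch, not the attached patch itself; it assumes pg_dump's existing simple_string_list_append() from fe_utils/simple_list.h and, for brevity, a fixed-size line buffer, which the real patch avoids):

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include "fe_utils/simple_list.h"

    /*
     * Hypothetical helper: append each non-empty line of "filename"
     * (or stdin for "-") as an include-table pattern.
     */
    static void
    read_table_list(const char *filename, SimpleStringList *patterns)
    {
        FILE       *fp;
        char        buf[1024];

        fp = (strcmp(filename, "-") == 0) ? stdin : fopen(filename, "r");
        if (fp == NULL)
        {
            fprintf(stderr, "could not open file \"%s\"\n", filename);
            exit(1);
        }

        while (fgets(buf, sizeof(buf), fp) != NULL)
        {
            buf[strcspn(buf, "\r\n")] = '\0';   /* trim the line ending */
            if (buf[0] != '\0')
                simple_string_list_append(patterns, buf);   /* copies the string */
        }

        if (fp != stdin)
            fclose(fp);
    }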
Comments, notes, ideas?
Regards
Pavel
Attachment
Pavel Stehule <pavel.stehule@gmail.com> writes:
> one of my customers has to specify dumped tables name by name. After years of
> increasing database size and table count, he has a problem with the
> command-line length limit. He needs to read the list of tables from a file (or from stdin).

I guess the question is why. That seems like an enormously error-prone
approach. Can't they switch to selecting schemas? Or excluding the
hopefully-short list of tables they don't want?

regards, tom lane
pá 29. 5. 2020 v 16:28 odesílatel Tom Lane <tgl@sss.pgh.pa.us> napsal:
Pavel Stehule <pavel.stehule@gmail.com> writes:
> one of my customers has to specify dumped tables name by name. After years of
> increasing database size and table count, he has a problem with the
> command-line length limit. He needs to read the list of tables from a file (or from stdin).
I guess the question is why. That seems like an enormously error-prone
approach. Can't they switch to selecting schemas? Or excluding the
hopefully-short list of tables they don't want?
It is not a typical application. It is an analytic application where the database schema is based on a dynamic specification from the end user (the end user can customize it at any time). So the schema is very dynamic.
For example - a typical server has about four thousand databases, and every database has somewhere between 1K and 10K tables.
Another specific aspect is the different versions of data in different tables. A user can work with one set of data (one set of tables) while the application prepares a new set of data (new set of tables). Loading can be slow, because sometimes bigger tables are filled (about forty GB). pg_dump backs up one set of tables (a little bit like a snapshot of the data). So it is a strange, but successful, OLAP application.
Regards
Pavel
regards, tom lane
On Fri, May 29, 2020 at 04:21:00PM +0200, Pavel Stehule wrote:
> Hi
>
> one of my customers has to specify dumped tables name by name. After years of
> increasing database size and table count, he has a problem with the
> command-line length limit. He needs to read the list of tables from a file (or from stdin).
>
> I wrote a simple PoC patch.
>
> Comments, notes, ideas?

This seems like a handy addition. What I've done in cases similar to this was
to use `grep -f` on the output of `pg_restore -l` to create a file suitable
for `pg_restore -L`, or mash them together like this:

pg_restore -L <(pg_restore -l /path/to/dumpfile | grep -f /path/to/listfile) -d new_db /path/to/dumpfile

That's a lot of shell magic and obscure corners of commands to expect people
to use.

Would it make sense to expand this patch to handle other objects?

Best,
David.
--
David Fetter <david(at)fetter(dot)org> http://fetter.org/
Phone: +1 415 235 3778

Remember to vote!
Consider donating to Postgres: http://www.postgresql.org/about/donate
David Fetter <david@fetter.org> writes:
> Would it make sense to expand this patch to handle other objects?

If we're gonna do something like this, +1 for being more general.
The fact that pg_dump only has selection switches for tables and
schemas has always struck me as an omission.

regards, tom lane
pá 29. 5. 2020 v 18:03 odesílatel Tom Lane <tgl@sss.pgh.pa.us> napsal:
David Fetter <david@fetter.org> writes:
> Would it make sense to expand this patch to handle other objects?
Sure. We just need to design the system (and the names of the options).
If we're gonna do something like this, +1 for being more general.
The fact that pg_dump only has selection switches for tables and
schemas has always struck me as an omission.
The implementation is trivial; the hard part is a good design of the names for these options.
Pavel
regards, tom lane
On Fri, May 29, 2020 at 04:21:00PM +0200, Pavel Stehule wrote:
> one of my customers has to specify dumped tables name by name. After years of
> increasing database size and table count, he has a problem with the
> command-line length limit. He needs to read the list of tables from a file (or from stdin).

+1 - we would use this.

We put a regex (actually a pg_dump pattern) of tables to skip (timeseries
partitions which are older than a few days and which are also dumped once not
expected to change, and typically not redumped). We're nowhere near the
execve() limit, but it'd be nice if the command was primarily a list of options
and not a long regex.

Please also support reading from file for --exclude-table=pattern.

I'm drawing a parallel between this and rsync --include/--exclude and --filter.
We'd be implementing a new --filter, which might have similar syntax to rsync
(which I always forget).

--
Justin
Hi
pá 29. 5. 2020 v 20:25 odesílatel Justin Pryzby <pryzby@telsasoft.com> napsal:
On Fri, May 29, 2020 at 04:21:00PM +0200, Pavel Stehule wrote:
> one of my customers has to specify dumped tables name by name. After years of
> increasing database size and table count, he has a problem with the
> command-line length limit. He needs to read the list of tables from a file (or from stdin).
+1 - we would use this.
We put a regex (actually a pg_dump pattern) of tables to skip (timeseries
partitions which are older than a few days and which are also dumped once not
expected to change, and typically not redumped). We're nowhere near the
execve() limit, but it'd be nice if the command was primarily a list of options
and not a long regex.
Please also support reading from file for --exclude-table=pattern.
I'm drawing a parallel between this and rsync --include/--exclude and --filter.
We'd be implementing a new --filter, which might have similar syntax to rsync
(which I always forget).
I implemented support for all "repeated" pg_dump options.
--exclude-schemas-file=FILENAME
--exclude-tables-data-file=FILENAME
--exclude-tables-file=FILENAME
--include-foreign-data-file=FILENAME
--include-schemas-file=FILENAME
--include-tables-file=FILENAME
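For illustration, usage could look like this (hypothetical file content and database name; one pattern per line, just as the corresponding command-line options take one pattern each):

$ cat tables.txt
customer_data_v1
customer_data_v2
$ pg_dump --include-tables-file=tables.txt mydb > dump.sql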
I invite any help with the doc. There is just very raw text.

Regards

Pavel
--
Justin
Attachment
On Mon, Jun 08, 2020 at 07:18:49PM +0200, Pavel Stehule wrote:
> pá 29. 5. 2020 v 20:25 odesílatel Justin Pryzby <pryzby@telsasoft.com> napsal:
> > On Fri, May 29, 2020 at 04:21:00PM +0200, Pavel Stehule wrote:
> > > one of my customers has to specify dumped tables name by name. After years of
> > > increasing database size and table count, he has a problem with the
> > > command-line length limit. He needs to read the list of tables from a file (or from stdin).
> >
> > +1 - we would use this.
> >
> > We put a regex (actually a pg_dump pattern) of tables to skip (timeseries
> > partitions which are older than a few days and which are also dumped once not
> > expected to change, and typically not redumped). We're nowhere near the
> > execve() limit, but it'd be nice if the command was primarily a list of options
> > and not a long regex.
> >
> > Please also support reading from file for --exclude-table=pattern.
> >
> > I'm drawing a parallel between this and rsync --include/--exclude and
> > --filter.
> >
> > We'd be implementing a new --filter, which might have similar syntax to rsync
> > (which I always forget).
>
> I implemented support for all "repeated" pg_dump options.
>
> I invite any help with the doc. There is just very raw text.
>
> + Do not dump data of tables spefified in file.

*specified

I still wonder if a better syntax would use a unified --filter option, whose
argument would allow including/excluding any type of object:

+[tn] include (t)table/(n)namespace/...
-[tn] exclude (t)table/(n)namespace/...

In the past, I looked for a way to exclude extended stats objects, and ended up
using a separate schema. An "extensible" syntax might be better (although
reading a file of just patterns has the advantage that the function can just be
called once for each option for each type of object).

--
Justin
po 8. 6. 2020 v 23:30 odesílatel Justin Pryzby <pryzby@telsasoft.com> napsal:
On Mon, Jun 08, 2020 at 07:18:49PM +0200, Pavel Stehule wrote:
> pá 29. 5. 2020 v 20:25 odesílatel Justin Pryzby <pryzby@telsasoft.com> napsal:
>
> > On Fri, May 29, 2020 at 04:21:00PM +0200, Pavel Stehule wrote:
> > > one of my customers has to specify dumped tables name by name. After years of
> > > increasing database size and table count, he has a problem with the
> > > command-line length limit. He needs to read the list of tables from a file (or from stdin).
> >
> > +1 - we would use this.
> >
> > We put a regex (actually a pg_dump pattern) of tables to skip (timeseries
> > partitions which are older than a few days and which are also dumped once not
> > expected to change, and typically not redumped). We're nowhere near the
> > execve() limit, but it'd be nice if the command was primarily a list of options
> > and not a long regex.
> >
> > Please also support reading from file for --exclude-table=pattern.
> >
> > I'm drawing a parallel between this and rsync --include/--exclude and
> > --filter.
> >
> > We'd be implementing a new --filter, which might have similar syntax to rsync
> > (which I always forget).
>
> I implemented support for all "repeated" pg_dump options.
>
> I invite any help with the doc. There is just very raw text.
>
> + Do not dump data of tables spefified in file.
*specified
I am sending an updated version - now with my own implementation of the GNU (not POSIX) function getline.
I still wonder if a better syntax would use a unified --filter option, whose
argument would allow including/excluding any type of object:
+[tn] include (t)table/(n)namespace/...
-[tn] exclude (t)table/(n)namespace/...
In the past, I looked for a way to exclude extended stats objects, and ended up
using a separate schema. An "extensible" syntax might be better (although
reading a file of just patterns has the advantage that the function can just be
called once for each option for each type of object).
I tried to implement the simple format "[+-][tndf] objectname".
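For illustration, a filter file in this format might look like this (hypothetical object names):

+t public.orders
-t public.orders_history
+n reporting
-d public.big_log
+f remote_server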
Please check the attached patch.
Regards
Pavel
--
Justin
Attachment
On Tue, Jun 09, 2020 at 11:46:24AM +0200, Pavel Stehule wrote:
> po 8. 6. 2020 v 23:30 odesílatel Justin Pryzby <pryzby@telsasoft.com> napsal:
> > I still wonder if a better syntax would use a unified --filter option, whose
> > argument would allow including/excluding any type of object:
> I tried to implement the simple format "[+-][tndf] objectname"

Thanks.

> + /* ignore empty rows */
> + if (*line != '\0')

Maybe: if line=='\0': continue

We should also support comments.

> + bool include_filter = false;
> + bool exclude_filter = false;

I think we only need one bool. You could call it: bool is_exclude = false

> +
> + if (chars < 4)
> + invalid_filter_format(optarg, line, lineno);

I think that check is too lax. I think it's ok if we require the first char
to be [-+] and the 2nd char to be [dntf]

> + objecttype = line[1];

... but I think this is inadequately "liberal in what it accepts"; I think it
should skip spaces. In my proposed scheme, someone might reasonably write:

> +
> + objectname = &line[3];
> +
> + /* skip initial spaces */
> + while (*objectname == ' ')
> + objectname++;

I suggest to use isspace()

I think we should check that *objectname != '\0', rather than chars>=4, above.

> + if (include_filter)
> + {
> + simple_string_list_append(&table_include_patterns, objectname);
> + dopt.include_everything = false;
> + }
> + else if (exclude_filter)
> + simple_string_list_append(&table_exclude_patterns, objectname);

If you use bool is_exclude, then this becomes "else" and you don't need to
think about checking if (!include && !exclude).

> + else if (objecttype == 'f')
> + {
> + if (include_filter)
> + simple_string_list_append(&foreign_servers_include_patterns, objectname);
> + else if (exclude_filter)
> + invalid_filter_format(optarg, line, lineno);
> + }

I would handle invalid object types as "else: invalid_filter_format()" here,
rather than duplicating above as: !=ALL('d','n','t','f')

> +
> + if (ferror(f))
> + fatal("could not read from file \"%s\": %s",
> + f == stdin ? "stdin" : optarg,
> + strerror(errno));

I think we're allowed to use %m here ?

> + printf(_(" --filter=FILENAME read object names from file\n"));

Object name filter expression, or something..

> + * getline is originaly GNU function, and should not be everywhere still.

originally

> + * Use own reduced implementation.

Did you "reduce" this from another implementation? Where?
What is its license ?

Maybe a line-reader already exists in the frontend (?) .. or maybe it should.

--
Justin
st 10. 6. 2020 v 0:30 odesílatel Justin Pryzby <pryzby@telsasoft.com> napsal:
On Tue, Jun 09, 2020 at 11:46:24AM +0200, Pavel Stehule wrote:
> po 8. 6. 2020 v 23:30 odesílatel Justin Pryzby <pryzby@telsasoft.com> napsal:
> I still wonder if a better syntax would use a unified --filter option, whose
> > argument would allow including/excluding any type of object:
> I tried to implement simple format "[+-][tndf] objectname"
I had another idea about the format - instead of using +-, we could use case-sensitive options similar to the pg_dump command line (extended with D and f, because those options don't exist in short form). So the format could look like
[tTnNDf] {objectname}
This format is simpler, and it can work. What do you think about it?
> Did you "reduce" this from another implementation? Where?
> What is its license ?
The code is 100% mine. It is not copied from gnulib, and everybody can simply check it.
"Reduced" in the functionality sense: there is no full argument checking as is necessary for the glibc functions, and there are no memory checks, because pg_malloc and pg_realloc are used.
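The shape of the implementation is roughly this (a simplified sketch of the idea, not the patch itself; pg_malloc/pg_realloc are the frontend allocators from common/fe_memutils.h, and real frontend code would include postgres_fe.h first):

    #include <stdio.h>
    #include <string.h>
    #include <sys/types.h>              /* ssize_t */
    #include "common/fe_memutils.h"     /* pg_malloc(), pg_realloc() */

    /*
     * Grow the caller's buffer as needed and keep it for the next call,
     * so most lines need no new allocation.  Returns the line length,
     * or -1 at EOF with nothing read.
     */
    static ssize_t
    pg_getline_sketch(char **lineptr, size_t *n, FILE *fp)
    {
        size_t      total = 0;

        if (*lineptr == NULL)
        {
            *n = 64;
            *lineptr = pg_malloc(*n);
        }

        for (;;)
        {
            if (fgets(*lineptr + total, *n - total, fp) == NULL)
                return total > 0 ? (ssize_t) total : -1;

            total += strlen(*lineptr + total);
            if ((*lineptr)[total - 1] == '\n' || feof(fp))
                return (ssize_t) total;

            /* the line did not fit: enlarge the buffer and read the rest */
            *n += 1024;
            *lineptr = pg_realloc(*lineptr, *n);
        }
    }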
On Wed, Jun 10, 2020 at 05:03:49AM +0200, Pavel Stehule wrote:
> st 10. 6. 2020 v 0:30 odesílatel Justin Pryzby <pryzby@telsasoft.com> napsal:
> > On Tue, Jun 09, 2020 at 11:46:24AM +0200, Pavel Stehule wrote:
> > > po 8. 6. 2020 v 23:30 odesílatel Justin Pryzby <pryzby@telsasoft.com> napsal:
> > > > I still wonder if a better syntax would use a unified --filter option, whose
> > > > argument would allow including/excluding any type of object:
> > > I tried to implement the simple format "[+-][tndf] objectname"
>
> I had another idea about the format - instead of using +-, we could use
> case-sensitive options similar to the pg_dump command line (extended with D
> and f, because those options don't exist in short form).
>
> So the format could look like
>
> [tTnNDf] {objectname}
>
> This format is simpler, and it can work. What do you think about it?

I prefer [-+][dtnf], which is similar to rsync --filter, and clear what it's
doing. I wouldn't put much weight on what the short options are.

I wonder if some people would want to be able to use *long* or short options:

-table foo
+schema baz

Or maybe:

exclude-table=foo
schema=bar

Some tools use "long options without leading dashes" as their configuration
file format. Examples: openvpn, mysql. So that could be a good option. OTOH,
there's only a few "keys", so I'm not sure how many people would want to
repeat them, if there's enough to bother putting them in the file rather than
the cmdline.

--
Justin
st 10. 6. 2020 v 0:30 odesílatel Justin Pryzby <pryzby@telsasoft.com> napsal:
On Tue, Jun 09, 2020 at 11:46:24AM +0200, Pavel Stehule wrote:
> po 8. 6. 2020 v 23:30 odesílatel Justin Pryzby <pryzby@telsasoft.com> napsal:
> I still wonder if a better syntax would use a unified --filter option, whose
> > argument would allow including/excluding any type of object:
> I tried to implement simple format "[+-][tndf] objectname"
Thanks.
> + /* ignore empty rows */
> + if (*line != '\0')
Maybe: if line=='\0': continue
ok
We should also support comments.
> + bool include_filter = false;
> + bool exclude_filter = false;
I think we only need one bool.
You could call it: bool is_exclude = false
ok
> +
> + if (chars < 4)
> + invalid_filter_format(optarg, line, lineno);
I think that check is too lax.
I think it's ok if we require the first char to be [-+] and the 2nd char to be
[dntf]
> + objecttype = line[1];
... but I think this is inadequately "liberal in what it accepts"; I think it
should skip spaces. In my proposed scheme, someone might reasonably write:
> +
> + objectname = &line[3];
> +
> + /* skip initial spaces */
> + while (*objectname == ' ')
> + objectname++;
I suggest to use isspace()
ok
I think we should check that *objectname != '\0', rather than chars>=4, above.
done
> + if (include_filter)
> + {
> + simple_string_list_append(&table_include_patterns, objectname);
> + dopt.include_everything = false;
> + }
> + else if (exclude_filter)
> + simple_string_list_append(&table_exclude_patterns, objectname);
If you use bool is_exclude, then this becomes "else" and you don't need to
think about checking if (!include && !exclude).
> + else if (objecttype == 'f')
> + {
> + if (include_filter)
> + simple_string_list_append(&foreign_servers_include_patterns, objectname);
> + else if (exclude_filter)
> + invalid_filter_format(optarg, line, lineno);
> + }
I would handle invalid object types as "else: invalid_filter_format()" here,
rather than duplicating above as: !=ALL('d','n','t','f')
good idea
> +
> + if (ferror(f))
> + fatal("could not read from file \"%s\": %s",
> + f == stdin ? "stdin" : optarg,
> + strerror(errno));
I think we're allowed to use %m here ?
changed
> + printf(_(" --filter=FILENAME read object names from file\n"));
Object name filter expression, or something..
Yes, it is not just object names now.
> + * getline is originaly GNU function, and should not be everywhere still.
originally
> + * Use own reduced implementation.
Did you "reduce" this from another implementation? Where?
What is its license ?
Maybe a line-reader already exists in the frontend (?) .. or maybe it should.
Everywhere else the function fgets is used. Currently pg_getline is used in only one place, so moving it to some common part may be premature.
Maybe it can be used as a replacement for some fgets calls, but that is a different topic, I think.
Thank you for the comments; an updated patch is attached.
Regards
Pavel
--
Justin
Attachment
On Thu, Jun 11, 2020 at 1:07 PM Pavel Stehule <pavel.stehule@gmail.com> wrote:
>
> Thank you for the comments; an updated patch is attached.
>

Few comments:

+invalid_filter_format(char *message, char *filename, char *line, int lineno)
+{
+    char *displayname;
+
+    displayname = *filename == '-' ? "stdin" : filename;
+
+    pg_log_error("invalid format of filter file \"%s\": %s",
+                 displayname,
+                 message);
+
+    fprintf(stderr, "%d: %s\n", lineno, line);
+    exit_nicely(1);
+}

I think fclose is missing here.

+    if (line[chars - 1] == '\n')
+        line[chars - 1] = '\0';

Should we check for '\r' also, to avoid failures on some platforms?

+    <varlistentry>
+     <term><option>--filter=<replaceable class="parameter">filename</replaceable></option></term>
+     <listitem>
+      <para>
+       Read filters from file. Format "(+|-)(tnfd) objectname:
+      </para>
+     </listitem>
+    </varlistentry>

I felt some documentation is missing here. We could mention that the options
tnfd control the table, schema, foreign-server-data and table-data exclude
patterns. Instead of using tnfd, if we could use the same options as the
existing pg_dump options it would be less confusing.

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com
On Thu, Jun 11, 2020 at 09:36:18AM +0200, Pavel Stehule wrote:
> st 10. 6. 2020 v 0:30 odesílatel Justin Pryzby <pryzby@telsasoft.com> napsal:
> > > + /* ignore empty rows */
> > > + if (*line != '\0')
> >
> > Maybe: if line=='\0': continue
> > We should also support comments.

Comment support is still missing but easily added :)

I tried this patch and it works for my purposes.

Also, your getline is dynamically re-allocating lines of arbitrary length.
Possibly that's not needed. We'll typically read "+t schema.relname", which is
at most 132 chars. Maybe it's sufficient to do

char buf[1024];
fgets(buf);
if strchr(buf, '\n') == NULL: error();
ret = pstrdup(buf);

In any case, you could have getline return a char* and (rather than following
GNU) no need to take char**, int* parameters to conflate inputs and outputs.

I realized that --filter has an advantage over the previous implementation
(with multiple --exclude-* and --include-*) in that it's possible to use stdin
for includes *and* excludes.

By chance, I had the opportunity yesterday to re-use with rsync a regex that
I'd previously been using with pg_dump and grep. What this patch calls
"--filter" in rsync is called "--filter-from". rsync's --filter-from rejects
filters of length longer than max filename, so I had to split it up into
multiple lines instead of using regex alternation ("|"). This option is a
close parallel in pg_dump.

--
Justin
so 27. 6. 2020 v 14:55 odesílatel vignesh C <vignesh21@gmail.com> napsal:
On Thu, Jun 11, 2020 at 1:07 PM Pavel Stehule <pavel.stehule@gmail.com> wrote:
>
> Thank you for comments, attached updated patch
>
Few comments:
+invalid_filter_format(char *message, char *filename, char *line, int lineno)
+{
+ char *displayname;
+
+ displayname = *filename == '-' ? "stdin" : filename;
+
+ pg_log_error("invalid format of filter file \"%s\": %s",
+ displayname,
+ message);
+
+ fprintf(stderr, "%d: %s\n", lineno, line);
+ exit_nicely(1);
+}
I think fclose is missing here.
done
+ if (line[chars - 1] == '\n')
+ line[chars - 1] = '\0';
Should we check for '\r' also, to avoid failures on some platforms?
I checked other usages of fgets in the Postgres source code, and everywhere the test is on \n.
When I did some quick research (https://stackoverflow.com/questions/12769289/carriage-return-by-fgets), the \r in this case should be thrown away by libc on Microsoft platforms.
\n is used on Mac OS X (since 2001); I am not sure that Mac OS 9, which used a bare \r, still needs to be supported.
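If we wanted to be defensive anyway, the trimming could look like this (a sketch only, reusing the patch's line/chars variables):

    /* hypothetical variant that also strips a carriage return */
    if (chars > 0 && line[chars - 1] == '\n')
        line[--chars] = '\0';
    if (chars > 0 && line[chars - 1] == '\r')
        line[--chars] = '\0';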
+    <varlistentry>
+     <term><option>--filter=<replaceable class="parameter">filename</replaceable></option></term>
+     <listitem>
+      <para>
+       Read filters from file. Format "(+|-)(tnfd) objectname:
+      </para>
+     </listitem>
+    </varlistentry>
I felt some documentation is missing here. We could mention that the options
tnfd control the table, schema, foreign-server-data and table-data exclude
patterns.
I have a plan to complete the doc when the design is settled. It was not clear whether people prefer long or short forms of the option names.
Instead of using tnfd, if we could use the same options as the existing
pg_dump options it would be less confusing.
It is almost the same:
+-t .. tables
+-n .. schemas
-d .. exclude table data .. there is no short pg_dump option for --exclude-table-data
+f .. include foreign data .. there is no short pg_dump option for --include-foreign-data
So there is still an open question: use the +-tnfd system, or a system based on long options?
table foo
exclude-table foo
schema xx
exclude-schema xx
include-foreign-data yyy
exclude-table-data zzz
Typically these files will be generated by scripts and processed via a pipe, so I see just two arguments, for and against:
short format - there is less probability of a typo (but there is no full consistency with the pg_dump options)
long format - it is self-documenting (and there is full consistency with pg_dump)
In this case I prefer the short form - it is more comfortable for users, and there are only a few variants, so it is not necessary to use a very verbose design. But my opinion is not aggressively strong, and I'll accept any common agreement.
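To make the comparison concrete, the same filter written in both forms (made-up names):

short form:

+t sales.orders
-n archive

long form:

table sales.orders
exclude-schema archive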
Regards
Updated patch attached
Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com
Attachment
st 1. 7. 2020 v 23:24 odesílatel Justin Pryzby <pryzby@telsasoft.com> napsal:
On Thu, Jun 11, 2020 at 09:36:18AM +0200, Pavel Stehule wrote:
> st 10. 6. 2020 v 0:30 odesílatel Justin Pryzby <pryzby@telsasoft.com> napsal:
> > > + /* ignore empty rows */
> > > + if (*line != '\0')
> >
> > Maybe: if line=='\0': continue
> > We should also support comments.
Comment support is still missing but easily added :)
I tried this patch and it works for my purposes.
Also, your getline is dynamically re-allocating lines of arbitrary length.
Possibly that's not needed. We'll typically read "+t schema.relname", which is
at most 132 chars. Maybe it's sufficient to do
char buf[1024];
fgets(buf);
if strchr(buf, '\n') == NULL: error();
ret = pstrdup(buf);
63 bytes is the max effective identifier size, but it is not the max size of identifiers. Very probably a buffer of 1024 bytes would be enough for everyone, but I do not want to introduce a new magic limit, especially when the dynamic implementation is not too hard.
A table name can be very long - sometimes data names (table names) are stored in external storage at full length, and it would not be practical to require truncating them in the filter file.
For this case it is very efficient, because the resized (increased) buffer is reused for the following rows, so reallocs should be rare. So when I have to choose between two implementations of similar complexity, I prefer the more dynamic code without hardcoded limits. The dynamic approach has practically no overhead.
In any case, you could have getline return a char* and (rather than following
GNU) no need to take char**, int* parameters to conflate inputs and outputs.
No, it has a special benefit: it eliminates a short malloc/free cycle per line. When some line is longer, the buffer is increased (and stays increased), and for other rows of the same or smaller size no realloc is necessary.
I realized that --filter has an advantage over the previous implementation
(with multiple --exclude-* and --include-*) in that it's possible to use stdin
for includes *and* excludes.
Yes, it looks like the better choice.
By chance, I had the opportunity yesterday to re-use with rsync a regex that
I'd previously been using with pg_dump and grep. What this patch calls
"--filter" in rsync is called "--filter-from". rsync's --filter-from rejects
filters of length longer than max filename, so I had to split it up into
multiple lines instead of using regex alternation ("|"). This option is a
close parallel in pg_dump.
We can talk about the option name - maybe "--filter-from" is better than just "--filter".
Regards
Pavel
--
Justin
On Wed, Jul 01, 2020 at 04:24:52PM -0500, Justin Pryzby wrote:
> On Thu, Jun 11, 2020 at 09:36:18AM +0200, Pavel Stehule wrote:
> > st 10. 6. 2020 v 0:30 odesílatel Justin Pryzby <pryzby@telsasoft.com> napsal:
> > > > + /* ignore empty rows */
> > > > + if (*line != '\0')
> > >
> > > Maybe: if line=='\0': continue
> > > We should also support comments.
>
> Comment support is still missing but easily added :)

Still missing from the latest patch.

With some added documentation, I think this can be RfC.

--
Justin
ne 5. 7. 2020 v 22:31 odesílatel Justin Pryzby <pryzby@telsasoft.com> napsal:
On Wed, Jul 01, 2020 at 04:24:52PM -0500, Justin Pryzby wrote:
> On Thu, Jun 11, 2020 at 09:36:18AM +0200, Pavel Stehule wrote:
> > st 10. 6. 2020 v 0:30 odesílatel Justin Pryzby <pryzby@telsasoft.com> napsal:
> > > > + /* ignore empty rows */
> > > > + if (*line != '\0')
> > >
> > > Maybe: if line=='\0': continue
> > > We should also support comments.
>
> Comment support is still missing but easily added :)
Still missing from the latest patch.
I can implement comment support, but I am not sure about the format. The start can be "--" or the classic #,
but "--" can be messy in this context.
With some added documentation, I think this can be RfC.
--
Justin
ne 5. 7. 2020 v 22:37 odesílatel Pavel Stehule <pavel.stehule@gmail.com> napsal:
ne 5. 7. 2020 v 22:31 odesílatel Justin Pryzby <pryzby@telsasoft.com> napsal:
On Wed, Jul 01, 2020 at 04:24:52PM -0500, Justin Pryzby wrote:
> On Thu, Jun 11, 2020 at 09:36:18AM +0200, Pavel Stehule wrote:
> > st 10. 6. 2020 v 0:30 odesílatel Justin Pryzby <pryzby@telsasoft.com> napsal:
> > > > + /* ignore empty rows */
> > > > + if (*line != '\0')
> > >
> > > Maybe: if line=='\0': continue
> > > We should also support comments.
>
> Comment support is still missing but easily added :)
Still missing from the latest patch.

I can implement comment support, but I am not sure about the format. The start
can be "--" or the classic #, but "--" can be messy in this context.
Here is support for comment lines - the first char should be #.
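For example, a filter file can now look like this (made-up names):

# tables for the nightly snapshot
+t public.orders
+t public.order_lines
# skip the bulky history
-t public.orders_history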
Regards
Pavel
With some added documentation, I think this can be RfC.
--
Justin
Attachment
On Mon, Jul 06, 2020 at 06:34:15AM +0200, Pavel Stehule wrote:
> >> > Comment support is still missing but easily added :)
> >> Still missing from the latest patch.
> >
> > I can implement comment support, but I am not sure about the format. The
>
> Here is support for comment lines - the first char should be #.

Thanks, that's how I assumed it would look.

> >> With some added documentation, I think this can be RfC.

Do you want to add any more documentation ?

Few more things:

> +exit_invalid_filter_format(FILE *fp, char *filename, char *message, char *line, int lineno)
> +{
> + pg_log_error("invalid format of filter file \"%s\": %s",
> + *filename == '-' ? "stdin" : filename,
> + message);

You refer to as "stdin" any filename beginning with -.
I think you could just print "-" and not "stdin".
In any case, *filename=='-' is wrong since it only checks filename[0].
In a few places you compare ==stdin (which is right).

Also, I think "f" isn't as good a name as "fp".

You're adding 139 lines to a 600 line main(), and no other option is more than
15 lines. Would you put it in a separate function ?

--
Justin
On Mon, Jul 6, 2020 at 10:05 AM Pavel Stehule <pavel.stehule@gmail.com> wrote:
>
> Here is support for comment lines - the first char should be #.
>

Few comments:

+            str = fgets(*lineptr + total_chars,
+                        *n - total_chars,
+                        fp);
+
+            if (ferror(fp))
+                return -1;

Should we include an error message in the above case?

+            else
+                break;
+        }
+
+    if (ferror(fp))
+        return -1;

Similar to above.

+        /* check, if there is good enough space for next content */
+        if (*n - total_chars < 2)
+        {
+            *n += 1024;
+            *lineptr = pg_realloc(*lineptr, *n);
+        }

We could use a macro in place of 1024.

+            if (objecttype == 't')
+            {
+                if (is_include)
+                {
+                    simple_string_list_append(&table_include_patterns,
+                                              objectname);
+                    dopt.include_everything = false;
+                }
+                else
+                    simple_string_list_append(&table_exclude_patterns,
+                                              objectname);
+            }
+            else if (objecttype == 'n')
+            {
+                if (is_include)
+                {
+                    simple_string_list_append(&schema_include_patterns,
+                                              objectname);
+                    dopt.include_everything = false;
+                }
+                else
+                    simple_string_list_append(&schema_exclude_patterns,
+                                              objectname);
+            }

Some of the above code is repetitive; can the common code be made into a
macro and called?

    printf(_("  --extra-float-digits=NUM     override default setting for extra_float_digits\n"));
+   printf(_("  --filter=FILENAME            read object name filter expressions from file\n"));
    printf(_("  --if-exists                  use IF EXISTS when dropping objects\n"));

Can this be changed to dump objects and data based on the filter
expressions from the filter file.

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com
ne 12. 7. 2020 v 1:06 odesílatel Justin Pryzby <pryzby@telsasoft.com> napsal:
On Mon, Jul 06, 2020 at 06:34:15AM +0200, Pavel Stehule wrote:
> >> > Comment support is still missing but easily added :)
> >> Still missing from the latest patch.
> >
> > I can implement a comment support. But I am not sure about the format. The
>
> here is support for comment's line - first char should be #
Thanks, that's how I assumed it would look.
> >> With some added documentation, I think this can be RfC.
Do you want to add any more documentation ?
done
Few more things:
> +exit_invalid_filter_format(FILE *fp, char *filename, char *message, char *line, int lineno)
> +{
> + pg_log_error("invalid format of filter file \"%s\": %s",
> + *filename == '-' ? "stdin" : filename,
> + message);
You refer to as "stdin" any filename beginning with -.
I think you could just print "-" and not "stdin".
In any case, *filename=='-' is wrong since it only checks filename[0].
In a few places you compare ==stdin (which is right).
done
Also, I think "f" isn't as good a name as "fp".
done
You're adding 139 lines to a 600 line main(), and no other option is more than
15 lines. Would you put it in a separate function ?
done
Please check the attached patch.
Regards
Pavel
--
Justin
Attachment
ne 12. 7. 2020 v 3:43 odesílatel vignesh C <vignesh21@gmail.com> napsal:
On Mon, Jul 6, 2020 at 10:05 AM Pavel Stehule <pavel.stehule@gmail.com> wrote:
>
> Here is support for comment lines - the first char should be #.
>
Few comments:
+ str = fgets(*lineptr + total_chars,
+ *n - total_chars,
+ fp);
+
+ if (ferror(fp))
+ return -1;
Should we include an error message in the above case?
+ else
+ break;
+ }
+
+ if (ferror(fp))
+ return -1;
Similar to above.
It should be OK - both variants finish with:

    if (ferror(fp))
        fatal("could not read from file \"%s\": %m", filename);

%m should print the related error message.
+        /* check, if there is good enough space for next content */
+        if (*n - total_chars < 2)
+        {
+            *n += 1024;
+            *lineptr = pg_realloc(*lineptr, *n);
+        }
We could use a macro in place of 1024.
done
+            if (objecttype == 't')
+            {
+                if (is_include)
+                {
+                    simple_string_list_append(&table_include_patterns,
+                                              objectname);
+                    dopt.include_everything = false;
+                }
+                else
+                    simple_string_list_append(&table_exclude_patterns,
+                                              objectname);
+            }
+            else if (objecttype == 'n')
+            {
+                if (is_include)
+                {
+                    simple_string_list_append(&schema_include_patterns,
+                                              objectname);
+                    dopt.include_everything = false;
+                }
+                else
+                    simple_string_list_append(&schema_exclude_patterns,
+                                              objectname);
+            }
Some of the above code is repetitive; can the common code be made into a
macro and called?
There are two identical fragments and two different fragments. In this case I don't think a macro or an auxiliary function would help readability. The current code is well structured and readable.
printf(_(" --extra-float-digits=NUM override default
setting for extra_float_digits\n"));
+ printf(_(" --filter=FILENAME read object name
filter expressions from file\n"));
printf(_(" --if-exists use IF EXISTS when
dropping objects\n"));
Can this be changed to dump objects and data based on the filter
expressions from the filter file.
I am sorry, I don't understand. This should work for data specified by the filter without any modification.
Attached is an updated patch.
Regards
Pavel
Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com
Attachment
> On 13 Jul 2020, at 10:20, Pavel Stehule <pavel.stehule@gmail.com> wrote:
> Attached is an updated patch.

Sorry for jumping in late, but thinking about this extension to pg_dump:
doesn't it make more sense to use an existing file format like JSON for this,
given that virtually all devops/cd/etc tooling know about JSON already?

Considering its application and the problem space, I'd expect users to
generate this file rather than handcraft it with 10 rows of content, and
standard file formats help there. Creative users could even use the database
itself to easily manage its content and generate the file (which isn't limited
to JSON of course, but it would be easier). Also, we now have backup manifests
in JSON, which IMO sets a bit of a precedent, even though that's a separate
thing.

At the very least it seems limiting to not include a file format version
identifier, since we'd otherwise risk running into backwards compat issues
should we want to expand on this in the future.

cheers ./daniel
po 13. 7. 2020 v 12:04 odesílatel Daniel Gustafsson <daniel@yesql.se> napsal:
> On 13 Jul 2020, at 10:20, Pavel Stehule <pavel.stehule@gmail.com> wrote:
> Attached is an updated patch.
Sorry for jumping in late, but thinking about this extension to pg_dump:
doesn't it make more sense to use an existing file format like JSON for this,
given that virtually all devops/cd/etc tooling know about JSON already?
Considering its application and the problem space, I'd expect users to generate
this file rather than handcraft it with 10 rows of content, and standard file
formats help there. Creative users could even use the database itself to
easily manage its content and generate the file (which isn't limited to JSON of
course, but it would be easier). Also, we now have backup manifests in JSON
which IMO sets a bit of a precedent, even though thats a separate thing.
At the very least it seems limiting to not include a file format version
identifier since we'd otherwise risk running into backwards compat issues
should we want to expand on this in the future.
I like the JSON format, but why here? For this purpose JSON is over-engineered. This input file has no nested structure - it is just a stream of lines.
I don't think introducing JSON here is a good idea. Typical usage of this feature will be in a pipe, and the simplest possible format is ideal.
It is a really different case from the backup manifest file - in this case, pg_dump is the consumer.
Regards
Pavel
cheers ./daniel
On Mon, Jul 13, 2020 at 1:51 PM Pavel Stehule <pavel.stehule@gmail.com> wrote:
>> Can this be changed to dump objects and data based on the filter
>> expressions from the filter file.
>
> I am sorry, I don't understand. This should work for data specified by the filter without any modification.

I meant can this:

printf(_("  --filter=FILENAME            read object name filter expressions from file\n"));

be changed to:

printf(_("  --filter=FILENAME            dump objects and data based on the filter expressions from the filter file\n"));

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com
On Mon, Jul 13, 2020 at 12:04:09PM +0200, Daniel Gustafsson wrote:
> Sorry for jumping in late, but thinking about this extension to pg_dump:
> doesn't it make more sense to use an existing file format like JSON for this,
> given that virtually all devops/cd/etc tooling know about JSON already?
>
> Considering its application and the problem space, I'd expect users to
> generate this file rather than handcraft it with 10 rows of content, and
> standard file formats help there.

I mentioned having tested this patch as we would use it. But it's likely I
*wouldn't* use it if the format was something which required added complexity
to pipe in from an existing shell script.

> At the very least it seems limiting to not include a file format version
> identifier, since we'd otherwise risk running into backwards compat issues
> should we want to expand on this in the future.

Maybe .. I'm not sure. The patch would of course be extended to handle
additional include/exclude options. Is there any other future behavior we
might reasonably anticipate ?

If at some point we wanted to support another file format, maybe it would
look like: --format=v2:filters.txt (or maybe the old one would be
v1:filters.txt)

--
Justin
> On 13 Jul 2020, at 13:02, Pavel Stehule <pavel.stehule@gmail.com> wrote:
> I like the JSON format, but why here? For this purpose JSON is over-engineered.

I respectfully disagree, JSON is a commonly used and known format in systems
administration and most importantly: we already have code to parse it in the
frontend.

> This input file has no nested structure - it is just a stream of lines.

Well, it has a set of object types which in turn have objects. There is more
structure than meets the eye.

Also, the current patch allows arbitrary whitespace before object names, but no
whitespace before comments etc. Using something where the rules of parsing are
known is rarely a bad thing.

> I don't think introducing JSON here is a good idea.

Quite possibly it isn't, but not discussing options seems like a worse idea so
I wanted to bring it up.

> It is a really different case from the backup manifest file - in this case, pg_dump is the consumer.

Right, as I said these are two different, while tangentially related, things.

cheers ./daniel
po 13. 7. 2020 v 16:57 odesílatel Daniel Gustafsson <daniel@yesql.se> napsal:
> On 13 Jul 2020, at 13:02, Pavel Stehule <pavel.stehule@gmail.com> wrote:
> I like the JSON format, but why here? For this purpose JSON is over-engineered.
I respectfully disagree, JSON is a commonly used and known format in systems
administration and most importantly: we already have code to parse it in the
frontend.
I disagree with the idea that if we have a client-side JSON parser, we have to use it everywhere.
For this case, parsing JSON means more code, not less. I checked parse_manifest.c. Moreover, the JSON API is DOM style; for this purpose a SAX style would be better. But still, things should be as simple as possible.
There is no necessity to use it.
JSON is good for a lot of purposes, and it can be good when a document uses more token types (numeric, ...). But nothing like that is used here.
> This input file has no nested structure - it is just a stream of lines.
Well, it has a set of object types which in turn have objects. There is more
structure than meets the eye.
Also, the current patch allows arbitrary whitespace before object names, but no
whitespace before comments etc. Using something where the rules of parsing are
known is rarely a bad thing.
As far as I know, JSON has no comments at all.
> I don't think introducing JSON here is a good idea.
Quite possibly it isn't, but not discussing options seems like a worse idea so
I wanted to bring it up.
> It is a really different case from the backup manifest file - in this case, pg_dump is the consumer.
Right, as I said these are two different, while tangentially related, things.
The backup manifest format has nontrivial complexity - there, using JSON makes sense. The input filter file is trivial - a +/- list of strings (and it always will be).
In this case I don't see any benefit from JSON on either side (producer, consumer). It is (a little bit) harder to parse, and (a little bit) harder to generate.
Regards
Pavel
cheers ./daniel
On Mon, Jul 13, 2020 at 08:15:42AM +0200, Pavel Stehule wrote:
> > Do you want to add any more documentation ?
>
> done

Thanks - I think the documentation was maybe excessive. See attached.

--
Justin
Attachment
po 13. 7. 2020 v 20:00 odesílatel Justin Pryzby <pryzby@telsasoft.com> napsal:
On Mon, Jul 13, 2020 at 08:15:42AM +0200, Pavel Stehule wrote:
> > Do you want to add any more documentation ?
> >
>
> done
Thanks - I think the documentation was maybe excessive. See attached.
I merged your patch - thank you
A new patch with the doc changes and the help-text change requested by Vignesh is attached.
Regards
Pavel
--
Justin
Attachment
po 13. 7. 2020 v 15:33 odesílatel vignesh C <vignesh21@gmail.com> napsal:
On Mon, Jul 13, 2020 at 1:51 PM Pavel Stehule <pavel.stehule@gmail.com> wrote:
>> Can this be changed to dump objects and data based on the filter
>> expressions from the filter file.
>
>
> I am sorry, I don't understand. This should work for data specified by the filter without any modification.
>
I meant can this:

printf(_("  --filter=FILENAME            read object name filter expressions from file\n"));

be changed to:

printf(_("  --filter=FILENAME            dump objects and data based on the filter expressions from the filter file\n"));
Done in today's patch.
Pavel
Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com
On Tue, Jul 14, 2020 at 12:03 PM Pavel Stehule <pavel.stehule@gmail.com> wrote:
>> I meant can this:
>> printf(_("  --filter=FILENAME            read object name filter expressions from file\n"));
>> be changed to:
>> printf(_("  --filter=FILENAME            dump objects and data based on the filter expressions from the filter file\n"));
>
> Done in today's patch.

Thanks for fixing the comments.

Few comments:

+    /* use "-" as symbol for stdin */
+    if (strcmp(filename, "-") != 0)
+    {
+        fp = fopen(filename, "r");
+        if (!fp)
+            fatal("could not open the input file \"%s\": %m",
+                  filename);
+    }
+    else
+        fp = stdin;

We could use STDIN itself instead of "-"; it would be an easier option to
understand.

+    /* when first char is hash, ignore whole line */
+    if (*line == '#')
+        continue;

If a line starts with # we ignore that line; I feel this should be
included in the documentation.

Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com
On Sat, Jul 25, 2020 at 06:56:31PM +0530, vignesh C wrote:
> On Tue, Jul 14, 2020 at 12:03 PM Pavel Stehule <pavel.stehule@gmail.com> wrote:
> >> I meant can this:
> >> printf(_("  --filter=FILENAME            read object name filter expressions from file\n"));
> >> be changed to:
> >> printf(_("  --filter=FILENAME            dump objects and data based on the filter expressions from the filter file\n"));
> >
> > Done in today's patch.
>
> Thanks for fixing the comments.
> Few comments:
> +    /* use "-" as symbol for stdin */
> +    if (strcmp(filename, "-") != 0)
> +    {
> +        fp = fopen(filename, "r");
> +        if (!fp)
> +            fatal("could not open the input file \"%s\": %m",
> +                  filename);
> +    }
> +    else
> +        fp = stdin;
>
> We could use STDIN itself instead of "-"; it would be an easier option to
> understand.

I think "-" is used widely for commandline tools, and STDIN is not (even
though it's commonly used by programmers). For example, since last year,
pg_restore -f - means stdout.

--
Justin
so 25. 7. 2020 v 15:26 odesílatel vignesh C <vignesh21@gmail.com> napsal:
On Tue, Jul 14, 2020 at 12:03 PM Pavel Stehule <pavel.stehule@gmail.com> wrote:
>> I meant can this:
>> printf(_(" --filter=FILENAME read object name filter
>> expressions from file\n"));
>> be changed to:
>> printf(_(" --filter=FILENAME dump objects and data based
>> on the filter expressions from the filter file\n"));
>
> done in today patch
>
Thanks for fixing the comments.
Few comments:
+ /* use "-" as symbol for stdin */
+ if (strcmp(filename, "-") != 0)
+ {
+ fp = fopen(filename, "r");
+ if (!fp)
+ fatal("could not open the input file \"%s\": %m",
+ filename);
+ }
+ else
+ fp = stdin;
We could use STDIN itself instead of "-"; it would be an easier
option to understand.
+ /* when first char is hash, ignore whole line */
+ if (*line == '#')
+ continue;
If a line starts with # we ignore that line; I feel this should be
included in the documentation.
Good note - I added a sentence to the doc:
+   <para>
+    Lines starting with the symbol <literal>#</literal> are ignored.
+    Preceding whitespace characters (spaces, tabs) are not allowed. These
+    lines can be used for comments and notes.
+   </para>
+
Regards,
Vignesh
EnterpriseDB: http://www.enterprisedb.com
Attachment
ne 26. 7. 2020 v 21:10 odesílatel Justin Pryzby <pryzby@telsasoft.com> napsal:
On Sat, Jul 25, 2020 at 06:56:31PM +0530, vignesh C wrote:
> On Tue, Jul 14, 2020 at 12:03 PM Pavel Stehule <pavel.stehule@gmail.com> wrote:
> >> I meant can this:
> >> printf(_(" --filter=FILENAME read object name filter
> >> expressions from file\n"));
> >> be changed to:
> >> printf(_(" --filter=FILENAME dump objects and data based
> >> on the filter expressions from the filter file\n"));
> >
> > done in today patch
>
> Thanks for fixing the comments.
> Few comments:
> + /* use "-" as symbol for stdin */
> + if (strcmp(filename, "-") != 0)
> + {
> + fp = fopen(filename, "r");
> + if (!fp)
> + fatal("could not open the input file \"%s\": %m",
> + filename);
> + }
> + else
> + fp = stdin;
>
> We could use STDIN itself instead of "-"; it would be an easier
> option to understand.
I think "-" is used widely for commandline tools, and STDIN is not (even though
it's commonly used by programmers). For example, since last year, pg_restore
-f - means stdout.
Yes, STDIN is used in programming languages, but it is not usual in command-line tools. And because "-" was already used by pg_restore, we should not introduce a new inconsistency.
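For example (hypothetical invocation):

pg_dump --filter=- mydb < filter.txt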
Regards
Pavel
--
Justin
On Mon, Jul 27, 2020 at 07:25:54AM +0200, Pavel Stehule wrote:
> so 25. 7. 2020 v 15:26 odesílatel vignesh C <vignesh21@gmail.com> napsal:
> > On Tue, Jul 14, 2020 at 12:03 PM Pavel Stehule <pavel.stehule@gmail.com> wrote:
> > >> I meant can this:
> > >> printf(_("  --filter=FILENAME            read object name filter expressions from file\n"));
> > >> be changed to:
> > >> printf(_("  --filter=FILENAME            dump objects and data based on the filter expressions from the filter file\n"));
> > >
> > > Done in today's patch.

This looks good to my eyes - marked RFC.

--
Justin
On Sun, Jul 05, 2020 at 10:08:09PM +0200, Pavel Stehule wrote:
> st 1. 7. 2020 v 23:24 odesílatel Justin Pryzby <pryzby@telsasoft.com> napsal:
> > On Thu, Jun 11, 2020 at 09:36:18AM +0200, Pavel Stehule wrote:
> > > st 10. 6. 2020 v 0:30 odesílatel Justin Pryzby <pryzby@telsasoft.com> napsal:
> > Also, your getline is dynamically re-allocating lines of arbitrary length.
> > Possibly that's not needed. We'll typically read "+t schema.relname", which is
> > at most 132 chars. Maybe it's sufficient to do
> > char buf[1024];
> > fgets(buf);
> > if strchr(buf, '\n') == NULL: error();
> > ret = pstrdup(buf);
>
> 63 bytes is the max effective identifier size, but it is not the max size of
> identifiers. Very probably a buffer of 1024 bytes would be enough for
> everyone, but I do not want to introduce a new magic limit, especially when
> the dynamic implementation is not too hard.

Maybe you'd want to use a StringInfo like recent patches (8f8154a50).

> A table name can be very long - sometimes data names (table names) are stored
> in external storage at full length, and it would not be practical to require
> truncating them in the filter file.
>
> For this case it is very efficient, because the resized (increased) buffer is
> reused for the following rows, so reallocs should be rare. So when I have to
> choose between two implementations of similar complexity, I prefer the more
> dynamic code without hardcoded limits. The dynamic approach has practically
> no overhead.

--
Justin
On 2020-Jul-27, Pavel Stehule wrote:

> +/*
> + * getline is originally GNU function, and should not be everywhere still.
> + * Use own reduced implementation.
> + */
> +static size_t
> +pg_getline(char **lineptr, size_t *n, FILE *fp)
> +{

So, Tom added a coding pattern for doing this in commit 8f8154a503c7,
which is ostensibly also to be used in pg_regress [1] -- maybe it'd be
useful to have this in src/common?

[1] https://postgr.es/m/m_1NfbowTqSJnrC6rq1a9cQK7E-CHQE7B6Kz9w6fNH-OiV-4mcsdMw7UP2oA2_6dZmXvAMjbSPZjW9U7FD2R52D3d9DtaJxcBprsqJqZNBc=@protonmail.com

--
Álvaro Herrera https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services
Alvaro Herrera <alvherre@2ndquadrant.com> writes:
> So, Tom added a coding pattern for doing this in commit 8f8154a503c7,
> which is ostensibly also to be used in pg_regress [1] -- maybe it'd be
> useful to have this in src/common?

Done, see pg_get_line() added by 67a472d71.

regards, tom lane
pá 4. 9. 2020 v 2:15 odesílatel Tom Lane <tgl@sss.pgh.pa.us> napsal:
Alvaro Herrera <alvherre@2ndquadrant.com> writes:
> So, Tom added a coding pattern for doing this in commit 8f8154a503c7,
> which is ostensibly also to be used in pg_regress [1] -- maybe it'd be
> useful to have this in src/common?
Done, see pg_get_line() added by 67a472d71.
Here is updated patch for pg_dump
Regards
Pavel
regards, tom lane
Attachment
pá 4. 9. 2020 v 5:21 odesílatel Pavel Stehule <pavel.stehule@gmail.com> napsal:
pá 4. 9. 2020 v 2:15 odesílatel Tom Lane <tgl@sss.pgh.pa.us> napsal:
Alvaro Herrera <alvherre@2ndquadrant.com> writes:
> So, Tom added a coding pattern for doing this in commit 8f8154a503c7,
> which is ostensibly also to be used in pg_regress [1] -- maybe it'd be
> useful to have this in src/common?
Done, see pg_get_line() added by 67a472d71.

Here is updated patch for pg_dump
Another update, based on the pg_get_line_append function.
Regards
Pavel
regards, tom lane
Attachment
Hi Pavel
On Fri, Sep 4, 2020 at 6:22 AM Pavel Stehule <pavel.stehule@gmail.com> wrote:
Here is updated patch for pg_dump
pg_dumpall also has --exclude-database=pattern and --no-comments options; doesn't that qualify it to benefit from this feature? And please add a test case for this option.
regards
Surafel
Hi
po 7. 9. 2020 v 14:14 odesílatel Surafel Temesgen <surafel3000@gmail.com> napsal:
Hi Pavel
On Fri, Sep 4, 2020 at 6:22 AM Pavel Stehule <pavel.stehule@gmail.com> wrote:
Here is updated patch for pg_dump
pg_dumpall also has --exclude-database=pattern and --no-comments options; doesn't that qualify it to benefit from this feature? And please add a test case for this option.
This patch is related to pg_dump (at this moment), so the pg_dumpall options are out of scope.
I am not sure if pg_dumpall needs this functionality - maybe, but bash or something similar can be used to implement this feature. There is no requirement to do all the necessary work under one transaction, one snapshot.
For pg_dumpall a different format could be used, because it works at a different granularity - something like "{+/-} dbname".
"--no-comments" is a global parameter without arguments, so I don't understand how this parameter is related to this feature.
I am working on regress tests.
Regards
Pavel
regards
Surafel
Hi
pá 11. 9. 2020 v 10:50 odesílatel Pavel Stehule <pavel.stehule@gmail.com> napsal:
Hi

po 7. 9. 2020 v 14:14 odesílatel Surafel Temesgen <surafel3000@gmail.com> napsal:
Hi Pavel
On Fri, Sep 4, 2020 at 6:22 AM Pavel Stehule <pavel.stehule@gmail.com> wrote:
Here is updated patch for pg_dump
pg_dumpall also has --exclude-database=pattern and --no-comments options; doesn't that qualify it to benefit from this feature? And please add a test case for this option.
This patch is related to pg_dump (at this moment), so the pg_dumpall options are out of scope.

I am not sure if pg_dumpall needs this functionality - maybe, but bash or something similar can be used to implement this feature. There is no requirement to do all the necessary work under one transaction, one snapshot.

For pg_dumpall a different format could be used, because it works at a different granularity - something like "{+/-} dbname".

"--no-comments" is a global parameter without arguments, so I don't understand how this parameter is related to this feature.

I am working on regress tests.
Here is an updated version with regression tests.
Regards
Pavel
regards
Surafel
Attachment
Hi
rebase + a minor change - using pg_get_line_buf instead of pg_get_line_append
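The reading loop now looks roughly like this (a simplified sketch, not the patch itself; it assumes pg_get_line_buf() refills the caller's StringInfo, returns false at EOF, and, like fgets, keeps the trailing newline; real frontend code would include postgres_fe.h first):

    #include <stdio.h>
    #include "common/string.h"      /* pg_get_line_buf() */
    #include "lib/stringinfo.h"

    static void
    read_filters_sketch(FILE *fp)
    {
        StringInfoData line;

        initStringInfo(&line);
        while (pg_get_line_buf(fp, &line))  /* false at EOF */
        {
            /* strip the trailing newline, if any */
            if (line.len > 0 && line.data[line.len - 1] == '\n')
                line.data[--line.len] = '\0';

            if (line.data[0] == '\0' || line.data[0] == '#')
                continue;           /* empty line or comment */

            /* ... parse "[+-][tndf] objectname" here ... */
        }
        pfree(line.data);
    }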
Regards
Pavel
Attachment
Greetings,

* Pavel Stehule (pavel.stehule@gmail.com) wrote:
> rebase + a minor change - using pg_get_line_buf instead of pg_get_line_append

I started looking at this and went back through the thread and while I
tend to agree that JSON may not be a good choice for this, it's not the
only possible alternative. There is no doubt that pg_dump is already a
sophisticated data export tool, and likely to continue to gain new
features, such that having a configuration file for it would be very
handy, but this clearly isn't really going in a direction that would
allow for that.

Perhaps this feature could co-exist with a full blown configuration for
pg_dump, but even then there's certainly issues with what's proposed-
how would you handle explicitly asking for a table which is named
" mytable" to be included or excluded? Or a table which has a newline
in it? Using a standardized format which supports the full range of
what we do in a table name, explicitly and clearly, would address these
issues and also give us the flexibility to extend the options which
could be used through the configuration file beyond just the filters in
the future.

Unlike for the pg_basebackup manifest, which we generate and read
entirely programmatically, a config file for pg_dump would almost
certainly be updated manually (or, at least, parts of it would be and
perhaps other parts generated), which means it'd really be ideal to have
a proper way to support comments in it (something that the proposed
format also doesn't really get right- # must be the *first* character,
and you can only have whole-line comments..?), avoid extra unneeded
punctuation (or, at times, allow it- such as trailing commas in lists),
cleanly handle multi-line strings (consider the oft discussed idea
around having pg_dump support a WHERE clause for exporting data from
tables...), etc.

Overall, -1 from me on this approach. Maybe it could be fixed up to
handle all the different names of objects that we support today
(something which, imv, is really a clear requirement for this feature to
be committed), but I suspect you'd end up half-way to yet another
configuration format when we could be working to support something like
TOML or maybe YAML... but if you want my 2c, TOML seems closer to what
we do for postgresql.conf and getting that over to something that's
standardized, while a crazy long shot, is a general nice idea, imv.

Thanks,

Stephen
Attachment
Hi
út 10. 11. 2020 v 21:09 odesílatel Stephen Frost <sfrost@snowman.net> napsal:
Greetings,
* Pavel Stehule (pavel.stehule@gmail.com) wrote:
> rebase + a minor change - using pg_get_line_buf instead of pg_get_line_append
I started looking at this and went back through the thread and while I
tend to agree that JSON may not be a good choice for this, it's not the
only possible alternative. There is no doubt that pg_dump is already a
sophisticated data export tool, and likely to continue to gain new
features, such that having a configuration file for it would be very
handy, but this clearly isn't really going in a direction that would
allow for that.
Perhaps this feature could co-exist with a full blown configuration for
pg_dump, but even then there's certainly issues with what's proposed-
how would you handle explicitly asking for a table which is named
" mytable" to be included or excluded? Or a table which has a newline
in it? Using a standardized format which supports the full range of
what we do in a table name, explicitly and clearly, would address these
issues and also give us the flexibility to extend the options which
could be used through the configuration file beyond just the filters in
the future.
This is the correct argument - I will check the possibility of using strange names, but there is the same possibility and functionality as we allow from the command line. So you can use double-quoted names. I'll check it.
Unlike for the pg_basebackup manifest, which we generate and read
entirely programmatically, a config file for pg_dump would almost
certainly be updated manually (or, at least, parts of it would be and
perhaps other parts generated), which means it'd really be ideal to have
a proper way to support comments in it (something that the proposed
format also doesn't really get right- # must be the *first* character,
and you can only have whole-line comments..?), avoid extra unneeded
punctuation (or, at times, allow it- such as trailing commas in lists),
cleanly handle multi-line strings (consider the oft discussed idea
around having pg_dump support a WHERE clause for exporting data from
tables...), etc.
I think the proposed feature is very far from being the config file for pg_dump (it implements an option "--filter"). This is not the target. It is not designed for this. This is just an alternative for options like -t, -T, ... and I am sure that nobody will generate this file manually. The main target of this patch is eliminating problems with the max length of the command line. So it is really not designed to be the config file for pg_dump.
Overall, -1 from me on this approach. Maybe it could be fixed up to
handle all the different names of objects that we support today
(something which, imv, is really a clear requirement for this feature to
be committed), but I suspect you'd end up half-way to yet another
configuration format when we could be working to support something like
TOML or maybe YAML... but if you want my 2c, TOML seems closer to what
we do for postgresql.conf and getting that over to something that's
standardized, while a crazy long shot, is a general nice idea, imv.
I have nothing against TOML, but I don't see the sense of using it in this patch. This patch doesn't implement a config file for pg_dump, and I don't see any sense or benefit in one. TOML is designed for different purposes. TOML is good for manual creation, but that is not this case. Typical usage of this patch is something like the following, and TOML (or JSON) syntax is not good for this.
psql -c "select '+t' || quote_ident(relname) from pg_class where relname ..." | pg_dump --filter=/dev/stdin
I can imagine some benefits of saved configuration files for Postgres applications - but such a thing should be designed generally and implemented generally. Probably one would be used for pg_dump, psql, pg_restore, .... But that is a different feature with different usage. This patch doesn't implement an option "--config", it implements an option "--filter".
Regards
Pavel
Thanks,
Stephen
Hi
Perhaps this feature could co-exist with a full blown configuration for
pg_dump, but even then there's certainly issues with what's proposed-
how would you handle explicitly asking for a table which is named
" mytable" to be included or excluded? Or a table which has a newline
in it? Using a standardized format which supports the full range of
what we do in a table name, explicitly and clearly, would address these
issues and also give us the flexibility to extend the options which
could be used through the configuration file beyond just the filters in
the future.

This is the correct argument - I will check the possibility of using strange names, but there is the same possibility and functionality as we allow from the command line. So you can use double-quoted names. I'll check it.
I checked
echo "+t \"bad Name\"" | /usr/local/pgsql/master/bin/pg_dump --filter=/dev/stdin
It is working without any problem
Regards
Pavel
Hi
čt 24. 9. 2020 v 19:47 odesílatel Pavel Stehule <pavel.stehule@gmail.com> napsal:
Hi
rebase + minor change - using pg_get_line_buf instead of pg_get_line_append
fresh rebase
Regards
Pavel
Attachment
st 11. 11. 2020 v 6:32 odesílatel Pavel Stehule <pavel.stehule@gmail.com> napsal:
[...]
Some generic configuration for postgres binary applications is an interesting idea. And the TOML language could work well for this purpose. We can parametrize applications by command line and by system variables. But filtering objects is a really different case - although there is some small intersection, it will be used very differently, and I don't think one language can be practical for both cases. The object filtering is an independent feature, and both features can coexist together.
Regards
Pavel
On 2020-Nov-11, Pavel Stehule wrote:
> I think the proposed feature is very far from being the config file for pg_dump
> (it implements an option "--filter"). This is not the target. It is not
> designed for this. This is just an alternative for options like -t, -T, ...
> and I am sure that nobody will generate this file manually. The main target of
> this patch is eliminating problems with the max length of the command line.
> So it is really not designed to be the config file for pg_dump.

I agree that a replacement for existing command line arguments is a good goal, but at the same time it's good to keep in mind the request that more object types are supported as dumpable. While it's not necessary that this infrastructure supports all object types in the first cut, it'd be good to have it support that. I would propose that instead of a single letter 't' etc we support keywords, maybe similar to those returned by getObjectTypeDescription() (with additions -- for example for "table data"). Then we can extend for other object types later without struggling to find good letters for each.

Of course we could allow abbreviations for common cases, such that "t" means "table".

For example: it'll be useful to support selective dumping of functions, materialized views, foreign objects, etc.
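For illustration (an editor's sketch, not text from any patch in this thread), keyword-based rules built on getObjectTypeDescription()-style names might look like:

+table mytable
+materialized view sales_summary
-table data old_log_table
+function myschema.myfunc

Note that several of these type names contain spaces ("materialized view", "table data"), which any line-based parser would have to account for.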
st 11. 11. 2020 v 16:17 odesílatel Alvaro Herrera <alvherre@alvh.no-ip.org> napsal:
On 2020-Nov-11, Pavel Stehule wrote:
> I think the proposed feature is very far from being the config file for pg_dump
> (it implements an option "--filter"). This is not the target. It is not
> designed for this. This is just an alternative for options like -t, -T, ...
> and I am sure that nobody will generate this file manually. The main target of
> this patch is eliminating problems with the max length of the command line.
> So it is really not designed to be the config file for pg_dump.
I agree that a replacement for existing command line arguments is a good
goal, but at the same time it's good to keep in mind the request that
more object types are supported as dumpable. While it's not necessary
that this infrastructure supports all object types in the first cut,
it'd be good to have it support that. I would propose that instead of a
single letter 't' etc we support keywords, maybe similar to those
returned by getObjectTypeDescription() (with additions -- for example
for "table data"). Then we can extend for other object types later
without struggling to find good letters for each.
Of course we could allow abbreviations for common cases, such that "t"
means "table".
For example: it'll be useful to support selective dumping of functions,
materialized views, foreign objects, etc.
Implementation of this is trivial.
The hard work is mapping pg_dump options onto database objects. t -> table is simple, but n -> schema looks a little bit inconsistent - although it is consistent with pg_dump. d or D - there is no system object like "data". I am afraid that there are two independent systems - pg_dump options and database objects - and it can be hard or not very practical to join these systems. Unfortunately there is no good consistency in the short options of pg_dump today. Moreover, a lot of object type names are multi-word with inner spaces. This is not too practical.
What about supporting two syntaxes?
1. the current short +-tndf filter - but the implementation should not be limited to one char; there can be any string up to a space
2. a long syntax - all these pg_dump options have long variants, so we can extend this feature without any problem in the future:
table|exclude-table|exclude-table-data|schema|exclude-schema|include-foreign-data=PATTERN
so the content of filter file can looks like:
+t mytable
+t tabprefix*
-t bigtable
table=mytable2
exclude-table=mytable2
This format allows quick work for most common database objects, and it is extensible and consistent with pg_dump's long options.
What do you think about it?
Personally, I think that it is a little bit of over-engineering; maybe we can implement this feature by just testing the first string after the +- symbols (instead of the first char like now) - and any enhanced syntax can be implemented in the future when there is a requirement for it. The second syntax can be implemented very simply, because it can be identified by processing the first char. We could also implement only the second syntax. It would work too, but I think the short syntax is more practical for daily work (for common options). I expect that 99% of these entries will be "+t tablename".
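As an editor's sketch of that idea (a hypothetical helper, simplified and not taken from the patch), reading the whole word after the +/- symbol instead of a single character could look like this:

/* Hypothetical helper, not from the patch: split one filter line of the
 * form "[+-]WORD PATTERN", so that "+t foo" and a future "+table foo"
 * go through the same code path. objtype_sz must be > 0. */
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>

static bool
parse_filter_line(const char *line, bool *is_include,
                  char *objtype, size_t objtype_sz, const char **pattern)
{
    const char *p = line;
    size_t      n = 0;

    while (isspace((unsigned char) *p))
        p++;
    if (*p != '+' && *p != '-')
        return false;           /* blank lines and # comments handled elsewhere */
    *is_include = (*p == '+');
    p++;

    /* copy the object-type word up to the first whitespace */
    while (*p && !isspace((unsigned char) *p) && n < objtype_sz - 1)
        objtype[n++] = *p++;
    objtype[n] = '\0';

    while (isspace((unsigned char) *p))
        p++;
    *pattern = p;               /* rest of the line: possibly quoted name */
    return n > 0 && *p != '\0';
}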
Regards
Pavel
On Wed, Nov 11, 2020 at 06:49:43AM +0100, Pavel Stehule wrote:
> >> Perhaps this feature could co-exist with a full blown configuration for
> >> pg_dump, but even then there's certainly issues with what's proposed-
> >> how would you handle explicitly asking for a table which is named
> >> " mytable" to be included or excluded? Or a table which has a newline
> >> in it? Using a standardized format which supports the full range of
> >> what we do in a table name, explicitly and clearly, would address these
> >> issues and also give us the flexibility to extend the options which
> >> could be used through the configuration file beyond just the filters in
> >> the future.

I think it's a reasonable question - why would a new configuration file option include support for only a handful of existing arguments but not the rest.

> > This is the correct argument - I will check a possibility to use strange
> > names, but there is the same possibility and functionality like we allow
> > from the command line. So you can use double quoted names. I'll check it.
>
> I checked
> echo "+t \"bad Name\"" | /usr/local/pgsql/master/bin/pg_dump --filter=/dev/stdin
> It is working without any problem

I think it couldn't possibly work with newlines, since you call pg_get_line().

I realize that entering a newline into the shell would also be a PITA, but that could be one *more* reason to support a config file - to allow terrible table names to be in a file and avoid writing dash tee quote something enter else quote in a pg_dump command, or shell script.

I fooled with argument parsing to handle reading from a file in the quickest way. As written, this fails to handle multiple config files, and special table names, which need to support arbitrary, logical lines, with quotes surrounding newlines or other special chars. As written, the --config file is parsed *after* all other arguments, so it could override previous args (like --no-blobs --no-blogs, --file, --format, --compress, --lock-wait), which I guess is bad, so the config file should be processed *during* argument parsing. Unfortunately, I think that suggests duplicating parsing of all/most the argument parsing for config file support - I'd be happy if someone suggested a better way.

BTW, in your most recent patch:
s/empty rows/empty lines/
unbalanced parens: "invalid option type (use [+-]"

@cfbot: I renamed the patch so please ignore it.

--
Justin
Attachment
Greetings,

* Justin Pryzby (pryzby@telsasoft.com) wrote:
> On Wed, Nov 11, 2020 at 06:49:43AM +0100, Pavel Stehule wrote:
> > >> Perhaps this feature could co-exist with a full blown configuration for
> > >> pg_dump, but even then there's certainly issues with what's proposed-
> > >> how would you handle explicitly asking for a table which is named
> > >> " mytable" to be included or excluded? Or a table which has a newline
> > >> in it? Using a standardized format which supports the full range of
> > >> what we do in a table name, explicitly and clearly, would address these
> > >> issues and also give us the flexibility to extend the options which
> > >> could be used through the configuration file beyond just the filters in
> > >> the future.
>
> I think it's a reasonable question - why would a new configuration file option
> include support for only a handful of existing arguments but not the rest.

Even if the first version of having a config file for pg_dump only supported some options, that would be reasonable imv, but I dislike the idea of building it in such a way that it'll be awkward to add more options to it in the future, something that I definitely think people would like to see (I know I would...).

> > > This is the correct argument - I will check a possibility to use strange
> > > names, but there is the same possibility and functionality like we allow
> > > from the command line. So you can use double quoted names. I'll check it.
> >
> > I checked
> > echo "+t \"bad Name\"" | /usr/local/pgsql/master/bin/pg_dump --filter=/dev/stdin
> > It is working without any problem
>
> I think it couldn't possibly work with newlines, since you call pg_get_line().

Yeah, I didn't really believe that it actually worked but hadn't had a chance to demonstrate that it didn't yet.

> I realize that entering a newline into the shell would also be a PITA, but that
> could be one *more* reason to support a config file - to allow terrible table
> names to be in a file and avoid writing dash tee quote something enter else
> quote in a pg_dump command, or shell script.

Agreed.

> I fooled with argument parsing to handle reading from a file in the quickest
> way. As written, this fails to handle multiple config files, and special table
> names, which need to support arbitrary, logical lines, with quotes surrounding
> newlines or other special chars. As written, the --config file is parsed
> *after* all other arguments, so it could override previous args (like
> --no-blobs --no-blogs, --file, --format, --compress, --lock-wait), which I
> guess is bad, so the config file should be processed *during* argument parsing.
> Unfortunately, I think that suggests duplicating parsing of all/most the
> argument parsing for config file support - I'd be happy if someone suggested a
> better way.

This still feels like we're trying to quickly hack-and-slash at adding a config file option rather than thinking through what a sensible design for a pg_dump config file would look like. Having a way to avoid having multiple places in the code that has to handle all the possible options is a nice idea but, as I tried to allude to up-thread, I fully expect that once we've got this config file capability that we're going to want to add things to it that would be difficult to utilize through the command-line and so I expect these code paths to diverge anyway.

I would imagine something like:

[connection]
db-host=whatever
db-port=5433
...

[output]
owners = true
privileges = false
format = "custom"
file = "myoutput.dump"  # This is a comment

[include]
tables = [
    "sometable",
    "table with spaces",
    "table with quoted \"",
    """this is my table
with a carriage return""",
    "anothertable"
]
table-patterns = [ "table*" ]
schemas = [ "myschema" ]

[exclude]
tables = [ "similar to include" ]
functions = [ "somefunction(int)" ]

etc, etc ...

Thanks,

Stephen
Attachment
út 17. 11. 2020 v 22:53 odesílatel Justin Pryzby <pryzby@telsasoft.com> napsal:
On Wed, Nov 11, 2020 at 06:49:43AM +0100, Pavel Stehule wrote:
> Perhaps this feature could co-exist with a full blown configuration for
> >> pg_dump, but even then there's certainly issues with what's proposed-
> >> how would you handle explicitly asking for a table which is named
> >> " mytable" to be included or excluded? Or a table which has a newline
> >> in it? Using a standardized format which supports the full range of
> >> what we do in a table name, explicitly and clearly, would address these
> >> issues and also give us the flexibility to extend the options which
> >> could be used through the configuration file beyond just the filters in
> >> the future.
I think it's a reasonable question - why would a new configuration file option
include support for only a handful of existing arguments but not the rest.
I don't see a strong technical problem - enhancing parsing is not hard work, but I miss a use case for this. The option "--filter" tries to solve a problem with limited command line size. This is a clean use case, and the supported options are those that can be used repeatedly on the command line. Nothing less, nothing more. The format that is used is designed just for this purpose.
If we were to implement an alternative configuration to the command line and system environment, then the use case should be defined first. When the use case is defined, we can talk about implementation and about a good format. There are a lot of interesting formats, but I miss a reason why this alternative configuration would be helpful for pg_dump. Using external libraries for richer formats means a new dependency, the necessity of solving portability issues, and maybe other issues, and for that there should be a good use case. Passing a list of tables for dumping doesn't need a rich format.
I cannot imagine using a config file with generated object names and some other options together. Maybe if these configurations are not too long (and thus hand-written), a configuration file can be usable. But when I think about using pg_dump from some bash scripts, then it is much more practical to use the usual command line options and pass the list of objects by pipe. I really miss the use case for a special pg_dump config file, and if there is one, it is very different from the use case for the "--filter" option.
> > This is the correct argument - I will check a possibility to use strange
> > names, but there is the same possibility and functionality like we allow
> > from the command line. So you can use double quoted names. I'll check it.
>
> I checked
> echo "+t \"bad Name\"" | /usr/local/pgsql/master/bin/pg_dump --filter=/dev/stdin
> It is working without any problem
I think it couldn't possibly work with newlines, since you call pg_get_line().
I realize that entering a newline into the shell would also be a PITA, but that
could be one *more* reason to support a config file - to allow terrible table
names to be in a file and avoid writing dash tee quote something enter else
quote in a pg_dump command, or shell script.
The new patch works with names that contain newlines:
[pavel@localhost postgresql.master]$ psql -At -X -c "select '+t ' || quote_ident(table_name) from information_schema.tables where table_name like 'foo%'"| /usr/local/pgsql/master/bin/pg_dump --filter=/dev/stdin
--
-- PostgreSQL database dump
--
-- Dumped from database version 14devel
-- Dumped by pg_dump version 14devel
--
-- Name: foo boo; Type: TABLE; Schema: public; Owner: pavel
--
CREATE TABLE public."foo
boo" (
a integer
);
ALTER TABLE public."foo
boo" OWNER TO pavel;
--
-- Data for Name: foo boo; Type: TABLE DATA; Schema: public; Owner: pavel
--
COPY public."foo
boo" (a) FROM stdin;
\.
--
-- PostgreSQL database dump complete
--
I fooled with argument parsing to handle reading from a file in the quickest
way. As written, this fails to handle multiple config files, and special table
names, which need to support arbitrary, logical lines, with quotes surrounding
newlines or other special chars. As written, the --config file is parsed
*after* all other arguments, so it could override previous args (like
--no-blobs --no-blogs, --file, --format, --compress, --lock-wait), which I
guess is bad, so the config file should be processed *during* argument parsing.
Unfortunately, I think that suggests duplicating parsing of all/most the
argument parsing for config file support - I'd be happy if someone suggested a
better way.
BTW, in your most recent patch:
s/empty rows/empty lines/
unbalanced parens: "invalid option type (use [+-]"
should be fixed now, thank you for checking
Regards
Pavel
@cfbot: I renamed the patch so please ignore it.
--
Justin
Attachment
BTW, in your most recent patch:
s/empty rows/empty lines/
unbalanced parens: "invalid option type (use [+-]"

should be fixed now, thank you for checking
minor update - fixed handling of processing names with double quotes inside
Regards
Pavel
@cfbot: I renamed the patch so please ignore it.
--
Justin
Attachment
On Thu, 19 Nov 2020 at 19:57, Pavel Stehule <pavel.stehule@gmail.com> wrote:
>
> minor update - fixed handling of processing names with double quotes inside
>

I see this is marked RFC, but reading the thread it doesn't feel like we have reached consensus on the design for this feature.

I agree that being able to configure pg_dump via a config file would be very useful, but the syntax proposed here feels much more like a hacked-up syntax designed to meet this one use case, rather than a good general-purpose design that can be easily extended.

IMO, a pg_dump config file should be able to specify all options currently supported through the command line, and vice versa (subject to command line length limits), with a single common code path for handling options. That way, any new options we add will work on the command line and in config files. Likewise, the user should only need to learn one set of options, and have the choice of specifying them on the command line or in a config file (or a mix of both).

I can imagine eventually supporting multiple different file formats, each just being a different representation of the same data, so perhaps this could work with 2 new options:

--option-file-format=plain|yaml|json|...
--option-file=filename

with "plain" being the default initial implementation, which might be something like our current postgresql.conf file format.

Also, I think we should allow multiple "--option-file" arguments (e.g., to list different object types in different files), and for a config file to contain its own "--option-file" arguments, to allow config files to include other config files.

The current design feels far too limited to me, and requires new code and new syntax to be added each time we extend it, so I'm -1 on this patch as it stands.

Regards,
Dean
Hi
st 25. 11. 2020 v 19:25 odesílatel Dean Rasheed <dean.a.rasheed@gmail.com> napsal:
On Thu, 19 Nov 2020 at 19:57, Pavel Stehule <pavel.stehule@gmail.com> wrote:
>
> minor update - fixed handling of processing names with double quotes inside
>
I see this is marked RFC, but reading the thread it doesn't feel like
we have reached consensus on the design for this feature.
I agree that being able to configure pg_dump via a config file would
be very useful, but the syntax proposed here feels much more like a
hacked-up syntax designed to meet this one use case, rather than a
good general-purpose design that can be easily extended.
But I don't understand why. What is the use case? What is the benefit over the command line, or libpq variables? And why should config files be a better solution for the limited length of the command line, when I need to dump thousands of exactly specified tables?
Regards
Pavel
IMO, a pg_dump config file should be able to specify all options
currently supported through the command line, and vice versa (subject
to command line length limits), with a single common code path for
handling options. That way, any new options we add will work on the
command line and in config files. Likewise, the user should only need
to learn one set of options, and have the choice of specifying them on
the command line or in a config file (or a mix of both).
I can imagine eventually supporting multiple different file formats,
each just being a different representation of the same data, so
perhaps this could work with 2 new options:
--option-file-format=plain|yaml|json|...
--option-file=filename
with "plain" being the default initial implementation, which might be
something like our current postgresql.conf file format.
Also, I think we should allow multiple "--option-file" arguments
(e.g., to list different object types in different files), and for a
config file to contain its own "--option-file" arguments, to allow
config files to include other config files.
The current design feels far too limited to me, and requires new code
and new syntax to be added each time we extend it, so I'm -1 on this
patch as it stands.
This new syntax tries to be consistent and simple. It really doesn't try to implement an alternative configuration file for pg_dump. The code is simple and can be easily extended.
What are the benefits of supporting multiple formats?
Regards,
Dean
Pavel Stehule <pavel.stehule@gmail.com> writes:
> st 25. 11. 2020 v 19:25 odesílatel Dean Rasheed <dean.a.rasheed@gmail.com>
> napsal:
>> I agree that being able to configure pg_dump via a config file would
>> be very useful, but the syntax proposed here feels much more like a
>> hacked-up syntax designed to meet this one use case, rather than a
>> good general-purpose design that can be easily extended.

> But I don't understand why. What is the use case? What is the benefit over
> the command line, or libpq variables? And why should config files be a
> better solution for the limited length of the command line, when I need to
> dump thousands of exactly specified tables?

Because next week somebody will want to dump thousands of functions selected by name, or schemas selected by name, etc etc. I agree with the position that we don't want a single-purpose solution. The idea that the syntax should match the command line switch syntax seems reasonable, though I'm not wedded to it. (One thing to consider is how painful will it be for people to quote table names containing funny characters, for instance. On the command line, we largely depend on the shell's quoting behavior to solve that, but we'd not have that infrastructure when reading from a file.)

> What are the benefits of supporting multiple formats?

Yeah, that part of Dean's sketch seemed like overkill to me too. I wasn't very excited about multiple switch files either, though depending on how the implementation is done, that could be simple enough to be in the might-as-well category.

One other point that I'm wondering about is that there's really no value in doing anything here until you get to some thousands of table names; as long as the list fits in the shell's command line length limit, you might as well just make a shell script file. Does pg_dump really have sane performance for that situation, or are we soon going to be fielding requests to make it not be O(N^2) in the number of listed tables?

regards, tom lane
st 25. 11. 2020 v 21:00 odesílatel Tom Lane <tgl@sss.pgh.pa.us> napsal:
Pavel Stehule <pavel.stehule@gmail.com> writes:
> st 25. 11. 2020 v 19:25 odesílatel Dean Rasheed <dean.a.rasheed@gmail.com>
> napsal:
>> I agree that being able to configure pg_dump via a config file would
>> be very useful, but the syntax proposed here feels much more like a
>> hacked-up syntax designed to meet this one use case, rather than a
>> good general-purpose design that can be easily extended.
> But I don't understand why. What is the use case? What is the benefit over
> the command line, or libpq variables? And why should config files be a
> better solution for the limited length of the command line, when I need to
> dump thousands of exactly specified tables?
Because next week somebody will want to dump thousands of functions
selected by name, or schemas selected by name, etc etc. I agree with
the position that we don't want a single-purpose solution. The idea
that the syntax should match the command line switch syntax seems
reasonable, though I'm not wedded to it. (One thing to consider is
how painful will it be for people to quote table names containing
funny characters, for instance. On the command line, we largely
depend on the shell's quoting behavior to solve that, but we'd not
have that infrastructure when reading from a file.)
This is not a problem with the current patch - the last version of this patch supports obscure names well.
There was a requirement for supporting all current and future pg_dump options - ok, it can make sense. I have no problem with using a line format instead:
"option argument" or "long-option=argument"
This format - can it be a solution? I'll try to rewrite the parser for this format.
It is implementable, but it is in collision with Stephen's requirement for a human-readable format designed for hand writing. These requests have no intersection. A well-readable format needs a more complex parser. And machine generation in this format needs more work - generating a flat file is simpler and more robust than generating JSON or YAML.
> What are the benefits of supporting multiple formats?
Yeah, that part of Dean's sketch seemed like overkill to me too.
I wasn't very excited about multiple switch files either, though
depending on how the implementation is done, that could be simple
enough to be in the might-as-well category.
One other point that I'm wondering about is that there's really no
value in doing anything here until you get to some thousands of
table names; as long as the list fits in the shell's command line
length limit, you might as well just make a shell script file.
Does pg_dump really have sane performance for that situation, or
are we soon going to be fielding requests to make it not be O(N^2)
in the number of listed tables?
Performance is another factor, but the command line limit can easily be hit when table names have maximum width.
regards, tom lane
On Thu, 26 Nov 2020 at 06:43, Pavel Stehule <pavel.stehule@gmail.com> wrote:
>
> st 25. 11. 2020 v 21:00 odesílatel Tom Lane <tgl@sss.pgh.pa.us> napsal:
>>
>> (One thing to consider is
>> how painful will it be for people to quote table names containing
>> funny characters, for instance. On the command line, we largely
>> depend on the shell's quoting behavior to solve that, but we'd not
>> have that infrastructure when reading from a file.)
>
> This is not a problem with the current patch - the last version of this patch supports obscure names well.
>

Actually, that raises a different possible benefit of passing options in an options file -- if the user wants to pass in a table name pattern, it can be a nuisance if the shell's argument processing does additional unwanted things like globbing and environment variable substitutions. Using an options file could provide a handy way to ensure that any option values are interpreted exactly as written, without any additional mangling.

> There was a requirement for supporting all current and future pg_dump options - ok, it can make sense. I have no problem with using a line format instead:
>
> "option argument" or "long-option=argument"
>
> This format - can it be a solution? I'll try to rewrite the parser for this format.

Yes, that's the sort of thing I was thinking of, to make the feature more general-purpose.

> It is implementable, but it is in collision with Stephen's requirement for a human-readable format designed for hand writing. These requests have no intersection. A well-readable format needs a more complex parser. And machine generation in this format needs more work - generating a flat file is simpler and more robust than generating JSON or YAML.
>

To be clear, I wasn't suggesting that this patch implement multiple formats. Just the "plain" format above. Other formats like YAML might well be more human-readable, and be able to take advantage of values that are lists to avoid repeating option names, and they would have the advantage of being readable and writable by other standard tools, which might be useful. But I think such things would be for the future. Maybe no one will ever add support for other formats, or maybe someone will just write a separate external tool to convert YAML or JSON to our plain format. I don't know. But supporting all pg_dump options makes more things possible.

>> I wasn't very excited about multiple switch files either, though
>> depending on how the implementation is done, that could be simple
>> enough to be in the might-as-well category.
>>

That's what I was hoping.

Regards,
Dean
Dean Rasheed <dean.a.rasheed@gmail.com> writes:
> Actually, that raises a different possible benefit of passing options
> in an options file -- if the user wants to pass in a table name
> pattern, it can be a nuisance if the shell's argument processing does
> additional unwanted things like globbing and environment variable
> substitutions. Using an options file could provide a handy way to
> ensure that any option values are interpreted exactly as written,
> without any additional mangling.

Huh? Any format we might devise, or borrow, will have to have some kind of escaping/quoting convention. The idea that "we don't need that" tends to lead to very ugly workarounds later.

I do agree that the shell's quoting conventions are pretty messy and so those aren't the ones we should borrow. We could do a lot worse than to use some established data format like JSON or YAML. Given that we already have src/common/jsonapi.c, it seems like JSON would be the better choice of those two.

regards, tom lane
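For illustration only (an editor's sketch; no patch in this thread proposes this exact layout), a JSON-based filter could spell the hard cases out explicitly:

{
  "include": {
    "tables": ["mytable", " mytable", "table\nwith a newline"],
    "schemas": ["myschema"]
  },
  "exclude": {
    "tables": ["bigtable"]
  }
}

The leading-space and embedded-newline names raised up-thread are unambiguous here, because JSON strings come with a fixed escaping convention.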
Greetings,

* Tom Lane (tgl@sss.pgh.pa.us) wrote:
> Dean Rasheed <dean.a.rasheed@gmail.com> writes:
> > Actually, that raises a different possible benefit of passing options
> > in an options file -- if the user wants to pass in a table name
> > pattern, it can be a nuisance if the shell's argument processing does
> > additional unwanted things like globbing and environment variable
> > substitutions. Using an options file could provide a handy way to
> > ensure that any option values are interpreted exactly as written,
> > without any additional mangling.
>
> Huh? Any format we might devise, or borrow, will have to have some
> kind of escaping/quoting convention. The idea that "we don't need
> that" tends to lead to very ugly workarounds later.

Agreed.

> I do agree that the shell's quoting conventions are pretty messy
> and so those aren't the ones we should borrow. We could do a lot
> worse than to use some established data format like JSON or YAML.
> Given that we already have src/common/jsonapi.c, it seems like
> JSON would be the better choice of those two.

JSON doesn't support comments, something that's really useful to have in configuration files, so I don't agree that it's a sensible thing to use in this case. JSON also isn't very forgiving, which is also unfortunate and makes for a poor choice.

This is why I was suggesting TOML up-thread, which is MIT licensed, has been around for a number of years, supports comments, has sensible quoting that's easier to deal with than the shell, and has a C (C99) implementation. It's also used in quite a few other projects. In a quick look, I suspect it might also be something that could be used to replace our existing hand-hacked postgresql.conf parser and give us the ability to handle things a bit cleaner there too...

Thanks,

Stephen
Attachment
Greetings,

* Pavel Stehule (pavel.stehule@gmail.com) wrote:
> > I agree that being able to configure pg_dump via a config file would
> > be very useful, but the syntax proposed here feels much more like a
> > hacked-up syntax designed to meet this one use case, rather than a
> > good general-purpose design that can be easily extended.
>
> Nobody sent a real use case for introducing the config file. There was a
> discussion about formats, and you introduce other dimensions and
> variability.

I'm a bit baffled by this because it seems abundantly clear to me that being able to have a config file for pg_dump would be extremely helpful. There's no shortage of times that I've had to hack up a shell script and figure out quoting and set up the right set of options for pg_dump, resulting in things like:

pg_dump \
  --host=myserver.com \
  --username=postgres \
  --schema=public \
  --schema=myschema \
  --no-comments \
  --no-tablespaces \
  --file=somedir \
  --format=d \
  --jobs=5

which really is pretty grotty. Being able to have a config file that has proper comments would be much better and we could start to extend to things like "please export schema A to directory A, schema B to directory B" and other ways of selecting source and destination, and imagine if we could validate it too, eg:

pg_dump --config=whatever --dry-run

or --check-config maybe.

This isn't a new concept either- export and import tools for other databases have similar support, eg: Oracle's imp/exp tool, mysqldump (see: https://dev.mysql.com/doc/refman/8.0/en/option-files.html which has a TOML-looking format too), pgloader of course has a config file, etc. We certainly aren't in novel territory here.

Thanks,

Stephen
Attachment
Hi
st 25. 11. 2020 v 21:00 odesílatel Tom Lane <tgl@sss.pgh.pa.us> napsal:
[...]
Here is a fresh implementation. I used the name of a new option - "options-file". It looks more accurate than "config", but the name can be changed easily anytime.
Any short or long option can be read from this file in a simple format - one option per line. Arguments inside double quotes can span multiple lines. Line comments start with # and can be used anywhere.
The implementation is very generic - supporting new options doesn't require changing this new code. The parser ignores white space almost everywhere it makes sense.
The option should start with "-" or "--" in the options file too, because this is necessary to detect whether the option is short or long.
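For example (an illustrative file following the rules just described; the option values are made up), an options file could look like this:

# dump two schemas in directory format
--format=d
--file=somedir        # a comment may also follow an option
-n public
--schema=myschema
-t "some table
with a newline in its name"

It would then be used as pg_dump --options-file=my.opts mydb.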
Regards
Pavel
Attachment
pá 27. 11. 2020 v 19:45 odesílatel Stephen Frost <sfrost@snowman.net> napsal:
Greetings,
* Pavel Stehule (pavel.stehule@gmail.com) wrote:
> > I agree that being able to configure pg_dump via a config file would
> > be very useful, but the syntax proposed here feels much more like a
> > hacked-up syntax designed to meet this one use case, rather than a
> > good general-purpose design that can be easily extended.
>
> Nobody sent a real use case for introducing the config file. There was a
> discussion about formats, and you introduce other dimensions and
> variability.
I'm a bit baffled by this because it seems abundantly clear to me that
being able to have a config file for pg_dump would be extremely helpful.
There's no shortage of times that I've had to hack up a shell script and
figure out quoting and set up the right set of options for pg_dump,
resulting in things like:
pg_dump \
--host=myserver.com \
--username=postgres \
--schema=public \
--schema=myschema \
--no-comments \
--no-tablespaces \
--file=somedir \
--format=d \
--jobs=5
which really is pretty grotty. Being able to have a config file that
has proper comments would be much better and we could start to extend to
things like "please export schema A to directory A, schema B to
directory B" and other ways of selecting source and destination, and
imagine if we could validate it too, eg:
pg_dump --config=whatever --dry-run
or --check-config maybe.
This isn't a new concept either- export and import tools for other
databases have similar support, eg: Oracle's imp/exp tool, mysqldump
(see: https://dev.mysql.com/doc/refman/8.0/en/option-files.html which
has a TOML-looking format too), pgloader of course has a config file,
etc. We certainly aren't in novel territory here
Still, I am not a fan of this. pg_dump is a simple tool for simple purposes. It is not pgloader or any ETL tool. It can be changed in the future, maybe, but still, why? And at any time there will be a question whether pg_dump is a good foundation for a massive enhancement in the ETL direction. Development in C is expensive and pg_dump is too Postgres specific, so I cannot imagine pg_dump being used for some complex tasks directly, with requirements for special configuration. When we have pgloader, then we don't need to move pg_dump in the pgloader direction.
Anyway - the new patch allows storing any options (one per line), with possible comments (anywhere in a line), and arguments can span multiple lines. It doesn't have any additional memory or CPU requirements.
Regards
Pavel
Thanks,
Stephen
On Sat, Nov 28, 2020 at 09:14:35PM +0100, Pavel Stehule wrote:
> Any short or long option can be read from this file in simple format - one
> option per line. Arguments inside double quotes can be multi lined. Row
> comments started by # and can be used everywhere.

Does this support even funkier table names ?

This tests a large number and fraction of characters in dbname/username, so all of pg_dump has to continue supporting that:
./src/bin/pg_dump/t/010_dump_connstr.pl

I tested and it seems to work with -t "fooå"
But it didn't work with -t "foo\nbar" (literal newline). Fix attached.

If you send another patch, please consider including a test case for quoted names in long and short options.

> +static char *optsfilename = NULL;

> + * It assign the values of options to related DumpOption fields or to
> + * some global values. It is called from twice. First, for processing
> + * the command line argumens. Second, for processing an options from
> + * options file.

This didn't support multiple config files, nor config files which include config files, as Dean and I mentioned. I think the argument parsers should themselves call the config file parser, as need be, so the last option specification should override previous ones.

For example pg_dump --config-file=./pg_dump.conf --blobs should have blobs even if the config file says --no-blobs. (Command-line arguments normally take precedence over config files, certainly if the argument is specified "later"). I think it'd be ok if it's recursive. I made a quick hack to do that.

I doubt this will satisfy Stephen. Personally, I would use this if it were a plain and simple text config file (which for our purposes I would pass on stdin), and I would almost certainly not use it if it were json. But it'd be swell if there were a standard config file format, that handled postgresql.conf and maybe pg_hba.conf.

--
Justin
Attachment
Hi
ne 29. 11. 2020 v 0:49 odesílatel Justin Pryzby <pryzby@telsasoft.com> napsal:
On Sat, Nov 28, 2020 at 09:14:35PM +0100, Pavel Stehule wrote:
> Any short or long option can be read from this file in simple format - one
> option per line. Arguments inside double quotes can be multi lined. Row
> comments started by # and can be used everywhere.
Here is an updated patch.
Does this support even funkier table names ?
This tests a large number and fraction of characters in dbname/username, so all
of pg_dump has to continue supporting that:
./src/bin/pg_dump/t/010_dump_connstr.pl
I tested and it seems to work with -t "fooå"
But it didn't work with -t "foo\nbar" (literal newline). Fix attached.
If you send another patch, please consider including a test case for quoted
names in long and short options.
I implemented some basic backslash escaping. I will write more tests once there is good agreement on the main concept.
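Assuming the new escaping maps \n inside a double-quoted name to a newline (an illustration, not verified against the patch), Justin's problem name could then be written on a single physical line:

echo '+t "foo\nbar"' | pg_dump --filter=/dev/stdin

The single quotes make the shell pass the backslash through literally, so the filter parser sees and expands the escape itself.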
> +static char *optsfilename = NULL;
> + * It assign the values of options to related DumpOption fields or to
> + * some global values. It is called from twice. First, for processing
> + * the command line argumens. Second, for processing an options from
> + * options file.
This didn't support multiple config files, nor config files which include
config files, as Dean and I mentioned. I think the argument parsers should
themselves call the config file parser, as need be, so the last option
specification should override previous ones.
For example pg_dump --config-file=./pg_dump.conf --blobs should have blobs even
if the config file says --no-blobs. (Command-line arguments normally take
precedence over config files, certainly if the argument is specified "later").
I think it'd be ok if it's recursive. I made a quick hack to do that.
I did it. I used a different design than you. Making "dopt" a global variable looks too invasive. Almost all functions there expect "dopt" as an argument. But I think it is not necessary.
I implemented two iterations of argument processing: first for options-file options (multiple options-file options are allowed, and nesting is allowed too), second for all other arguments from the command line. Any options file is processed only once - a second processing is ignored. So there is no problem with cycles.
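To illustrate the nesting (hypothetical file names and contents):

# main.opts
--options-file=tables.opts    # nested file; each file is processed only once
--format=custom
--file=out.dump

# tables.opts
-t table1
-t table2

pg_dump --options-file=main.opts mydb

Since tables.opts is remembered after the first read, a second reference to it (directly or through a cycle) would be ignored.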
The name of the new option - "config-file" or "options-file"? I prefer "options-file". "config-file" is valid too, but "options-file" is more specific and more descriptive (it is self-descriptive).
I merged your patch with a fix of typos.
Regards
Pavel
I doubt this will satisfy Stephen. Personally, I would use this if it were a
plain and simple text config file (which for our purposes I would pass on
stdin), and I would almost certainly not use it if it were json. But it'd be
swell if there were a standard config file format, that handled postgresql.conf
and maybe pg_hba.conf.
--
Justin
Attachment
Hi
rebase
Regards
Pavel
Attachment
Hi
út 16. 2. 2021 v 20:32 odesílatel Pavel Stehule <pavel.stehule@gmail.com> napsal:
Hi
rebase
fresh rebase
Regards
Pavel
Regards
Pavel
Attachment
Hi
ne 11. 4. 2021 v 9:48 odesílatel Pavel Stehule <pavel.stehule@gmail.com> napsal:
rebase
fresh rebase

Regards
Pavel
Attachment
Hi,

I started looking at the patch allowing to export just functions [1], and I got pointed to this patch as an alternative approach (to adding a separate filtering option for every possible object type). I'm familiar with the customer that inspired Pavel to start working on this, so I understand the use case he's trying to address - a flexible way to filter (include/exclude) large numbers of objects.

IMHO it's a mistake to try to broaden the scope of the patch and require implementing some universal pg_dump config file, particularly if it requires "complex" structure or formats like JSON, TOML or whatever. Maybe that's worth doing, but in my mind it's orthogonal to what this patch aims (or aimed) to do - filtering objects using rules in a file, not on the command line. I believe it's much closer to .gitignore or rsync --filter than to a full config file. Even if we end up implementing the pg_dump config file, it'd be nice to keep the filter rules in a separate file and just reference that file from the config file.

That also means I find it pointless to use an "advanced" format like JSON or TOML - I think the format should be as simple as possible. Yes, it has to support all valid identifiers, comments and so on. But I don't quite see a point in using JSON or a similar "full" format. If a simple format is good enough for rsync or gitignore, why should we insist on using something more complex?

OTOH I don't quite like the current approach of simply reading options from a file, because that requires adding new command-line options for each type of object we want to support. Which seems to contradict the idea of a "general filter" method as mentioned in [1].

So if it was up to me, I'd go back to the original format or something close to it. So something like this:

[+-] OBJECT_TYPE_PATTERN OBJECT_NAME_PATTERN

regards

[1] https://commitfest.postgresql.org/33/3051/

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
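As an illustration of that rule format (an editor's sketch with made-up names):

# one rule per line: [+-] OBJECT_TYPE_PATTERN OBJECT_NAME_PATTERN
+ table public.orders_*
- table public.orders_archive
+ function reporting.refresh_*

Like .gitignore or rsync --filter rules, such a file stays trivially greppable and easy to generate from a query.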
> On 10 Jul 2021, at 17:47, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:

> So if it was up to me, I'd go back to the original format or something close to it. So something like this:
>
> [+-] OBJECT_TYPE_PATTERN OBJECT_NAME_PATTERN

That still leaves the parsing with quoting and escaping that needs to be done less trivial and more bespoke than what meets the eye, no?

As mentioned upthread, I'm still hesitant to add a file format which doesn't have any version information of sorts for distinguishing it from when the inevitable "now wouldn't it be nice if we could do this too" patch which we all know will come. The amount of selectivity switches we have for pg_dump is an indication about just how much control users like; this will no doubt be subject to the same.

--
Daniel Gustafsson https://vmware.com/
On 7/13/21 12:08 AM, Daniel Gustafsson wrote:
>> On 10 Jul 2021, at 17:47, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
>
>> So if it was up to me, I'd go back to the original format or something close to it. So something like this:
>>
>> [+-] OBJECT_TYPE_PATTERN OBJECT_NAME_PATTERN
>
> That still leaves the parsing with quoting and escaping that needs to be done
> less trivial and more bespoke than what meets the eye, no?
>

Yes, it'd require proper escaping/quoting of the fields/identifiers etc.

> As mentioned upthread, I'm still hesitant to add a file format which doesn't
> have any version information of sorts for distinguishing it from when the
> inevitable "now wouldn't it be nice if we could do this too" patch which we all
> know will come. The amount of selectivity switches we have for pg_dump is an
> indication about just how much control users like; this will no doubt be
> subject to the same.
>

I'm not going to fight against some sort of versioning, but I think keeping the scope as narrow as possible would make it unnecessary. That is, let's stick to the original goal to allow passing filtering rules that would not fit on the command-line, and maybe let's make it a bit more flexible to support other object types etc.

IMHO the filtering rules are simple enough to not really need elaborate versioning, and if a more advanced rule is proposed in the future it can be supported in the existing format (extra field, ...). Of course, maybe my imagination is not wild enough.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
On 2021-Jul-13, Tomas Vondra wrote:

> I'm not going to fight against some sort of versioning, but I think keeping
> the scope as narrow as possible would make it unnecessary. That is, let's
> stick to the original goal to allow passing filtering rules that would not
> fit on the command-line, and maybe let's make it a bit more flexible to
> support other object types etc.
>
> IMHO the filtering rules are simple enough to not really need elaborate
> versioning, and if a more advanced rule is proposed in the future it can be
> supported in the existing format (extra field, ...).

I don't understand why is versioning needed for this file. Surely we can just define some line-based grammar that's accepted by the current pg_dump[1] and that would satisfy the current need as well as allowing for extending the grammar in the future; even JSON or Windows-INI format (ugh?) if that's necessary to tailor the output file in some other way not covered by that.

[1] your proposal of "[+-] OBJTYPE OBJIDENT" plus empty lines allowed plus lines starting with # are comments, seems plenty. Any line not following that format would cause an error to be thrown.

--
Álvaro Herrera PostgreSQL Developer — https://www.EnterpriseDB.com/
Alvaro Herrera <alvherre@2ndquadrant.com> writes:
> [1] your proposal of "[+-] OBJTYPE OBJIDENT" plus empty lines allowed
> plus lines starting with # are comments, seems plenty. Any line not
> following that format would cause an error to be thrown.

I'd like to see some kind of keyword on each line, so that we could extend
the command set by adding new keywords. As this stands, I fear we'd end
up using random punctuation characters in place of [+-], which seems
pretty horrid from a readability standpoint.

I think that this file format should be designed with an eye to allowing
every, or at least most, pg_dump options to be written in the file rather
than on the command line. I don't say we have to *implement* that right
now; but if the format spec is incapable of being extended to meet
requests like that one, I think we'll regret it. This line of thought
suggests that the initial commands ought to match the existing
include/exclude switches, at least approximately.

Hence I suggest

	include table PATTERN
	exclude table PATTERN

which ends up being the above but with words not [+-].

			regards, tom lane
Greetings,

* Tom Lane (tgl@sss.pgh.pa.us) wrote:
> Alvaro Herrera <alvherre@2ndquadrant.com> writes:
>> [1] your proposal of "[+-] OBJTYPE OBJIDENT" plus empty lines allowed
>> plus lines starting with # are comments, seems plenty. Any line not
>> following that format would cause an error to be thrown.
>
> I'd like to see some kind of keyword on each line, so that we could extend
> the command set by adding new keywords. As this stands, I fear we'd end
> up using random punctuation characters in place of [+-], which seems
> pretty horrid from a readability standpoint.

I agree that it'd end up being bad with single characters.

> I think that this file format should be designed with an eye to allowing
> every, or at least most, pg_dump options to be written in the file rather
> than on the command line. I don't say we have to *implement* that right
> now; but if the format spec is incapable of being extended to meet
> requests like that one, I think we'll regret it. This line of thought
> suggests that the initial commands ought to match the existing
> include/exclude switches, at least approximately.

I agree that we want to have an actual config file that allows just about every pg_dump option. I'm also fine with saying that we don't have to implement that initially, but the format should be one which can be extended to allow that.

> Hence I suggest
>
> 	include table PATTERN
> 	exclude table PATTERN
>
> which ends up being the above but with words not [+-].

Which ends up inventing yet-another-file-format which people will end up writing generators and parsers for. Which is exactly what I was arguing we really should be trying to avoid doing.

I definitely feel that we should have a way to allow anything that can be created as an object in the database to be explicitly included in the file, and that means whatever we do needs to be able to handle objects that have names that span multiple lines, etc. It's not clear how the above would. As I recall, the proposed patch didn't have anything for handling that, which was one of the issues I had with it and is why I bring it up again.

Thanks,

Stephen
Attachment
On 7/13/21 3:40 PM, Stephen Frost wrote:
> Greetings,
>
> * Tom Lane (tgl@sss.pgh.pa.us) wrote:
>> Alvaro Herrera <alvherre@2ndquadrant.com> writes:
>>> [1] your proposal of "[+-] OBJTYPE OBJIDENT" plus empty lines allowed
>>> plus lines starting with # are comments, seems plenty. Any line not
>>> following that format would cause an error to be thrown.
>>
>> I'd like to see some kind of keyword on each line, so that we could extend
>> the command set by adding new keywords. As this stands, I fear we'd end
>> up using random punctuation characters in place of [+-], which seems
>> pretty horrid from a readability standpoint.
>
> I agree that it'd end up being bad with single characters.

The [+-] format is based on what rsync does, so there's at least some precedent for that, and IMHO it's fairly readable. I agree the rest of the rule (object type, ...) may be a bit more verbose.

>> I think that this file format should be designed with an eye to allowing
>> every, or at least most, pg_dump options to be written in the file rather
>> than on the command line. I don't say we have to *implement* that right
>> now; but if the format spec is incapable of being extended to meet
>> requests like that one, I think we'll regret it. This line of thought
>> suggests that the initial commands ought to match the existing
>> include/exclude switches, at least approximately.
>
> I agree that we want to have an actual config file that allows just
> about every pg_dump option. I'm also fine with saying that we don't
> have to implement that initially but the format should be one which can
> be extended to allow that.

I understand the desire to have a config file that may contain all pg_dump options, but I really don't see why we'd want to mix that with the file containing filter rules. I think those should be separate, one of the reasons being that I find it desirable to be able to "include" the filter rules into different pg_dump configs. That also means the format for the filter rules can be much simpler. It's also not clear to me whether the single-file approach would allow filtering not supported by an actual pg_dump option, for example.

>> Hence I suggest
>>
>> 	include table PATTERN
>> 	exclude table PATTERN
>>
>> which ends up being the above but with words not [+-].

Works for me.

> Which ends up inventing yet-another-file-format which people will end up
> writing generators and parsers for. Which is exactly what I was arguing
> we really should be trying to avoid doing.

People will have to write generators *in any case*, because how else would you use this? Unless we also provide tools to manipulate that file (which seems rather futile), they'll have to do that. Even if we used JSON/YAML/TOML/... they'd still need to deal with the semantics of the file format.

FWIW I don't understand why they would need to write parsers. That's something we'd need to do to process the file. I think the case when the filter file needs to be modified is rather rare - it certainly is not what the original use case Pavel tried to address needs. (I know that customer, and the filter would be generated and used for a single dump.)

My opinion is that the best solution (to make both generators and parsers simple) is to keep the format itself as simple as possible. Which is exactly why I'm arguing for only addressing the filtering, not trying to invent a "universal" pg_dump config file format.

> I definitely feel that we should have a way to allow anything that can
> be created as an object in the database to be explicitly included in the
> file and that means whatever we do need to be able to handle objects
> that have names that span multiple lines, etc. It's not clear how the
> above would. As I recall, the proposed patch didn't have anything for
> handling that, which was one of the issues I had with it and is why I
> bring it up again.

I really don't understand why you think the current format can't do escaping/quoting or handle names spanning multiple lines. The fact that the original patch did not handle that correctly is a bug, but it does not mean the format can't handle that.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
> On 13 Jul 2021, at 18:14, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:

> FWIW I don't understand why would they need to write parsers.

It's quite common to write unit tests for VM recipes/playbooks when using tools like Chef etc; parsing and checking the installed/generated files is part of that. This would be one very real use case for writing a parser.

> I think the case when the filter file needs to be modified is rather rare - it certainly is not what the original use case Pavel tried to address needs. (I know that customer and the filter would be generated and used for a single dump.)

I'm not convinced that basing design decisions on a single customer reference who only wants to use the code once is helpful. I hear what you're saying, but I think this will see more diverse use cases than what we can foresee here.

--
Daniel Gustafsson		https://vmware.com/
> On 13 Jul 2021, at 00:59, Alvaro Herrera <alvherre@2ndquadrant.com> wrote:
>
> On 2021-Jul-13, Tomas Vondra wrote:
>
>> I'm not going to fight against some sort of versioning, but I think keeping
>> the scope as narrow as possible would make it unnecessary. That is, let's
>> stick to the original goal to allow passing filtering rules that would not
>> fit on the command-line, and maybe let's make it a bit more flexible to
>> support other object types etc.
>>
>> IMHO the filtering rules are simple enough to not really need elaborate
>> versioning, and if a more advanced rule is proposed in the future it can be
>> supported in the existing format (extra field, ...).
>
> I don't understand why versioning is needed for this file. Surely we
> can just define some line-based grammar that's accepted by the current
> pg_dump[1] and that would satisfy the current need as well as allowing
> for extending the grammar in the future; even JSON or Windows-INI format
> (ugh?) if that's necessary to tailor the output file in some other way
> not covered by that.

I wasn't expressing myself very well; by "versioning" I mean a way to be able to add to/change/fix the format and still be able to deterministically parse it without having to resort to ugly heuristics and hacks. If that's achieved by an explicit version number, or if it's an inherent characteristic of the format, doesn't really matter (to me). My worry is that the very simple proposed format might not fit that bill, but since I don't know what the future of the feature might bring it's (mostly) a gut feeling.

--
Daniel Gustafsson		https://vmware.com/
Greetings,
On Tue, Jul 13, 2021 at 16:44 Daniel Gustafsson <daniel@yesql.se> wrote:
> On 13 Jul 2021, at 18:14, Tomas Vondra <tomas.vondra@enterprisedb.com> wrote:
> FWIW I don't understand why would they need to write parsers.
It's quite common to write unit tests for VM recipes/playbooks when using
tools like Chef etc, parsing and checking the installed/generated files is part
of that. This would be one very real use case for writing a parser.
Consider pgAdmin and the many other tools which essentially embed pg_dump and pg_restore. There’s no shortage of use cases for a variety of tools to be able to understand, read, parse, generate, rewrite, and probably do more, with such a pg_dump/restore config file.
> I think the case when the filter file needs to be modified is rather rare - it certainly is not what the original use case Pavel tried to address needs. (I know that customer and the filter would be generated and used for a single dump.)
I'm not convinced that basing design decisions on a single customer reference
who only want to use the code once is helpful.
Agreed.
Thanks,
Stephen
On 7/13/21 10:55 PM, Stephen Frost wrote:
> Greetings,
>
> On Tue, Jul 13, 2021 at 16:44 Daniel Gustafsson <daniel@yesql.se
> <mailto:daniel@yesql.se>> wrote:
>
>     > On 13 Jul 2021, at 18:14, Tomas Vondra
>     <tomas.vondra@enterprisedb.com
>     <mailto:tomas.vondra@enterprisedb.com>> wrote:
>
>     > FWIW I don't understand why would they need to write parsers.
>
>     It's quite common to write unit tests for VM recipes/playbooks when
>     using tools like Chef etc, parsing and checking the installed/generated
>     files is part of that. This would be one very real use case for writing a parser.
>
> Consider pgAdmin and the many other tools which essentially embed
> pg_dump and pg_restore. There’s no shortage of use cases for a variety
> of tools to be able to understand, read, parse, generate, rewrite, and
> probably do more, with such a pg_dump/restore config file.

Sure. Which is why I'm advocating for the simplest possible format (and not expanding the scope of this patch beyond filtering), because that makes this kind of processing simpler.

>     > I think the case when the filter file needs to be modified is
>     rather rare - it certainly is not what the original use case Pavel
>     tried to address needs. (I know that customer and the filter would
>     be generated and used for a single dump.)
>
>     I'm not convinced that basing design decisions on a single customer
>     reference who only wants to use the code once is helpful.
>
> Agreed.

I wasn't really basing this on a single customer - that was merely an example, of course. FWIW Justin Pryzby already stated that having to use some more complex format would likely mean they would not use the feature, so that's another data point to consider.

FWIW I believe it's clear what my opinions on this topic are. Repeating them seems a bit pointless, so I'll step aside and let this thread move forward in whatever direction.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
Greetings,

* Tomas Vondra (tomas.vondra@enterprisedb.com) wrote:
> On 7/13/21 10:55 PM, Stephen Frost wrote:
>> On Tue, Jul 13, 2021 at 16:44 Daniel Gustafsson <daniel@yesql.se
>> <mailto:daniel@yesql.se>> wrote:
>>
>>     > On 13 Jul 2021, at 18:14, Tomas Vondra
>>     <tomas.vondra@enterprisedb.com
>>     <mailto:tomas.vondra@enterprisedb.com>> wrote:
>>
>>     > FWIW I don't understand why would they need to write parsers.
>>
>>     It's quite common to write unit tests for VM recipes/playbooks when
>>     using tools like Chef etc, parsing and checking the installed/generated
>>     files is part of that. This would be one very real use case for writing a parser.
>>
>> Consider pgAdmin and the many other tools which essentially embed pg_dump
>> and pg_restore. There’s no shortage of use cases for a variety of tools
>> to be able to understand, read, parse, generate, rewrite, and probably do
>> more, with such a pg_dump/restore config file.
>
> Sure. Which is why I'm advocating for the simplest possible format (and not
> expanding the scope of this patch beyond filtering), because that makes this
> kind of processing simpler.

The simplest possible format isn't going to work with all the different
pg_dump options, and it still isn't going to be 'simple' since it needs
to work with the flexibility that we have in what we support for object
names, and it is still going to require people to write a new parser and
generator for it instead of using something existing.

I don't know that the options that I suggested previously would
definitely work or not, but they at least would allow other projects like
pgAdmin to leverage existing code for parsing and generating these
config files.

I'm not completely against inventing something new, but I'd really
prefer that we at least try to make something existing work first
before inventing something new that everyone is going to have to deal
with. If we do invent a new thing for $reasons, then we should really
look at what exists today and try to design it properly instead of just
throwing something together and formally documenting it, because it's
absolutely going to become a standard of sorts that people are going to
almost immediately write their own parsers/generators in various
languages for.

Thanks,

Stephen
Attachment
On 2021-Jul-13, Stephen Frost wrote:

> The simplest possible format isn't going to work with all the different
> pg_dump options and it still isn't going to be 'simple' since it needs
> to work with the flexibility that we have in what we support for object
> names,

That's fine. If people want a mechanism that allows changing the other
pg_dump options that are not related to object filtering, they can
implement a configuration file for that.

> and is still going to require people write a new parser and
> generator for it instead of using something existing.

Sure. That's not part of this patch.

> I don't know that the options that I suggested previously would
> definitely work or not but they at least would allow other projects like
> pgAdmin to leverage existing code for parsing and generating these
> config files.

Keep in mind that this patch is not intended to help pgAdmin
specifically. It would be great if pgAdmin uses the functionality
implemented here, but if they decide not to, that's not terrible. They
have survived decades without a pg_dump configuration file; they still
can.

There are several votes in this thread for pg_dump to gain functionality
to filter objects based on a simple specification -- particularly one
that can be written using shell pipelines. This patch gives it.

> I'm not completely against inventing something new, but I'd really
> prefer that we at least try to make something existing work first
> before inventing something new that everyone is going to have to deal
> with.

That was discussed upthread and led nowhere.

--
Álvaro Herrera       39°49'30"S 73°17'W  —  https://www.EnterpriseDB.com/
"Nunca se desea ardientemente lo que solo se desea por razón"
("One never ardently desires what one desires only by reason") (F. Alexandre)
Greetings,

* Alvaro Herrera (alvherre@2ndquadrant.com) wrote:
> On 2021-Jul-13, Stephen Frost wrote:
>> The simplest possible format isn't going to work with all the different
>> pg_dump options and it still isn't going to be 'simple' since it needs
>> to work with the flexibility that we have in what we support for object
>> names,
>
> That's fine. If people want a mechanism that allows changing the other
> pg_dump options that are not related to object filtering, they can
> implement a configuration file for that.

It's been said multiple times that people *do* want that and that they
want it to all be part of this one file, and specifically that they
don't want to end up with a file structure that actively works against
allowing other options to be added to it.

>> I don't know that the options that I suggested previously would
>> definitely work or not but they at least would allow other projects like
>> pgAdmin to leverage existing code for parsing and generating these
>> config files.
>
> Keep in mind that this patch is not intended to help pgAdmin
> specifically. It would be great if pgAdmin uses the functionality
> implemented here, but if they decide not to, that's not terrible. They
> have survived decades without a pg_dump configuration file; they still
> can.

The adding of a config file for pg_dump should specifically be looking
at pgAdmin as the exact use-case for having such a capability.

> There are several votes in this thread for pg_dump to gain functionality
> to filter objects based on a simple specification -- particularly one
> that can be written using shell pipelines. This patch gives it.

And several votes for having a config file that supports, or at least
can support in the future, the various options which pg_dump supports-
and active voices against having a new file format that doesn't allow
for that.

>> I'm not completely against inventing something new, but I'd really
>> prefer that we at least try to make something existing work first
>> before inventing something new that everyone is going to have to deal
>> with.
>
> That was discussed upthread and led nowhere.

You're right- no one followed up on that. Instead, one group continues
to push for 'simple' and to just accept what's been proposed, while
another group counters that we should be looking at the broader design
question and work towards a solution which will work for us down the
road, and not just right now.

One thing remains clear- there's no consensus here.

Thanks,

Stephen
Attachment
You're right- no one followed up on that. Instead, one group continues
to push for 'simple' and to just accept what's been proposed, while
another group counters that we should be looking at the broader design
question and work towards a solution which will work for us down the
road, and not just right now.
One thing remains clear- there's no consensus here.
I think there is some misunderstanding about the target of this patch, and I am afraid there cannot be consensus, because people are speaking about two very different features. It is not possible to push them into one thing; it cannot work, I am afraid.
1. The main target of this patch is to solve the problem with a too long command line for pg_dump when there are a lot of dumped objects. You need to call pg_dump only once to ensure the dump happens in one transaction. And sometimes it is not possible to use wildcards effectively, because the state of objects is in different databases. Increasing the command-line length limit is not safe, and there are other production issues.

In this case you need a very simple format, just because you want to use pg_dump in a pipe. This format should be line oriented, and usually it will contain just "dump this table, dump that table". Nothing else. Nobody will read this format, nobody will edit this format. Because the main platform for this format is probably the UNIX shell, the format should be simple. I really don't see any joy in generating JSON and parsing JSON later. These data will be processed locally. This is a single-purpose format, and it is not designed for holding configuration. For this purpose a complex format has no advantage. There is no problem with parsing JSON or other formats on the pg_dump side, but it is pretty hard to generate valid JSON from a bash script. For a UNIX shell we need the simplest possible format.

Theoretically this format (this file) can hold any pg_dump option, but for the usual streaming processing only the filter options will be there. Originally this feature had the name "filter file". There are a lot of examples of successful filter file formats in the UNIX world, and I think nobody doubts their sense and usability. Probably there is a consensus that filter files are not config files.
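For example, a minimal sketch of such a pipeline (illustrative database and table names, and assuming this patch's "-" convention for reading the filter rules from stdin):

psql -Atc "SELECT 'include table ' || quote_ident(relname)
           FROM pg_class WHERE relkind = 'r' AND relname ~ '^sales_2021'" mydb \
  | pg_dump --filter=- -Fc -f mydb.dump mydb

No quoting logic is needed on the shell side; quote_ident() on the server produces names the filter file should be able to consume directly.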
The format of the filter file can look like "+d tablename" or "include data tablename". If we find consensus that the filter file is a good thing, then the format design and implementation are easy work. It isn't a problem to invent comment lines.
2. It is true that there is only a small step from a filter file to an options file. I rewrote this patch in that direction. The advantage is universality - it can support any option without the necessity of modifying related code. Still this format is not difficult for producers, and it is simple to parse. Now, the format would be defined by the command-line format: "-t tablename" or "--table tablename" or "table tablename". There can be issues related to the different parsers in the shell and in the implemented code, but they can be solved. It isn't a problem to introduce comment lines. The big advantage is simplicity of usage and simplicity of implementation; moreover, the implementation is generic.
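As an illustration (hypothetical content; the rule spellings come from the variants mentioned above, paired with existing pg_dump switch names), such an options file might look like:

# one option per line, mirroring the command line
--table sales_2021
--exclude-table-data audit_log
--schema-only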
3. But the options file is just a small step from a config file. I can imagine somebody wanting to store a typical configuration (and usual options) for psql, pg_dump, pg_restore, pgAdmin, ... somewhere. Config files are very different creatures from filter files. Although they can be generated, they are usually edited and can be very complex. There can be shared parts for all applications, specific sections for psql, and specific sections for every database. Config files can be brutally complex. A simple text format is not good for this purpose. And some people prefer YAML, some people hate that format. Other people prefer XML or JSON or anything else. Sometimes the complexity of config files grows too big, and people prefer startup scripting.
Although there is an intersection between filter files and config files, I see very big differences in usage. Filter files are usually temporary, generated, and not shared. Config files are persistent, usually manually modified, and can be shared. The requirements are different, and the implementations should be different too. I don't propose any configuration-file related features, and my proposal doesn't block the introduction of a configuration file in any format in the future. I think these features are very different and should be implemented differently. A filter file or options file would be a pretty ugly config file, and a config file would be a pretty impractical filter file.
So can we talk about the implementation of a filter file or options file here, and talk about the implementation of config files in a separate topic? Without that, I am afraid there is no possibility of finding agreement and moving forward.
Regards
Pavel
Thanks,
Stephen
On 7/14/21 2:18 AM, Stephen Frost wrote:
> Greetings,
>
> * Alvaro Herrera (alvherre@2ndquadrant.com) wrote:
>> On 2021-Jul-13, Stephen Frost wrote:
>>> The simplest possible format isn't going to work with all the different
>>> pg_dump options and it still isn't going to be 'simple' since it needs
>>> to work with the flexibility that we have in what we support for object
>>> names,
>>
>> That's fine. If people want a mechanism that allows changing the other
>> pg_dump options that are not related to object filtering, they can
>> implement a configuration file for that.
>
> It's been said multiple times that people *do* want that and that they
> want it to all be part of this one file, and specifically that they
> don't want to end up with a file structure that actively works against
> allowing other options to be added to it.

I have no problem believing some people want to be able to specify pg_dump parameters in a file, similarly to IMPDP/EXPDP parameter files etc. That seems useful, but I doubt they considered the case with many filter rules ... which is what "my people" want.

Not sure how keeping the filter rules in a separate file (which I assume is what you mean by "file structure"), with a format tailored for filter rules, works *actively* against adding options to the "main" config. I'm not buying the argument that keeping some of the stuff in a separate file is an issue - plenty of established tools do that, the concept of "including" a config is not a radical new thing, and I don't expect we'd have many options supported by a file.

In any case, I think user input is important, but ultimately it's up to us to reconcile the conflicting requirements coming from various users and come up with a reasonable compromise design.

>>> I don't know that the options that I suggested previously would
>>> definitely work or not but they at least would allow other projects like
>>> pgAdmin to leverage existing code for parsing and generating these
>>> config files.
>>
>> Keep in mind that this patch is not intended to help pgAdmin
>> specifically. It would be great if pgAdmin uses the functionality
>> implemented here, but if they decide not to, that's not terrible. They
>> have survived decades without a pg_dump configuration file; they still
>> can.
>
> The adding of a config file for pg_dump should specifically be looking
> at pgAdmin as the exact use-case for having such a capability.
>
>> There are several votes in this thread for pg_dump to gain functionality
>> to filter objects based on a simple specification -- particularly one
>> that can be written using shell pipelines. This patch gives it.
>
> And several votes for having a config file that supports, or at least
> can support in the future, the various options which pg_dump supports-
> and active voices against having a new file format that doesn't allow
> for that.

IMHO the whole "problem" here stems from the question whether there should be a single universal pg_dump config file, containing everything including the filter rules. I'm of the opinion it's better to keep the filter rules separate, mainly because:

1) simplicity - Options (key/value) and filter rules (with more internal structure) seem quite different, and mixing them in the same file will just make the format more complex.

2) flexibility - Keeping the filter rules in a separate file makes it easier to reuse the same set of rules with different pg_dump configs, specified in (much smaller) config files.

So in principle, the "main" config could use e.g. TOML or whatever we find most suitable for this type of key/value config file (or we could just use the same format as for postgresql.conf et al). And the filter rules could use something as simple as CSV (yes, I know it's not great, but there's plenty of parsers, and it handles multi-line strings etc.).

>>> I'm not completely against inventing something new, but I'd really
>>> prefer that we at least try to make something existing work first
>>> before inventing something new that everyone is going to have to deal
>>> with.
>>
>> That was discussed upthread and led nowhere.
>
> You're right- no one followed up on that. Instead, one group continues
> to push for 'simple' and to just accept what's been proposed, while
> another group counters that we should be looking at the broader design
> question and work towards a solution which will work for us down the
> road, and not just right now.

I have quite thick skin, but I have to admit I rather dislike how this paints the people arguing for simplicity. IMO simplicity is a perfectly legitimate (and desirable) design feature, and simpler solutions often fare better in the long run. Yes, we need to look at the broader design, no doubt about that.

> One thing remains clear- there's no consensus here.

True.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
st 12. 5. 2021 v 8:22 odesílatel Pavel Stehule <pavel.stehule@gmail.com> napsal:
Hi
fresh rebase
Regards

Pavel
Attachment
Hi
út 13. 7. 2021 v 1:16 odesílatel Tom Lane <tgl@sss.pgh.pa.us> napsal:
Alvaro Herrera <alvherre@2ndquadrant.com> writes:
> [1] your proposal of "[+-] OBJTYPE OBJIDENT" plus empty lines allowed
> plus lines starting with # are comments, seems plenty. Any line not
> following that format would cause an error to be thrown.
I'd like to see some kind of keyword on each line, so that we could extend
the command set by adding new keywords. As this stands, I fear we'd end
up using random punctuation characters in place of [+-], which seems
pretty horrid from a readability standpoint.
I think that this file format should be designed with an eye to allowing
every, or at least most, pg_dump options to be written in the file rather
than on the command line. I don't say we have to *implement* that right
now; but if the format spec is incapable of being extended to meet
requests like that one, I think we'll regret it. This line of thought
suggests that the initial commands ought to match the existing
include/exclude switches, at least approximately.
Hence I suggest
include table PATTERN
exclude table PATTERN
which ends up being the above but with words not [+-].
Here is an updated implementation of the filter file that implements the syntax proposed by you.
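For example, a filter file in this syntax can look like the following (illustrative object names; the include/exclude keywords and the table/schema/foreign_data/data object types are the ones in the patch):

# dump two tables and everything in one schema, but skip the log table
include table sales_2021
include schema analytics
exclude table sales_2021_log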
Regards
Pavel
regards, tom lane
Attachment
> On 28 Jul 2021, at 09:28, Pavel Stehule <pavel.stehule@gmail.com> wrote:
> út 13. 7. 2021 v 1:16 odesílatel Tom Lane <tgl@sss.pgh.pa.us <mailto:tgl@sss.pgh.pa.us>> napsal:
> Hence I suggest
>
> include table PATTERN
> exclude table PATTERN
>
> which ends up being the above but with words not [+-].

One issue with this syntax is that the include keyword can be quite misleading,
as its semantic interpretation of "include table t" can be different from
"--table=t". The former is less clear about the fact that it means "exclude
all other tables than t" than the latter is. It can be solved with documentation,
but I think that needs to be made clearer.

> Here is an updated implementation of the filter file that implements the syntax proposed by you.

While it's not the format I would prefer, it does allow for most (all?) use
cases expressed in this thread with ample arm-twisting applied, so let's go ahead
from this point and see if we can agree on it (or a version of it).

A few notes on the patch after a first pass over it:

+(include|exclude)[table|schema|foreign_data|data] <replaceable class="parameter">objectname</replaceable>

Lacks whitespace between keyword and object type. Also, since these are
mandatory parameters, shouldn't they be within '{..}' ?

+	/* skip initial white spaces */
+	while (isblank(*ptr))
+		ptr += 1;

We don't trust isblank() as of 3fd5faed5 due to portability concerns; this
should probably use a version of the pg_isblank() we already have (and possibly
move that to src/common/string.c as there now are more consumers).

+static bool
+isblank_line(const char *line)

This could be replaced with a single call to strspn() as we already do for
parsing the TOC file.

+	/* when first char is hash, ignore whole line */
+	if (*str == '#')
+		continue;

I think we should strip leading whitespace before this to allow comment lines
to start with whitespace; it's easy enough and will make life easier for users.

+	pg_log_error("invalid format of filter file \"%s\": %s",
+				 filename,
+				 message);
+
+	fprintf(stderr, "%d: %s\n", lineno, line);

Can't we just include the lineno in the error logging and skip dumping the
offending line? Fast-forwarding the pointer to print the offending part is
less useful than a context marker, and in some cases suboptimal. With this
coding, if a pattern is omitted for example the below error message is given:

pg_dump: error: invalid format of filter file "filter.txt": missing object name
1:

The error message and the line number in the file should be enough for the user
to figure out what to fix.

+	if (keyword && is_keyword(keyword, size, "table"))
+		objecttype = 't';

Should this use an enum, or at least a struct translating the literal keyword
to the internal representation? Magic constants without explicit connection to
their token counterparts can easily be a cause of bugs.

If I create a table called "a\nb" and try to dump it I get an error in parsing
the file. Isn't this supposed to work?

$ cat filter.txt
include table "a
b"
$ ./bin/pg_dump --filter=filter.txt
pg_dump: error: invalid format of filter file "filter.txt": unexpected chars after object name
2:

Did you consider implementing this in Bison to abstract some of the messier
parsing logic?

--
Daniel Gustafsson		https://vmware.com/
On Wed, Jul 28, 2021 at 09:28:17AM +0200, Pavel Stehule wrote:
> Here is an updated implementation of the filter file that implements the
> syntax proposed by you.

Thanks. If there's any traction for this approach, I have some comments for
the next revision.

> +++ b/doc/src/sgml/ref/pg_dump.sgml
> @@ -789,6 +789,56 @@ PostgreSQL documentation
>        </listitem>
>       </varlistentry>
>
> +     <varlistentry>
> +      <term><option>--filter=<replaceable class="parameter">filename</replaceable></option></term>
> +      <listitem>
> +       <para>
> +        Read objects filters from the specified file.
> +        If you use "-" as a filename, the filters are read from stdin.

Say 'Specify "-" to read from stdin'

> +        The lines starting with symbol <literal>#</literal> are ignored.

Remove "The" and "symbol"

> +        Previous white chars (spaces, tabs) are not allowed. These

Preceding whitespace characters... But actually, they are allowed? But if it
needs to be explained, maybe they shouldn't be - I don't see the utility of it.

> +static bool
> +isblank_line(const char *line)
> +{
> +	while (*line)
> +	{
> +		if (!isblank(*line++))
> +			return false;
> +	}
> +
> +	return true;
> +}

I don't think this requires nor justifies having a separate function.
Either don't support blank lines, or use get_keyword() with size==0 for that?

> +	/* Now we expect sequence of two keywords */
> +	if (keyword && is_keyword(keyword, size, "include"))
> +		is_include = true;
> +	else if (keyword && is_keyword(keyword, size, "exclude"))
> +		is_include = false;
> +	else

I think this should first check "if keyword == NULL". That could give a more
specific error message like "no keyword found",

> +		exit_invalid_filter_format(fp,
> +								   filename,
> +								   "expected keyword \"include\" or \"exclude\"",
> +								   line.data,
> +								   lineno);

..and then this one can say "invalid keyword".

--
Justin
Hi
po 13. 9. 2021 v 15:01 odesílatel Daniel Gustafsson <daniel@yesql.se> napsal:
> On 28 Jul 2021, at 09:28, Pavel Stehule <pavel.stehule@gmail.com> wrote:
> út 13. 7. 2021 v 1:16 odesílatel Tom Lane <tgl@sss.pgh.pa.us <mailto:tgl@sss.pgh.pa.us>> napsal:
> Hence I suggest
>
> include table PATTERN
> exclude table PATTERN
>
> which ends up being the above but with words not [+-].
One issue with this syntax is that the include keyword can be quite misleading
as its semantic interpretation of "include table t" can be different from
"--table=t". The former is less clear about the fact that it means "exclude
all other tables than t" than the latter is. It can be solved with documentation,
but I think that needs to be made clearer.
I welcome any documentation enhancements and fixes.
> Here is an updated implementation of filter's file, that implements syntax proposed by you.
While it's not the format I would prefer, it does allow for most (all?) use
cases expressed in this thread with ample arm-twisting applied, so let's go ahead
from this point and see if we can agree on it (or a version of it).
A few notes on the patch after a first pass over it:
+(include|exclude)[table|schema|foreign_data|data] <replaceable class="parameter">objectname</replaceable>
Lacks whitespace between keyword and object type. Also, since these are
mandatory parameters, shouldn't they be within '{..}' ?
yes, fixed
+ /* skip initial white spaces */
+ while (isblank(*ptr))
+ ptr += 1;
We don't trust isblank() as of 3fd5faed5 due to portability concerns, this
should probably use a version of the pg_isblank() we already have (and possibly
move that to src/common/string.c as there now are more consumers).
I rewrote this part, and I no longer use the isblank function at all.
+static bool
+isblank_line(const char *line)
This could be replaced with a single call to strspn() as we already do for
parsing the TOC file.
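(For reference, a strspn()-based check along those lines, as a sketch assuming the line buffer is NUL-terminated and that <string.h> and <stdbool.h> are included, could be:

static bool
is_blank_line(const char *line)
{
	/* blank means nothing but spaces, tabs and line ends */
	return line[strspn(line, " \t\r\n")] == '\0';
}
)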
+ /* when first char is hash, ignore whole line */
+ if (*str == '#')
+ continue;
I think we should strip leading whitespace before this to allow comment lines to
start with whitespace, it's easy enough and will make life easier for users.
Now a comment can start at the first non-blank character, or follow a filter rule.
+ pg_log_error("invalid format of filter file \"%s\": %s",
+ filename,
+ message);
+
+ fprintf(stderr, "%d: %s\n", lineno, line);
Can't we just include the lineno in the error logging and skip dumping the
offending line? Fast-forwarding the pointer to print the offending part is
less useful than a context marker, and in some cases suboptimal. With this
coding, if a pattern is omitted for example the below error message is given:
pg_dump: error: invalid format of filter file "filter.txt": missing object name
1:
The error message and the line number in the file should be enough for the user
to figure out what to fix.
I did it like you proposed, but I still think the content can be useful. Often you read dynamically generated files, or you read data from stdin, and in complex environments it can be hard to regenerate the content for debugging.
+ if (keyword && is_keyword(keyword, size, "table"))
+ objecttype = 't';
Should this use an enum, or at least a struct translating the literal keyword
to the internal representation? Magic constants without explicit connection to
their token counterparts can easily be a cause of bugs.
fixed
If I create a table called "a\nb" and try to dump it I get an error in parsing
the file. Isn't this supposed to work?
$ cat filter.txt
include table "a
b"
$ ./bin/pg_dump --filter=filter.txt
pg_dump: error: invalid format of filter file "filter.txt": unexpected chars after object name
2:
Probably there was some issue, because it should work. I tested the new version, and this case is covered by new regression tests. Please check.
Did you consider implementing this in Bison to abstract some of the messier
parsing logic?
Initially not, but now that I am thinking about it, I don't think Bison helps. The syntax of the filter file is nicely linear. The code of the parser is a little bit larger than minimalistic, but that is due to nicer error messages. A raw implementation in Bison would raise just "syntax error" and positions. I did some code refactoring, and now the scanning, parsing and processing are divided into separate routines. The parsing-related code is about 90 lines. In this case, I don't think a parser grammar file can carry any benefit. A grammar is more readable, sure, but we would need to include Bison, we would need to handle errors, and if we want to raise more helpful errors than just "syntax error", then the code will be longer.
Please check the attached patch.
Regards
Pavel
--
Daniel Gustafsson https://vmware.com/
Attachment
po 13. 9. 2021 v 15:11 odesílatel Justin Pryzby <pryzby@telsasoft.com> napsal:
On Wed, Jul 28, 2021 at 09:28:17AM +0200, Pavel Stehule wrote:
> Here is an updated implementation of filter's file, that implements syntax
> proposed by you.
Thanks.
If there's any traction for this approach, I have some comments for the next
revision.
> +++ b/doc/src/sgml/ref/pg_dump.sgml
> @@ -789,6 +789,56 @@ PostgreSQL documentation
> </listitem>
> </varlistentry>
>
> + <varlistentry>
> + <term><option>--filter=<replaceable class="parameter">filename</replaceable></option></term>
> + <listitem>
> + <para>
> + Read objects filters from the specified file.
> + If you use "-" as a filename, the filters are read from stdin.
Say 'Specify "-" to read from stdin'
> + The lines starting with symbol <literal>#</literal> are ignored.
Remove "The" and "symbol"
> + Previous white chars (spaces, tabs) are not allowed. These
Preceding whitespace characters...
But actually, they are allowed? But if it needs to be explained, maybe they
shouldn't be - I don't see the utility of it.
> +static bool
> +isblank_line(const char *line)
> +{
> + while (*line)
> + {
> + if (!isblank(*line++))
> + return false;
> + }
> +
> + return true;
> +}
I don't think this requires nor justifies having a separate function.
Either don't support blank lines, or use get_keyword() with size==0 for that ?
> + /* Now we expect sequence of two keywords */
> + if (keyword && is_keyword(keyword, size, "include"))
> + is_include = true;
> + else if (keyword && is_keyword(keyword, size, "exclude"))
> + is_include = false;
> + else
I think this should first check "if keyword == NULL".
That could give a more specific error message like "no keyword found",
> + exit_invalid_filter_format(fp,
> + filename,
> + "expected keyword \"include\" or \"exclude\"",
> + line.data,
> + lineno);
..and then this one can say "invalid keyword".
I fixed (I hope) the mentioned issues. Please check the last patch.
Regards
Pavel
--
Justin
Hi
In yesterday's patch I used strndup, which is not available on Windows. I am sending an update that uses pnstrdup instead.
Regards
Pavel
Attachment
As there have been a lot of differing opinions raised in this thread, I re-read it and tried to summarize the discussion so far, to try and figure out where we agree and on what (and disagree) before we get deep into the technicalities wrt the current patch. If anyone feels I've misrepresented them below then I sincerely do apologize. If I missed a relevant viewpoint I also apologize; I've tried to objectively represent the thread.

I proposed JSON in [0], which is where the format discussion to some extent started; Justin and Pavel had up until that point discussed the format by refining the original proposal.

In [1] Surafel Temesgen brought up --exclude-database from pg_dumpall and --no-comments, and argued for them being handled by this patch. This was objected against on the grounds that pg_dumpall is out of scope, and that all-or-nothing switches are not applicable in a filter option.

Stephen objected to both the proposed format, and the suggestion of JSON, in [2] and argued for a more holistic configuration file approach. TOML was suggested. Dean then +1'd the config file approach in [3].

In [4] Tom supported the idea of a more generic config file, and remarked that the proposed filter for table names only makes sense when the number of exclude patterns is large enough that we might hit other problems in pg_dump. Further, in [5] Tom commented that a format with established quoting conventions would buy us not having to invent our own to cope with complicated relation names.

The fact that JSON doesn't support comments is brought up in a few emails and is a very valid point, as the need for comments regardless of format is brought up as well.

Tomas Vondra in [6] wanted the object filter to be a separate file from a config file, and argued for a simpler format for these lists (while still supporting multiple object types). Alvaro agreed with Tomas on [+-] OBJTYPE OBJIDENT in [7], and Tom extended the proposal to use [include/exclude] keywords in [8] in order to support more than just excluding and including. Regardless of stance on format, the use of keywords instead of [+-] is a rare point of consensus in this thread.

Stephen and myself have also expressed concern in various parts of the thread that inventing our own format, rather than using something with existing broad library support, will end up with third parties (like pgAdmin et al) having to all write their own generators and parsers.

A main concern among most (all?) participants of the thread, regardless of format supported, is that quoting is hard and must be done right for all object names postgres supports (including any not currently in scope for this patch).

Below is an attempt at summarizing and grouping the proposals so far into the set of ideas presented:

A) A keyword+object based format to invoke with a switch, to essentially
allow for more filters than the commandline can handle and nothing more.
After a set of revisions, the current proposal is:

   [include|exclude] [<objtype>] [<objident>]

B) A format similar to (A) which can also be used for pg_dump configuration

C) The format in (A), or a close variant thereof, with the intention of it
being included in/referred to from a future configuration file of currently
unknown format. One reference being a .gitignore type file.

D) An existing format (JSON and TOML have been suggested, with JSON
being dismissed due to lack of comment support) which has quoting
conventions that support postgres' object names and which can be used to
define a full pg_dump configuration file syntax.
For B), C) and D) there is implicit consensus in the thread that we don't need to implement the full configuration file as of this patch, merely that it *must* be possible to do so without having to paint ourselves into a corner.

At this point it seems to me that B) and C) have the broadest support. Can the C) option represent the compromise between a "simple" format for object filtering and a more structured format for configuration? Are there other options? Thoughts?

--
Daniel Gustafsson		https://vmware.com/

[0] https://postgr.es/m/F6674FF0-5800-4AED-9DC7-13C475707241@yesql.se
[1] https://postgr.es/m/CALAY4q9u30L7oGhbsfY3dPECQ8SrYa8YO=H-xOn5xWUeiEneeg@mail.gmail.com
[2] https://postgr.es/m/20201110200904.GU16415@tamriel.snowman.net
[3] https://postgr.es/m/CAEZATCVKMG7+b+_5tNwrNZ-aNDBy3=FMRNea2bO9O4qGcEvSTg@mail.gmail.com
[4] https://postgr.es/m/502641.1606334432@sss.pgh.pa.us
[5] https://postgr.es/m/619671.1606406538@sss.pgh.pa.us
[6] https://postgr.es/m/cb545d78-2dae-8d27-f062-822a07ca56cf@enterprisedb.com
[7] https://postgr.es/m/202107122259.n6o5uwb5erza@alvherre.pgsql
[8] https://postgr.es/m/3183720.1626131795@sss.pgh.pa.us
> On 15 Sep 2021, at 19:31, Pavel Stehule <pavel.stehule@gmail.com> wrote:
> po 13. 9. 2021 v 15:01 odesílatel Daniel Gustafsson <daniel@yesql.se <mailto:daniel@yesql.se>> napsal:
> One issue with this syntax is that the include keyword can be quite misleading,
> as its semantic interpretation of "include table t" can be different from
> "--table=t". The former is less clear about the fact that it means "exclude
> all other tables than t" than the latter is. It can be solved with documentation,
> but I think that needs to be made clearer.
>
> I welcome any documentation enhancements and fixes

Sure, that can be collaborated on. The gist is, though, that IMO the keywords in
the filter file aren't as clear on the side effects as the command line params,
even though they are equal in functionality.

> +	pg_log_error("invalid format of filter file \"%s\": %s",
> +				 filename,
> +				 message);
> +
> +	fprintf(stderr, "%d: %s\n", lineno, line);
>
> Can't we just include the lineno in the error logging and skip dumping the
> offending line? Fast-forwarding the pointer to print the offending part is
> less useful than a context marker, and in some cases suboptimal. With this
> coding, if a pattern is omitted for example the below error message is given:
>
> pg_dump: error: invalid format of filter file "filter.txt": missing object name
> 1:
>
> The error message and the line number in the file should be enough for the user
> to figure out what to fix.
>
> I did it like you proposed, but still, I think the content can be useful.

Not when there is no content in the error message; printing an empty string for
a line number which isn't a blank line doesn't seem terribly helpful. If we
know the error context is empty, printing a tailored error hint seems more
useful for the user.

> More times you read dynamically generated files, or you read data from stdin,
> and in complex environments it can be hard to regenerate new content for debugging.

That seems odd, given that the arguments for this format have been that it's
likely to be handwritten.

> If I create a table called "a\nb" and try to dump it I get an error in parsing
> the file. Isn't this supposed to work?
> $ cat filter.txt
> include table "a
> b"
> $ ./bin/pg_dump --filter=filter.txt
> pg_dump: error: invalid format of filter file "filter.txt": unexpected chars after object name
> 2:
>
> probably there was some issue, because it should work. I tested the new version
> and this case is covered by new regression tests. Please check.

That seems to work, but I am unable to write a filter statement which can
handle this relname:

CREATE TABLE "a""
""b" (a integer);

Are you able to craft one for that?

> Did you consider implementing this in Bison to abstract some of the messier
> parsing logic?
>
> Initially not, but now that I am thinking about it, I don't think Bison helps.
> The syntax of the filter file is nicely linear. The code of the parser is a
> little bit larger than minimalistic, but that is due to nicer error messages.
> A raw implementation in Bison would raise just "syntax error" and positions.
> I did some code refactoring, and now the scanning, parsing and processing are
> divided into separate routines. The parsing-related code is about 90 lines.
> In this case, I don't think a parser grammar file can carry any benefit. A
> grammar is more readable, sure, but we would need to include Bison, we would
> need to handle errors, and if we want to raise more helpful errors than just
> "syntax error", then the code will be longer.

I'm not so concerned by code size, but rather by the parsing of quotations etc
and being able to reason about its correctness. IMHO that's easier done by
reading a defined grammar than by reading a handwritten parser.

Will do a closer review on the patch shortly.

--
Daniel Gustafsson		https://vmware.com/
Hi
A main concern among most (all?) participants of the thread, regardless of
format supported, is that quoting is hard and must be done right for all object
names postgres supports (including any not currently in scope for this patch).
Just a small note - when quoting is accounted for in the design, the implementation is easy. I am sure my latest code covers all the possibilities, and it is about 100 lines of code.
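For illustration, a minimal sketch of the core of such quote handling (my own hypothetical example, not the code from the patch): it reads a double-quoted identifier, treats a doubled quote as a literal quote, and lets every other character, newlines included, pass through.

#include <stdlib.h>
#include <string.h>

/*
 * Read a quoted identifier starting just past the opening '"'.
 * A doubled quote ("") stands for a literal '"'; everything else,
 * including newlines, is copied verbatim.  On success returns a
 * malloc'd string and advances *src past the closing quote; returns
 * NULL when the input ends before the closing quote.
 */
static char *
read_quoted_ident(const char **src)
{
	const char *p = *src;
	char	   *out = malloc(strlen(p) + 1);
	size_t		len = 0;

	if (out == NULL)
		return NULL;

	while (*p != '\0')
	{
		if (*p == '"')
		{
			if (p[1] == '"')	/* escaped quote */
			{
				out[len++] = '"';
				p += 2;
				continue;
			}
			out[len] = '\0';	/* closing quote */
			*src = p + 1;
			return out;
		}
		out[len++] = *p++;
	}

	free(out);					/* unterminated identifier */
	return NULL;
}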
Below is an attempt at summarizing and grouping the proposals so far into the
set of ideas presented:
A) A keyword+object based format to invoke with a switch to essentially
allow for more filters than the commandline can handle and nothing more.
After a set of revisions, the current proposal is:
[include|exclude] [<objtype>] [<objident>]
B) A format similar to (A) which can also be used for pg_dump configuration
C) The format in (A), or a close variant thereof, with the intention of it
being included in/referred to from a future configuration file of currently
unknown format. One reference being a .gitignore type file.
D) An existing format (JSON and TOML have been suggested, with JSON
being dismissed due to lack of comment support) which has quoting
conventions that supports postgres' object names and which can be used to
define a full pg_dump configuration file syntax.
For B), C) and D) there is implicit consensus in the thread that we don't need
to implement the full configuration file as of this patch, merely that it
*must* be possible to do so without having to paint ourselves into a corner.
At this point it seems to me that B) and C) have the broadest support. Can the
C) option represent the compromise between a "simple" format for object
filtering and a more structured format for configuration? Are there other
options?
What would be the benefit of this variant?
Regards
Pavel
Thoughts?
--
Daniel Gustafsson https://vmware.com/
[0] https://postgr.es/m/F6674FF0-5800-4AED-9DC7-13C475707241@yesql.se
[1] https://postgr.es/m/CALAY4q9u30L7oGhbsfY3dPECQ8SrYa8YO=H-xOn5xWUeiEneeg@mail.gmail.com
[2] https://postgr.es/m/20201110200904.GU16415@tamriel.snowman.net
[3] https://postgr.es/m/CAEZATCVKMG7+b+_5tNwrNZ-aNDBy3=FMRNea2bO9O4qGcEvSTg@mail.gmail.com
[4] https://postgr.es/m/502641.1606334432@sss.pgh.pa.us
[5] https://postgr.es/m/619671.1606406538@sss.pgh.pa.us
[6] https://postgr.es/m/cb545d78-2dae-8d27-f062-822a07ca56cf@enterprisedb.com
[7] https://postgr.es/m/202107122259.n6o5uwb5erza@alvherre.pgsql
[8] https://postgr.es/m/3183720.1626131795@sss.pgh.pa.us
pá 17. 9. 2021 v 13:42 odesílatel Daniel Gustafsson <daniel@yesql.se> napsal:
> On 15 Sep 2021, at 19:31, Pavel Stehule <pavel.stehule@gmail.com> wrote:
> po 13. 9. 2021 v 15:01 odesílatel Daniel Gustafsson <daniel@yesql.se <mailto:daniel@yesql.se>> napsal:
> One issue with this syntax is that the include keyword can be quite misleading
> as its semantic interpretation of "include table t" can be different from
> "--table=t". The former is less clear about the fact that it means "exclude
> all other tables than t" than the latter is. It can be solved with documentation,
> but I think that needs to be made clearer.
>
> I welcome any documentation enhancements and fixes
Sure, that can be collaborated on. The gist is, though, that IMO the keywords in
the filter file aren't as clear on the side effects as the command line params,
even though they are equal in functionality.
> + pg_log_error("invalid format of filter file \"%s\": %s",
> + filename,
> + message);
> +
> + fprintf(stderr, "%d: %s\n", lineno, line);
> Can't we just include the lineno in the error logging and skip dumping the
> offending line? Fast-forwarding the pointer to print the offending part is
> less useful than a context marker, and in some cases suboptimal. With this
> coding, if a pattern is omitted for example the below error message is given:
>
> pg_dump: error: invalid format of filter file "filter.txt": missing object name
> 1:
>
> The error message and the line number in the file should be enough for the user
> to figure out what to fix.
>
> I did it like you proposed, but still, I think the content can be useful.
Not when there is no content in the error message, printing an empty string for
a line number which isn't a blank line doesn't seem terribly helpful. If we
know the error context is empty, printing a tailored error hint seems more
useful for the user.
> More times you read dynamically generated files, or you read data from stdin, and in complex environments it can be hard to regenerate new content for debugging.
That seems odd, given that the arguments for this format have been that it's
likely to be handwritten.
> If I create a table called "a\nb" and try to dump it I get an error in parsing
> the file. Isn't this supposed to work?
> $ cat filter.txt
> include table "a
> b"
> $ ./bin/pg_dump --filter=filter.txt
> pg_dump: error: invalid format of filter file "filter.txt": unexpected chars after object name
> 2:
>
> probably there was some issue, because it should work. I tested a new version and this is tested in new regress tests. Please, check
That seems to work, but I am unable to write a filter statement which can
handle this relname:
CREATE TABLE "a""
""b" (a integer);
Are you able to craft one for that?
I am not able to dump this directly in pg_dump. Is it possible?
> Did you consider implementing this in Bison to abstract some of the messier
> parsing logic?
>
> Initially not, but now that I am thinking about it, I don't think Bison helps.
> The syntax of the filter file is nicely linear. The code of the parser is a
> little bit larger than minimalistic, but that is due to nicer error messages.
> A raw implementation in Bison would raise just "syntax error" and positions.
> I did some code refactoring, and now the scanning, parsing and processing are
> divided into separate routines. The parsing-related code is about 90 lines.
> In this case, I don't think a parser grammar file can carry any benefit. A
> grammar is more readable, sure, but we would need to include Bison, we would
> need to handle errors, and if we want to raise more helpful errors than just
> "syntax error", then the code will be longer.
I'm not so concerned by code size, but rather by the parsing of quotations etc and
being able to reason about its correctness. IMHO that's easier done by
reading a defined grammar than by reading a handwritten parser.
Will do a closer review on the patch shortly.
--
Daniel Gustafsson https://vmware.com/
pá 17. 9. 2021 v 13:56 odesílatel Daniel Gustafsson <daniel@yesql.se> napsal:
> On 17 Sep 2021, at 13:51, Pavel Stehule <pavel.stehule@gmail.com> wrote:
> pá 17. 9. 2021 v 13:42 odesílatel Daniel Gustafsson <daniel@yesql.se <mailto:daniel@yesql.se>> napsal:
> I am unable to write a filter statement which can
> handle this relname:
>
> CREATE TABLE "a""
> ""b" (a integer);
>
> Are you able to craft one for that?
>
> I am not able to dump this directly in pg_dump. Is it possible?
Sure, see below:
$ ./bin/psql filter
psql (15devel)
Type "help" for help.
I didn't ask about this.
I asked whether you can use -t with some pattern to filter this name?
filter=# create table "a""
filter"# ""b" (a integer);
CREATE TABLE
filter=# select relname from pg_class order by oid desc limit 1;
relname
---------
a" +
"b
(1 row)
filter=# ^D\q
$ ./bin/pg_dump -s filter
--
-- PostgreSQL database dump
--
-- Dumped from database version 15devel
-- Dumped by pg_dump version 15devel
SET statement_timeout = 0;
SET lock_timeout = 0;
SET idle_in_transaction_session_timeout = 0;
SET client_encoding = 'UTF8';
SET standard_conforming_strings = on;
SELECT pg_catalog.set_config('search_path', '', false);
SET check_function_bodies = false;
SET xmloption = content;
SET client_min_messages = warning;
SET row_security = off;
SET default_tablespace = '';
SET default_table_access_method = heap;
--
-- Name: a" "b; Type: TABLE; Schema: public; Owner: danielg
--
CREATE TABLE public."a""
""b" (
a integer
);
ALTER TABLE public."a""
""b" OWNER TO danielg;
--
-- PostgreSQL database dump complete
--
--
Daniel Gustafsson https://vmware.com/
Greetings,
On Fri, Sep 17, 2021 at 13:59 Pavel Stehule <pavel.stehule@gmail.com> wrote:
pá 17. 9. 2021 v 13:56 odesílatel Daniel Gustafsson <daniel@yesql.se> napsal:
> On 17 Sep 2021, at 13:51, Pavel Stehule <pavel.stehule@gmail.com> wrote:
> pá 17. 9. 2021 v 13:42 odesílatel Daniel Gustafsson <daniel@yesql.se <mailto:daniel@yesql.se>> napsal:
> I am unable to write a filter statement which can
> handle this relname:
>
> CREATE TABLE "a""
> ""b" (a integer);
>
> Are you able to craft one for that?
>
> I am not able to dump this directly in pg_dump. Is it possible?
Sure, see below:
$ ./bin/psql filter
psql (15devel)
Type "help" for help.I didn't ask on thisI asked if you can use -t and some for filtering this name?
For my part, at least, I don't see that this particularly matters: for a new feature that's being developed to allow users to export specific tables, I would think we'd want to support any table names which can exist.
Thanks,
Stephen
Greetings,
On Fri, Sep 17, 2021 at 14:07 Daniel Gustafsson <daniel@yesql.se> wrote:
> On 17 Sep 2021, at 13:59, Pavel Stehule <pavel.stehule@gmail.com> wrote:
> pá 17. 9. 2021 v 13:56 odesílatel Daniel Gustafsson <daniel@yesql.se <mailto:daniel@yesql.se>> napsal:
> > On 17 Sep 2021, at 13:51, Pavel Stehule <pavel.stehule@gmail.com <mailto:pavel.stehule@gmail.com>> wrote:
> > pá 17. 9. 2021 v 13:42 odesílatel Daniel Gustafsson <daniel@yesql.se <mailto:daniel@yesql.se> <mailto:daniel@yesql.se <mailto:daniel@yesql.se>>> napsal:
> > I am unable to write a filter statement which can
> > handle this relname:
> >
> > CREATE TABLE "a""
> > ""b" (a integer);
> >
> > Are you able to craft one for that?
> >
> > I am not able to dump this directly in pg_dump. Is it possible?
>
> Sure, see below:
>
> $ ./bin/psql filter
> psql (15devel)
> Type "help" for help.
>
> I didn't ask about this.
>
> I asked whether you can use -t with some pattern to filter this name?
I didn't try as I don't see how that's relevant? Surely we're not limiting the
capabilities of a filtering file format based on the quoting semantics of a
shell?
Yeah, agreed. I would think that a DBA might specifically want to be able to use a config file to get away from having to deal with shell quoting, in fact…
Thanks,
Stephen
pá 17. 9. 2021 v 14:07 odesílatel Daniel Gustafsson <daniel@yesql.se> napsal:
> On 17 Sep 2021, at 13:59, Pavel Stehule <pavel.stehule@gmail.com> wrote:
> pá 17. 9. 2021 v 13:56 odesílatel Daniel Gustafsson <daniel@yesql.se <mailto:daniel@yesql.se>> napsal:
> > On 17 Sep 2021, at 13:51, Pavel Stehule <pavel.stehule@gmail.com <mailto:pavel.stehule@gmail.com>> wrote:
> > pá 17. 9. 2021 v 13:42 odesílatel Daniel Gustafsson <daniel@yesql.se <mailto:daniel@yesql.se> <mailto:daniel@yesql.se <mailto:daniel@yesql.se>>> napsal:
> > I am unable to write a filter statement which can
> > handle this relname:
> >
> > CREATE TABLE "a""
> > ""b" (a integer);
> >
> > Are you able to craft one for that?
> >
> > I am not able to dump this directly in pg_dump. Is it possible?
>
> Sure, see below:
>
> $ ./bin/psql filter
> psql (15devel)
> Type "help" for help.
>
> I didn't ask about this.
>
> I asked whether you can use -t with some pattern to filter this name?
I didn't try as I don't see how that's relevant? Surely we're not limiting the
capabilities of a filtering file format based on the quoting semantics of a
shell?
This patch just uses existing functionality, which can be buggy too.
But I had a bug in this part - if I detect a doubled double quote on input, I have to send a doubled double quote to the output too.
It should be fixed in the attached patch
[pavel@localhost pg_dump]$ echo 'include table "a""\n""b"' | ./pg_dump --filter=-
--
-- PostgreSQL database dump
--
-- Dumped from database version 15devel
-- Dumped by pg_dump version 15devel
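For illustration only, roughly the kind of loop that fix implies - a sketch with assumed names (ptr, pattern), not the patch's actual code: inside a quoted pattern a doubled quote must be copied through verbatim, so downstream identifier parsing still sees a valid escape.

    /* sketch: copy a quoted pattern, preserving "" escapes */
    while (*ptr != '\0')
    {
        if (*ptr == '"')
        {
            if (ptr[1] == '"')
            {
                appendPQExpBufferStr(pattern, "\"\"");  /* keep the escape */
                ptr += 2;
                continue;
            }
            break;              /* closing quote, pattern is complete */
        }
        appendPQExpBufferChar(pattern, *ptr++);
    }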
--
Daniel Gustafsson https://vmware.com/
Attachment
>
> Initially I did not, but now that I think about it, I don't think Bison helps. The syntax of the filter file is nicely linear. The code of the parser is now a little bit larger than minimalistic, but that is due to nicer error messages. A raw implementation in Bison would raise just "syntax error" and positions. I did some code refactoring, and now the scanning, parsing and processing are divided into separate routines. The parsing-related code is about 90 lines. In this case, I don't think a parser grammar file brings any benefit. A grammar is more readable, sure, but we would need to include Bison, we would need to handle errors, and if we want to raise more helpful errors than just "syntax error", the code will be longer.
I'm not so concerned by code size, but rather by the parsing of quotations etc and
being able to reason about its correctness. IMHO that's easier done by
reading a defined grammar than by reading a handwritten parser.
In this case the complex part is not the parser; the scanner is the complex part, and writing it in flex is not too easy. As I wrote, the grammar file can be more readable, but the usual error from Bison is just "syntax error" and a position, so it does not win from the user's perspective. When a grammar is not linear, a generated parser can help a lot, but using one at this moment is premature.
Will do a closer review on the patch shortly.
--
Daniel Gustafsson https://vmware.com/
po 20. 9. 2021 v 14:10 odesílatel Daniel Gustafsson <daniel@yesql.se> napsal:
> Will do a closer review on the patch shortly.
Had a read through, and tested, the latest posted version today:
+ Read objects filters from the specified file. Specify "-" to read from
+ stdin. Lines of this file must have the following format:
I think this should be <filename>-</filename> and <literal>STDIN</literal> to
match the rest of the docs.
+ <para>
+ With the following filter file, the dump would include table
+ <literal>mytable1</literal> and data from foreign tables of
+ <literal>some_foreign_server</literal> foreign server, but exclude data
+ from table <literal>mytable2</literal>.
+<programlisting>
+include table mytable1
+include foreign_data some_foreign_server
+exclude table mytable2
+</programlisting>
+ </para>
This example is highlighting the issue I've previously raised with the UX/doc
of this feature. The "exclude table mytable2" is totally pointless in the
above since the exact match of "mytable1" will remove all other objects. What
we should be doing instead is use the pattern matching aspect along the lines
of the below:
include table mytable*
exclude table mytable2
+ The <option>--filter</option> option works just like the other
+ options to include or exclude tables, schemas, table data, or foreign
This should refer to the actual options by name to make it clear which we are
talking about.
fixed
+ printf(_(" --filter=FILENAME dump objects and data based on the filter expressions\n"
+ " from the filter file\n"));
Before we settle on --filter I think we need to conclude whether this file is
intended to be included from a config file, or used on its own. If we go with
the former then we might not want a separate option for just --filter.
I prefer to keep two separate files. Although there is some intersection, I think it is good to have two simple separate files for two really different tasks.
This one does filtering, and it should be controlled by the option "--filter". If the implementation changes, then this option can be changed too.
Filtering is just a pg_dump related feature. Revising client application configuration is a much more generic task, and if we mix the two into one, we can be
in a trap. It can be hard to find one good format for both large script-generated content and possibly hand-written structured content. For practical
reasons it can be good to have two files too. Filters and configurations can have different life cycles.
+ if (filter_is_keyword(keyword, size, "include"))
I would prefer if this function call was replaced by just the pg_strcasecmp()
call in filter_is_keyword() and the strlen optimization there removed. This is
not a hot-path, we can afford the string comparison in case of errors. Having
the string comparison done inline here will improve readability, saving the
reader from jumping to another function to see what it does.
I agree that this is not a hot path; I just don't feel comfortable making a zero-terminated string only to compare it with pg_strcasecmp. The current design reduces malloc/free cycles. The same approach is used in other places where Postgres parses strings - the SQL parser, the plpgsql parser. I am not sure about the benefits and costs - pg_strcasecmp can be more readable, but for every keyword I would have to call pstrdup and pfree. Is it necessary? My opinion on this part is not too strong - it is a minor issue; maybe I just weigh the benefits and costs a little differently in this specific case, and if you really think the benefits of rewriting are higher, I'll do it.
+ initStringInfo(&line);
Why is this using a StringInfo rather than a PQExpBuffer as the rest of pg_dump
does?
The StringInfo is used because I use the pg_get_line_buf function, and that function uses this API.
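For reference, a minimal sketch of that reading pattern, assuming an already-opened FILE *fp (not the patch's exact code):

    #include "common/string.h"      /* pg_get_line_buf, pg_strip_crlf */
    #include "lib/stringinfo.h"

    StringInfoData line;

    initStringInfo(&line);
    while (pg_get_line_buf(fp, &line))
    {
        /* pg_get_line_buf keeps the trailing newline; strip it */
        line.len = pg_strip_crlf(line.data);
        /* ... parse line.data here ... */
    }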
+typedef struct
I think these should be at the top of the file with the other typedefs.
done
When testing strange object names, I was unable to express this name in the filter file:
$ ./bin/psql
psql (15devel)
Type "help" for help.
danielg=# create table "
danielg"# t
danielg"# t
danielg"# " (a integer);
CREATE TABLE
danielg=# select relname from pg_class order by oid desc limit 1;
relname
---------
+
t +
t +
(1 row)
Good catch - the pg_strip_crlf call was badly placed; fixed, and the regression tests are enhanced.
Please check the attached patch
Regards
Pavel
Attachment
> On 21 Sep 2021, at 08:50, Pavel Stehule <pavel.stehule@gmail.com> wrote:
> po 20. 9. 2021 v 14:10 odesílatel Daniel Gustafsson <daniel@yesql.se> napsal:
>
> + printf(_(" --filter=FILENAME dump objects and data based on the filter expressions\n"
> + " from the filter file\n"));
> Before we settle on --filter I think we need to conclude whether this file is
> intended to be included from a config file, or used on its own. If we go with
> the former then we might not want a separate option for just --filter.
>
> I prefer to keep two separate files. Although there is some intersection, I think it is good to have two simple separate files for two really different tasks.
> [...]

I'm not convinced that we can/should change or remove a commandline parameter
in a coming version when there might be scripts expecting it to work in a
specific way. Having a --filter as well as a --config where the configfile can
refer to the filterfile also passed via --filter sounds like a problem waiting
to happen, so I think we need to settle how we want to interact with this file
before anything goes in.

Any thoughts from those in the thread who have had strong opinions on config
files etc?

> + if (filter_is_keyword(keyword, size, "include"))
> I would prefer if this function call was replaced by just the pg_strcasecmp()
> call in filter_is_keyword() and the strlen optimization there removed. This is
> not a hot-path, we can afford the string comparison in case of errors.
>
> I agree that this is not a hot path; I just don't feel comfortable making a zero-terminated string only to compare it with pg_strcasecmp. [...]

Sorry, I typoed my response. What I meant was to move the pg_strncasecmp call
inline and not do the strlen check, to save readers from jumping around. So
basically end up with the below in read_filter_item():

+ /* Now we expect sequence of two keywords */
+ if (pg_strncasecmp(keyword, "include", size) == 0)
+     *is_include = true;

> + initStringInfo(&line);
> Why is this using a StringInfo rather than a PQExpBuffer as the rest of pg_dump
> does?
>
> The StringInfo is used because I use the pg_get_line_buf function, and that function uses this API.

Ah, of course.

A few other comments from another pass over this:

+ exit_nicely(-1);

Why -1? pg_dump (and all other binaries) exits with 1 on IMO even more serious
errors, so I think this should use 1 as well.

+ if (!pg_get_line_buf(fstate->fp, line))
+ {
+     if (ferror(fstate->fp))
+         fatal("could not read from file \"%s\": %m", fstate->filename);
+
+     exit_invalid_filter_format(fstate, "unexpected end of file");
+ }

In the ferror() case this codepath isn't running fclose() on the file pointer
(unless stdin) which we do elsewhere, so this should use pg_log_error and
exit_nicely instead.

+ pg_log_fatal("could not read from file \"%s\": %m", fstate->filename);

Based on how other errors are treated in pg_dump I think this should be
downgraded to a pg_log_error.

The above comments are fixed in the attached, as well as a pass over the docs
and extended tests to actually test matching a foreign server. What do you
think about this version? I'm still not convinced that there aren't more
quoting bugs in the parser, but I've left that intact for now.

--
Daniel Gustafsson https://vmware.com/
Attachment
út 21. 9. 2021 v 14:37 odesílatel Daniel Gustafsson <daniel@yesql.se> napsal:
> On 21 Sep 2021, at 08:50, Pavel Stehule <pavel.stehule@gmail.com> wrote:
> po 20. 9. 2021 v 14:10 odesílatel Daniel Gustafsson <daniel@yesql.se <mailto:daniel@yesql.se>> napsal:
> + printf(_(" --filter=FILENAME dump objects and data based on the filter expressions\n"
> + " from the filter file\n"));
> Before we settle on --filter I think we need to conclude whether this file is
> intended to be included from a config file, or used on its own. If we go with
> the former then we might not want a separate option for just --filter.
>
> I prefer to separate two files. Although there is some intersection, I think it is good to have two simple separate files for two really different tasks.
> It does filtering, and it should be controlled by option "--filter". When the implementation will be changed, then this option can be changed too.
> Filtering is just a pg_dump related feature. Revision of client application configuration is a much more generic task, and if we mix it to one, we can be
> in a trap. It can be hard to find one good format for large script generated content, and possibly hand written structured content. For practical
> reasons it can be good to have two files too. Filters and configurations can have different life cycles.
I'm not convinced that we can/should change or remove a commandline parameter
in a coming version when there might be scripts expecting it to work in a
specific way. Having a --filter as well as a --config where the configfile can
refer to the filterfile also passed via --filter sounds like problem waiting to
happen, so I think we need to settle how we want to interact with this file
before anything goes in.
Any thoughts from those in the thread who have had strong opinions on config
files etc?
> + if (filter_is_keyword(keyword, size, "include"))
> I would prefer if this function call was replaced by just the pg_strcasecmp()
> call in filter_is_keyword() and the strlen optimization there removed. This is
> not a hot-path, we can afford the string comparison in case of errors. Having
> the string comparison done inline here will improve readability, saving the
> reader from jumping to another function to see what it does.
>
> I agree that this is not a hot path; I just don't feel comfortable making a zero-terminated string only to compare it with pg_strcasecmp. The current design reduces malloc/free cycles. The same approach is used in other places where Postgres parses strings - the SQL parser, the plpgsql parser. I am not sure about the benefits and costs - pg_strcasecmp can be more readable, but for every keyword I would have to call pstrdup and pfree. Is it necessary? My opinion on this part is not too strong - it is a minor issue; maybe I just weigh the benefits and costs a little differently in this specific case, and if you really think the benefits of rewriting are higher, I'll do it.
Sorry, I typoed my response. What I meant was to move the pg_strncasecmp call
inline and not do the strlen check, to save readers from jumping around. So
basically end up with the below in read_filter_item():
+ /* Now we expect sequence of two keywords */
+ if (pg_strncasecmp(keyword, "include", size) == 0)
+ *is_include = true;
I don't think that is safe (strict). pg_strncasecmp(...) alone is also true for truncated keywords like "incl", "includ", ... You have to compare the size too.
Regards
Pavel
I definitely agree that we should have two files, one for config and another
one for filter, since their purposes are orthogonal and their formats are
likely different; trying to cram the filter specification in the config file
seems unfriendly because it'd force users to write the filter in whatever
alien grammar is used for the config file. Also, this would make it easier to
use a single config file with a bunch of different filter files.

On 2021-Sep-21, Daniel Gustafsson wrote:

> I'm not convinced that we can/should change or remove a commandline parameter
> in a coming version when there might be scripts expecting it to work in a
> specific way. Having a --filter as well as a --config where the configfile can
> refer to the filterfile also passed via --filter sounds like a problem waiting
> to happen, so I think we need to settle how we want to interact with this file
> before anything goes in.

I think both the filter and the hypothetical config file are going to
interact (be redundant) with almost all already existing switches, and
there's no need to talk about removing anything (e.g., nobody would
argue for the removal of "-t" even though that's redundant with the
filter file).

I see no problem with the config file specifying a filter file.

AFAICS if the config file specifies a filter and the user also specifies
a filter in the command line, we have two easy options: raise an error
about the redundant option, or have the command line option supersede
the one in the config file. The latter strikes me as the more useful
behavior, and it's in line with what other tools do in similar cases, so
that's what I propose doing.

(There might be less easy options too, such as somehow combining the two
filters, but offhand I don't see any reason why this is real-world
useful, so I don't propose doing that.)

--
Álvaro Herrera         Valdivia, Chile  —  https://www.EnterpriseDB.com/
"How amazing is that? I call it a night and come back to find that a bug has
been identified and patched while I sleep."  (Robert Davidson)
http://archives.postgresql.org/pgsql-sql/2006-03/msg00378.php
On 9/21/21 3:28 PM, Alvaro Herrera wrote:
> I definitely agree that we should have two files, one for config and
> another one for filter, since their purposes are orthogonal and their
> formats are likely different; [...]

+1, that is pretty much exactly what I argued for not too long ago.

> I think both the filter and the hypothetical config file are going to
> interact (be redundant) with almost all already existing switches, and
> there's no need to talk about removing anything (e.g., nobody would
> argue for the removal of "-t" even though that's redundant with the
> filter file).
>
> I see no problem with the config file specifying a filter file.
>
> AFAICS if the config file specifies a filter and the user also specifies
> a filter in the command line, we have two easy options: raise an error
> about the redundant option, or have the command line option supersede
> the one in the config file. The latter strikes me as the more useful
> behavior, and it's in line with what other tools do in similar cases, so
> that's what I propose doing.

Well, I think we already have to make decisions like that, because you can
do e.g. this:

    pg_dump -T t -t t

So we already combine the switches, and we do this:

    When both -t and -T are given, the behavior is to dump just the tables
    that match at least one -t switch but no -T switches. If -T appears
    without -t, then tables matching -T are excluded from what is otherwise
    a normal dump.

That seems fairly reasonable, and I don't see why not to use the same logic
for combining patterns no matter where we got them (filter file,
command-line option, etc.). Just combine everything, and then check whether
any "exclude" rule matches. If yes, we're done - exclude. If not, check
whether an "include" rule matches. If not, still exclude. Otherwise include.

Seems reasonable and consistent to me, and I don't see why not to allow
multiple --filter parameters.

regards

--
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
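For illustration only, that combining rule written as a predicate - names invented, mirroring the -t/-T behavior quoted above rather than any actual patch code:

    #include <stdbool.h>

    /* sketch: an exclude match always wins; an include match is only
     * required when any include rule exists at all, otherwise the
     * object is dumped by default */
    static bool
    object_is_dumped(bool matches_include, bool matches_exclude,
                     bool have_include_rules)
    {
        if (matches_exclude)
            return false;
        if (have_include_rules)
            return matches_include;
        return true;
    }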
Hi
The above comments are fixed in the attached, as well as a pass over the docs
and extended tests to actually test matching a foreign server. What do think
about this version? I'm still not convinced that there aren't more quoting
bugs in the parser, but I've left that intact for now.
The problematic points are double quotes and the newline char. Anything else is just a sequence of bytes.
I have just one note on your patch. When you use pg_strncasecmp, you have to check the size too:
char *xxx = "incl";
int xxx_size = 4;
elog(NOTICE, ">>>>%d<<<<",
pg_strncasecmp(xxx, "include", xxx_size) == 0);
The result is NOTICE: >>>>1<<<<
but "incl" is not the keyword "include"
Regards
Pavel
--
Daniel Gustafsson https://vmware.com/
Hi
The above comments are fixed in the attached, as well as a pass over the docs
and extended tests to actually test matching a foreign server. What do think
about this version? I'm still not convinced that there aren't more quoting
bugs in the parser, but I've left that intact for now.
This patch is based on the version that you sent on 21.9; I just modified the string comparison in keyword detection. If we don't want to support abbreviated keywords (and I dislike that), then the size check is necessary. Everything else is unchanged.
Regards
Pavel
--
Daniel Gustafsson https://vmware.com/
Attachment
I took another pass over this today and touched up the documentation (docs and
code) as well as tweaked the code a bit here and there to both make it fit the
pg_dump style better and to clean up a few small things. I've also added a set
of additional tests to cover more of the functionality.

I'm still not happy with the docs, I need to take another look there and see if
I can make them more readable, but otherwise I don't think there are any open
issues with this.

As has been discussed upthread, this format strikes a compromise wrt simplicity
and doesn't preclude adding a more structured config file in the future should
we want that. I think this takes care of most comments and opinions made in
this thread.

--
Daniel Gustafsson https://vmware.com/
Attachment
pá 1. 10. 2021 v 15:19 odesílatel Daniel Gustafsson <daniel@yesql.se> napsal:
I took another pass over this today and touched up the documentation (docs and
code) as well as tweaked the code a bit here and there to both make it fit the
pg_dump style better and to clean up a few small things. I've also added a set
of additional tests to cover more of the functionality.
I'm still not happy with the docs, I need to take another look there and see if
I can make them more readable but otherwise I don't think there are any open issues
with this.
As has been discussed upthread, this format strikes a compromise wrt simplicity
and doesn't preclude adding a more structured config file in the future should
we want that. I think this takes care of most comments and opinions made in
this thread.
It looks good.
Thank you
Pavel
--
Daniel Gustafsson https://vmware.com/
On 10/1/21 3:19 PM, Daniel Gustafsson wrote:
>
> As has been discussed upthread, this format strikes a compromise wrt simplicity
> and doesn't preclude adding a more structured config file in the future should
> we want that. I think this takes care of most comments and opinions made in
> this thread.

Hi,

If you try to dump/restore a foreign table from a file_fdw server, the
restore step will complain and thus leave the return value nonzero. The
foreign table will be there, with complete 'data'.

A complete runnable example is a lot of work; I hope the below bits of
input and output make the problem clear. Main thing: the pg_restore output
contains 2 ERROR lines like:

  pg_restore: error: COPY failed for table "ireise1": ERROR: cannot insert into foreign table "ireise1"

----------------------
From the test bash:

echo "
include table table0              # ok public
include table test.table1         #
include foreign_data goethe       # foreign server 'goethe' (file_fdw)
include table gutenberg.ireise1   # foreign table
include table gutenberg.ireise2   # foreign table
" > inputfile1.txt

pg_dump --create -Fc -c -p $port -d $db1 -f dump1 --filter=inputfile1.txt
echo

# prepare for restore
server_name=goethe
echo "create schema if not exists test;"      | psql -qaXd $db2
echo "create schema if not exists gutenberg;" | psql -qaXd $db2
echo "create server if not exists $server_name foreign data wrapper file_fdw " \
  | psql -qaXd $db2

echo "-- pg_restore --if-exists -cvd $db2 dump1 "
pg_restore --if-exists -cvd $db2 dump1
rc=$?
echo "-- rc [$rc]"

----------------------
From the output:

-- pg_dump --create -Fc -c -p 6969 -d testdb1 -f dump1 --filter=inputfile1.txt
-- pg_restore --if-exists -cvd testdb2 dump1
pg_restore: connecting to database for restore
pg_restore: dropping TABLE table1
pg_restore: dropping TABLE table0
pg_restore: dropping FOREIGN TABLE ireise2
pg_restore: dropping FOREIGN TABLE ireise1
pg_restore: creating FOREIGN TABLE "gutenberg.ireise1"
pg_restore: creating COMMENT "gutenberg.FOREIGN TABLE ireise1"
pg_restore: creating FOREIGN TABLE "gutenberg.ireise2"
pg_restore: creating COMMENT "gutenberg.FOREIGN TABLE ireise2"
pg_restore: creating TABLE "public.table0"
pg_restore: creating TABLE "test.table1"
pg_restore: processing data for table "gutenberg.ireise1"
pg_restore: while PROCESSING TOC:
pg_restore: from TOC entry 5570; 0 23625 TABLE DATA ireise1 aardvark
pg_restore: error: COPY failed for table "ireise1": ERROR: cannot insert into foreign table "ireise1"
pg_restore: processing data for table "gutenberg.ireise2"
pg_restore: from TOC entry 5571; 0 23628 TABLE DATA ireise2 aardvark
pg_restore: error: COPY failed for table "ireise2": ERROR: cannot insert into foreign table "ireise2"
pg_restore: processing data for table "public.table0"
pg_restore: processing data for table "test.table1"
pg_restore: warning: errors ignored on restore: 2
-- rc [1]
---------

A second, separate practical hiccup is that schemas are not restored from the
dumped $schema.$table includes -- but this can be worked around; for my
inputfile1.txt I had to run separately (as seen above, before running the
pg_restore):

  create schema if not exists test;
  create schema if not exists gutenberg;
  create server if not exists goethe foreign data wrapper file_fdw;

A bit annoying but still maybe all right.

Thanks,

Erik Rijkers
On 10/1/21 6:19 PM, Erik Rijkers wrote:
> On 10/1/21 3:19 PM, Daniel Gustafsson wrote:
>>
>> As has been discussed upthread, this format strikes a compromise wrt
>> simplicity and doesn't preclude adding a more structured config file in
>> the future should we want that.
>
> If you try to dump/restore a foreign table from a file_fdw server, the
> restore step will complain and thus leave the return value nonzero. The
> foreign table will be there, with complete 'data'.
>
> A complete runnable example is a lot of work; I hope the below bits of
> input and output make the problem clear. Main thing: the pg_restore output
> contains 2 ERROR lines like:
>
> pg_restore: error: COPY failed for table "ireise1": ERROR: cannot
> insert into foreign table "ireise1"

Further testing makes clear that the file_fdw-addressing line

  include foreign_data goethe

was the culprit: it causes a COPY which of course fails in a read-only
wrapper like file_fdw. Without that line it works (because I run the restore
on the same machine, so the underlying file_fdw .txt files are there for
testdb2 too).

So the issue is not as serious as it seemed. The complaint remaining is only
that this could somehow be documented better.

I attach a running example (careful, it deletes stuff) of the original
ERROR-producing bash (remove the 'include foreign_data' line from the input
file to run it without error).

thanks,
Erik Rijkers
Attachment
> On 2 Oct 2021, at 08:18, Erik Rijkers <er@xs4all.nl> wrote:
>
> So the issue is not as serious as it seemed.

This is also not related to this patch in any way, or am I missing a point
here? This can just as well be achieved without this patch.

> The complaint remaining is only that this could somehow be documented better.

The pg_dump documentation today has a large highlighted note about this:

  "When --include-foreign-data is specified, pg_dump does not check that the
  foreign table is writable. Therefore, there is no guarantee that the results
  of a foreign table dump can be successfully restored."

This was extensively discussed [0] when this went in, is there additional
documentation you'd like to see for this?

--
Daniel Gustafsson https://vmware.com/

[0] https://postgr.es/m/LEJPR01MB0185483C0079D2F651B16231E7FC0@LEJPR01MB0185.DEUPRD01.PROD.OUTLOOK.DE
> On 1 Oct 2021, at 15:19, Daniel Gustafsson <daniel@yesql.se> wrote:
>
> I'm still not happy with the docs, I need to take another look there and see if
> I can make them more readable, but otherwise I don't think there are any open
> issues with this.

Attached is a rebased version which has rewritten docs which I think are more
in line with the pg_dump documentation. I've also added tests for
--strict-names operation, as well as subjected it to pgindent and pgperltidy.

Unless there are objections, I think this is pretty much ready to go in.

--
Daniel Gustafsson https://vmware.com/
Attachment
út 5. 10. 2021 v 14:30 odesílatel Daniel Gustafsson <daniel@yesql.se> napsal:
> On 1 Oct 2021, at 15:19, Daniel Gustafsson <daniel@yesql.se> wrote:
> I'm still not happy with the docs, I need to take another look there and see if
> I can make them more readable but otherwise I don't think there are any open issues
> with this.
Attached is a rebased version which has rewritten docs which I think are more
in line with the pg_dump documentation. I've also added tests for
--strict-names operation, as well subjected it to pgindent and pgperltidy.
Unless there are objections, I think this is pretty much ready to go in.
great, thank you
Pavel
--
Daniel Gustafsson https://vmware.com/
Op 05-10-2021 om 14:30 schreef Daniel Gustafsson:
>
> Unless there are objections, I think this is pretty much ready to go in.

Agreed.

One typo:
   'This keyword can only be with the exclude keyword.'
should be
   'This keyword can only be used with the exclude keyword.'

thanks,
Erik Rijkers
Hi
út 5. 10. 2021 v 14:30 odesílatel Daniel Gustafsson <daniel@yesql.se> napsal:
> On 1 Oct 2021, at 15:19, Daniel Gustafsson <daniel@yesql.se> wrote:
> I'm still not happy with the docs, I need to take another look there and see if
> I can make them more readable but otherwise I don't think there are any open issues
> with this.
Attached is a rebased version which has rewritten docs which I think are more
in line with the pg_dump documentation. I've also added tests for
--strict-names operation, as well subjected it to pgindent and pgperltidy.
Unless there are objections, I think this is pretty much ready to go in.
I am sending a rebased version of the patch pg_dump-filteropt-20211005.patch with fixed regression tests and fixed documentation (reported by Erik).
I found another issue - the StringInfo line buffer used in filter_get_pattern was released too early - the memory was used later in the check for unexpected
chars after the pattern string. I fixed it by moving this StringInfo buffer into the fstate structure. There it can be shared by all routines, and it can be safely
released at the end of filter processing, where we are sure the data is no longer needed.
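Roughly, the state then looks like this - a sketch with assumed field names, not the patch verbatim:

    /* sketch: parser state owning one line buffer for the whole run */
    typedef struct
    {
        const char *filename;      /* filter file name, for error messages */
        FILE       *fp;            /* open filter file, or stdin */
        int         lineno;        /* current line, for error messages */
        StringInfoData linebuff;   /* shared by all parsing routines; freed
                                    * only once filter processing ends */
    } FilterStateData;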
Regards
Pavel
--
Daniel Gustafsson https://vmware.com/
Attachment
Hi
fresh rebase
Regards
Pavel
Attachment
On 2022-04-25 Mo 13:39, Pavel Stehule wrote:
> Hi
>
> fresh rebase

If we're going to do this for pg_dump's include/exclude options, shouldn't we
also provide an equivalent facility for pg_dumpall's --exclude-database
option?

cheers

andrew

--
Andrew Dunstan
EDB: https://www.enterprisedb.com
st 13. 7. 2022 v 22:49 odesílatel Andrew Dunstan <andrew@dunslane.net> napsal:
On 2022-04-25 Mo 13:39, Pavel Stehule wrote:
> Hi
>
> fresh rebase
>
>
If we're going to do this for pg_dump's include/exclude options,
shouldn't we also provide an equivalent facility for pg_dumpall's
--exclude-database option?
That makes sense.
Regards
Pavel
cheers
andrew
--
Andrew Dunstan
EDB: https://www.enterprisedb.com
Hi
čt 14. 7. 2022 v 6:54 odesílatel Pavel Stehule <pavel.stehule@gmail.com> napsal:
st 13. 7. 2022 v 22:49 odesílatel Andrew Dunstan <andrew@dunslane.net> napsal:
On 2022-04-25 Mo 13:39, Pavel Stehule wrote:
> Hi
>
> fresh rebase
>
>
If we're going to do this for pg_dump's include/exclude options,
shouldn't we also provide an equivalent facility for pg_dumpall's
--exclude-database option?

That makes sense.
The attached patch implements the --filter option for pg_dumpall and for pg_restore too.
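For illustration only, a hypothetical pg_dumpall filter file under this patch's syntax - the database patterns here are invented examples:

    # mirror of --exclude-database, one pattern per line
    exclude database test_db_*
    exclude database template?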
Regards
Pavel
Attachment
ne 17. 7. 2022 v 16:01 odesílatel Justin Pryzby <pryzby@telsasoft.com> napsal:
Thanks for updating the patch.
This failed to build on windows.
http://cfbot.cputube.org/pavel-stehule.html
Yes, there was a significant problem with the exit_nicely function, which is implemented differently in pg_dump and pg_dumpall.
Some more comments inline.
On Sun, Jul 17, 2022 at 08:20:47AM +0200, Pavel Stehule wrote:
> The attached patch implements the --filter option for pg_dumpall and for
> pg_restore too.
> diff --git a/doc/src/sgml/ref/pg_dump.sgml b/doc/src/sgml/ref/pg_dump.sgml
> index 5efb442b44..ba2920dbee 100644
> --- a/doc/src/sgml/ref/pg_dump.sgml
> +++ b/doc/src/sgml/ref/pg_dump.sgml
> @@ -779,6 +779,80 @@ PostgreSQL documentation
> </listitem>
> </varlistentry>
>
> + <varlistentry>
> + <term><option>--filter=<replaceable class="parameter">filename</replaceable></option></term>
> + <listitem>
> + <para>
> + Specify a filename from which to read patterns for objects to include
> + or exclude from the dump. The patterns are interpreted according to the
> + same rules as the corresponding options:
> + <option>-t</option>/<option>--table</option> for tables,
> + <option>-n</option>/<option>--schema</option> for schemas,
> + <option>--include-foreign-data</option> for data on foreign servers and
> + <option>--exclude-table-data</option> for table data.
> + To read from <literal>STDIN</literal> use <filename>-</filename> as the
STDIN comma
fixed
> + <para>
> + Lines starting with <literal>#</literal> are considered comments and
> + are ignored. Comments can be placed after filter as well. Blank lines
change "are ignored" to "ignored", I think.
changed
> diff --git a/doc/src/sgml/ref/pg_dumpall.sgml b/doc/src/sgml/ref/pg_dumpall.sgml
> index 8a081f0080..137491340c 100644
> --- a/doc/src/sgml/ref/pg_dumpall.sgml
> +++ b/doc/src/sgml/ref/pg_dumpall.sgml
> @@ -122,6 +122,29 @@ PostgreSQL documentation
> </listitem>
> </varlistentry>
>
> + <varlistentry>
> + <term><option>--filter=<replaceable class="parameter">filename</replaceable></option></term>
> + <listitem>
> + <para>
> + Specify a filename from which to read patterns for databases excluded
> + from dump. The patterns are interpretted according to the same rules
> + like <option>--exclude-database</option>.
same rules *as*
fixed
> + To read from <literal>STDIN</literal> use <filename>-</filename> as the
comma
fixed
> + filename. The <option>--filter</option> option can be specified in
> + conjunction with the above listed options for including or excluding
For dumpall, remove "for including or"
change "above listed options" to "exclude-database" ?
fixed
> diff --git a/doc/src/sgml/ref/pg_restore.sgml b/doc/src/sgml/ref/pg_restore.sgml
> index 526986eadb..5f16c4a333 100644
> --- a/doc/src/sgml/ref/pg_restore.sgml
> +++ b/doc/src/sgml/ref/pg_restore.sgml
> @@ -188,6 +188,31 @@ PostgreSQL documentation
> </listitem>
> </varlistentry>
>
> + <varlistentry>
> + <term><option>--filter=<replaceable class="parameter">filename</replaceable></option></term>
> + <listitem>
> + <para>
> + Specify a filename from which to read patterns for objects excluded
> + or included from restore. The patterns are interpretted according to the
> + same rules like <option>--schema</option>, <option>--exclude-schema</option>,
s/like/as/
changed
> + <option>--function</option>, <option>--index</option>, <option>--table</option>
> + or <option>--trigger</option>.
> + To read from <literal>STDIN</literal> use <filename>-</filename> as the
STDIN comma
fixed
> +/*
> + * filter_get_keyword - read the next filter keyword from buffer
> + *
> + * Search for keywords (limited to containing ascii alphabetic characters) in
remove "containing"
fixed
> + /*
> + * If the object name pattern has been quoted we must take care parse out
> + * the entire quoted pattern, which may contain whitespace and can span
> + * over many lines.
quoted comma
*to parse
remove "over"
fixed
> + * The pattern is either simple without any whitespace, or properly quoted
double space
fixed
> + * in case there is whitespace in the object name. The pattern handling follows
s/is/may be/
fixed
> + if (size == 7 && pg_strncasecmp(keyword, "include", 7) == 0)
> + *is_include = true;
> + else if (size == 7 && pg_strncasecmp(keyword, "exclude", 7) == 0)
> + *is_include = false;
Can't you write strncasecmp(keyword, "include", size) to avoid hardcoding "7" ?
I need to compare the size of the keyword with the expected size, but I can use strlen(conststr). I wrote a new macro, is_keyword_str, to fix this issue
fixed
> +
> + if (size == 4 && pg_strncasecmp(keyword, "data", 4) == 0)
> + *objtype = FILTER_OBJECT_TYPE_DATA;
> + else if (size == 8 && pg_strncasecmp(keyword, "database", 8) == 0)
> + *objtype = FILTER_OBJECT_TYPE_DATABASE;
> + else if (size == 12 && pg_strncasecmp(keyword, "foreign_data", 12) == 0)
> + *objtype = FILTER_OBJECT_TYPE_FOREIGN_DATA;
> + else if (size == 8 && pg_strncasecmp(keyword, "function", 8) == 0)
> + *objtype = FILTER_OBJECT_TYPE_FUNCTION;
> + else if (size == 5 && pg_strncasecmp(keyword, "index", 5) == 0)
> + *objtype = FILTER_OBJECT_TYPE_INDEX;
> + else if (size == 6 && pg_strncasecmp(keyword, "schema", 6) == 0)
> + *objtype = FILTER_OBJECT_TYPE_SCHEMA;
> + else if (size == 5 && pg_strncasecmp(keyword, "table", 5) == 0)
> + *objtype = FILTER_OBJECT_TYPE_TABLE;
> + else if (size == 7 && pg_strncasecmp(keyword, "trigger", 7) == 0)
> + *objtype = FILTER_OBJECT_TYPE_TRIGGER;
Avoid hardcoding these constants.
fixed
> diff --git a/src/bin/pg_dump/filter.h b/src/bin/pg_dump/filter.h
> new file mode 100644
> index 0000000000..e4a1a74b10
> --- /dev/null
> +++ b/src/bin/pg_dump/filter.h
...
> \ No newline at end of file
fixed
updated patch attached
Regards
Pavel
:(
Attachment
Hi
I am sending a fresh rebase plus enhanced tests for pg_dumpall and pg_restore
Regards
Pavel
Attachment
As noted upthread at some point, I'm not overly excited about the parser in
filter.c, for maintainability and readability reasons. So, I've reimplemented
the parser in Flex/Bison in the attached patch, which IMHO provides a clear(er)
picture of the grammar and is more per project standards. This version of the
patch is your latest version with just the parser replaced (at a reduction in
size as a side benefit).

All features supported in your latest patch version are present, and it passes
all the tests added by this patch. It's been an undisclosed amount of years
since I wrote a Bison parser (well, yacc really) from scratch so I don't rule
out having made silly mistakes. I would very much appreciate review from those
more well versed in this area.

One thing this patch version currently lacks is refined error messaging, but if
we feel that this approach is a viable path then that can be tweaked. The
function which starts the parser can also be refactored to be shared across
pg_dump, pg_dumpall and pg_restore but I've kept it simple for now.

Thoughts? It would be nice to get this patch across the finish line during this
commitfest.

--
Daniel Gustafsson https://vmware.com/
Attachment
Op 07-09-2022 om 21:45 schreef Daniel Gustafsson:
>
> One thing this patch version currently lacks is refined error messaging, but if
> we feel that this approach is a viable path then that can be tweaked. The
> function which starts the parser can also be refactored to be shared across
> pg_dump, pg_dumpall and pg_restore but I've kept it simple for now.
>
> Thoughts? It would be nice to get this patch across the finish line during this
> commitfest.

[0001-Add-include-exclude-filtering-via-file-in-pg_dump.patch]

This seems to dump & restore well (as Pavel's patch does).

I did notice one peculiarity (in your patch) where for each table a few
spaces are emitted by pg_dump.

-------------
#! /bin/bash
psql -qXc "drop database if exists testdb2"
psql -qXc "create database testdb2"

echo "
create schema if not exists test;
create table table0 (id integer);
create table table1 (id integer);
insert into table0 select n from generate_series(1,2) as f(n);
insert into table1 select n from generate_series(1,2) as f(n);
" | psql -qXad testdb2

echo "include table table0" > inputfile1.txt

echo "include table table0
include table table1" > inputfile2.txt

# 1 table, emits 2 spaces
echo -ne ">"
pg_dump -F p -f plainfile1 --filter=inputfile1.txt -d testdb2
echo "<"

# 2 tables, emits 4 spaces
echo -ne ">"
pg_dump -F p -f plainfile2 --filter=inputfile2.txt -d testdb2
echo "<"

# dump without filter emits no spaces
echo -ne ">"
pg_dump -F c -f plainfile3 -t table0 -t table1 -d testdb2
echo "<"
-------------

It's probably a small thing -- but I didn't find it.

thanks,
Erik Rijkers
> On 8 Sep 2022, at 12:00, Erik Rijkers <er@xs4all.nl> wrote:
>
> Op 07-09-2022 om 21:45 schreef Daniel Gustafsson:
>> One thing this patch version currently lacks is refined error messaging, but if
>> we feel that this approach is a viable path then that can be tweaked.
>
> [0001-Add-include-exclude-filtering-via-file-in-pg_dump.patch]
>
> This seems to dump & restore well (as Pavel's patch does).

Thanks for looking!

> I did notice one peculiarity (in your patch) where for each table a few
> spaces are emitted by pg_dump.

Right, I had that on my TODO to fix before submitting but clearly forgot. It
boils down to consuming the space between commands and object types and object
patterns. The attached v2 fixes that.

--
Daniel Gustafsson https://vmware.com/
Attachment
Hi,

On Thu, Sep 08, 2022 at 01:38:42PM +0200, Daniel Gustafsson wrote:
> > On 8 Sep 2022, at 12:00, Erik Rijkers <er@xs4all.nl> wrote:
> >
> > Op 07-09-2022 om 21:45 schreef Daniel Gustafsson:
> >> One thing this patch version currently lacks is refined error messaging, but if
> >> we feel that this approach is a viable path then that can be tweaked. The
> >> function which starts the parser can also be refactored to be shared across
> >> pg_dump, pg_dumpall and pg_restore but I've kept it simple for now.
> >> Thoughts? It would be nice to get this patch across the finish line during this
> >> commitfest.
> >
> > [0001-Add-include-exclude-filtering-via-file-in-pg_dump.patch]
> >
> > This seems to dump & restore well (as Pavels patch does).
>
> Thanks for looking!
>
> > I did notice one peculiarity (in your patch) where for each table a few spaces are emitted by pg_dump.
>
> Right, I had that on my TODO to fix before submitting but clearly forgot. It
> boils down to consuming the space between commands and object types and object
> patterns. The attached v2 fixes that.

I only had a quick look at the parser, and one thing that strikes me is:

+Patterns:
+    /* EMPTY */
+    | Patterns Pattern
+    | Pattern
+    ;
+
+Pattern:
+    C_INCLUDE include_object pattern    { include_item(priv, $2, $3); }

It seems confusing to mix Pattern(s) (the rules) and pattern (the token). Maybe instead using Include(s) or Item(s) on the bison side, and/or name_pattern on the lexer side?
> On 8 Sep 2022, at 13:44, Julien Rouhaud <rjuju123@gmail.com> wrote:
> On Thu, Sep 08, 2022 at 01:38:42PM +0200, Daniel Gustafsson wrote:
>>> On 8 Sep 2022, at 12:00, Erik Rijkers <er@xs4all.nl> wrote:
>>> I did notice one peculiarity (in your patch) where for each table a few spaces are emitted by pg_dump.
>>
>> Right, I had that on my TODO to fix before submitting but clearly forgot. It
>> boils down to consuming the space between commands and object types and object
>> patterns. The attached v2 fixes that.
>
> I only had a quick look at the parser,

Thanks for looking!

> .. and one thing that strikes me is:
>
> +Patterns:
> +    /* EMPTY */
> +    | Patterns Pattern
> +    | Pattern
> +    ;
> +
> +Pattern:
> +    C_INCLUDE include_object pattern    { include_item(priv, $2, $3); }
>
> It seems confusing to mix Pattern(s) (the rules) and pattern (the token).
> Maybe instead using Include(s) or Item(s) on the bison side, and/or
> name_pattern on the lexer side?

That makes a lot of sense, I renamed the rules in the parser but kept them in the lexer since that seemed like the clearest scheme. Also in the attached is a small refactoring to share parser init between pg_dump and pg_restore (pg_dumpall shares little with these so not there for now), buffer resize overflow calculation and some error message tweaking.

--
Daniel Gustafsson https://vmware.com/
Attachment
Hi
st 7. 9. 2022 v 21:46 odesílatel Daniel Gustafsson <daniel@yesql.se> napsal:
As noted upthread at some point, I'm not overly excited about the parser in
filter.c, for maintainability and readability reasons. So, I've reimplemented
the parser in Flex/Bison in the attached patch, which IMHO provides a clear(er)
picture of the grammar and is more per project standards. This version of the
patch is your latest version with just the parser replaced (at a reduction in
size as a side benefit).
All features supported in your latest patch version are present, and it passes
all the tests added by this patch. It's been an undisclosed amount of years
since I wrote a Bison parser (well, yacc really) from scratch so I don't rule
out having made silly mistakes. I would very much appreciate review from those
more well versed in this area.
One thing this patch version currently lacks is refined error messaging, but if
we feel that this approach is a viable path then that can be tweaked. The
function which starts the parser can also be refactored to be shared across
pg_dump, pg_dumpall and pg_restore but I've kept it simple for now.
Thoughts? It would be nice to get this patch across the finish line during this
commitfest.
I have no objections to this, and thank you for trying to move this patch forward.
Regards
Pavel
--
Daniel Gustafsson https://vmware.com/
On Thu, Sep 8, 2022 at 7:32 PM Daniel Gustafsson <daniel@yesql.se> wrote:
> [v3]

Note that the grammar has shift-reduce conflicts. If you run a fairly recent Bison, you can show them like this:

bison -Wno-deprecated -Wcounterexamples -d -o filterparse.c filterparse.y

filterparse.y: warning: 2 shift/reduce conflicts [-Wconflicts-sr]
filterparse.y: warning: shift/reduce conflict on token C_INCLUDE [-Wcounterexamples]
  Example: • C_INCLUDE include_object pattern
  Shift derivation
    Filters
    ↳ 3: Filter
         ↳ 4: • C_INCLUDE include_object pattern
  Reduce derivation
    Filters
    ↳ 2: Filters    Filter
         ↳ 1: ε •   ↳ 4: C_INCLUDE include_object pattern
filterparse.y: warning: shift/reduce conflict on token C_EXCLUDE [-Wcounterexamples]
  Example: • C_EXCLUDE exclude_object pattern
  Shift derivation
    Filters
    ↳ 3: Filter
         ↳ 5: • C_EXCLUDE exclude_object pattern
  Reduce derivation
    Filters
    ↳ 2: Filters    Filter
         ↳ 1: ε •   ↳ 5: C_EXCLUDE exclude_object pattern

--
John Naylor
EDB: http://www.enterprisedb.com
> On Sep 9, 2022, at 5:53 PM, John Naylor <john.naylor@enterprisedb.com> wrote:
>
> On Thu, Sep 8, 2022 at 7:32 PM Daniel Gustafsson <daniel@yesql.se> wrote:
>> [v3]
>
> Note that the grammar has shift-reduce conflicts. If you run a fairly
> recent Bison, you can show them like this:
>
> bison -Wno-deprecated -Wcounterexamples -d -o filterparse.c filterparse.y
>
> filterparse.y: warning: 2 shift/reduce conflicts [-Wconflicts-sr]
> filterparse.y: warning: shift/reduce conflict on token C_INCLUDE [-Wcounterexamples]
>   Example: • C_INCLUDE include_object pattern
>   Shift derivation
>     Filters
>     ↳ 3: Filter
>          ↳ 4: • C_INCLUDE include_object pattern
>   Reduce derivation
>     Filters
>     ↳ 2: Filters    Filter
>          ↳ 1: ε •   ↳ 4: C_INCLUDE include_object pattern
> filterparse.y: warning: shift/reduce conflict on token C_EXCLUDE [-Wcounterexamples]
>   Example: • C_EXCLUDE exclude_object pattern
>   Shift derivation
>     Filters
>     ↳ 3: Filter
>          ↳ 5: • C_EXCLUDE exclude_object pattern
>   Reduce derivation
>     Filters
>     ↳ 2: Filters    Filter
>          ↳ 1: ε •   ↳ 5: C_EXCLUDE exclude_object pattern

Looks like the last rule for Filters should not be there. I do wonder whether we should be using bison/flex here, seems like using a sledgehammer to crack a nut.

Cheers

Andrew
> On 9 Sep 2022, at 11:00, Andrew Dunstan <andrew@dunslane.net> wrote:
>
>> On Sep 9, 2022, at 5:53 PM, John Naylor <john.naylor@enterprisedb.com> wrote:
>>
>> Note that the grammar has shift-reduce conflicts.

> Looks like the last rule for Filters should not be there.

Correct, fixed in the attached.

> I do wonder whether we should be using bison/flex here, seems like using a
> sledgehammer to crack a nut.

I don't think the capabilities of the tool are all that interesting compared to the long term maintainability and readability of the source code. Personally I think a simple Bison/Flex parser is easier to read and reason about than the corresponding one written in C.

--
Daniel Gustafsson https://vmware.com/
Attachment
po 12. 9. 2022 v 9:59 odesílatel Daniel Gustafsson <daniel@yesql.se> napsal:
> On 9 Sep 2022, at 11:00, Andrew Dunstan <andrew@dunslane.net> wrote:
>
>> On Sep 9, 2022, at 5:53 PM, John Naylor <john.naylor@enterprisedb.com> wrote:
>>
>> Note that the grammar has shift-reduce conflicts.
> Looks like the last rule for Filters should not be there.
Correct, fixed in the attached.
> I do wonder whether we should be using bison/flex here, seems like using a
> sledgehammer to crack a nut.
I don't think the capabilities of the tool are all that interesting compared to the
long term maintainability and readability of the source code. Personally I
think a simple Bison/Flex parser is easier to read and reason about than the
corresponding one written in C.
Now that this work is done, there is no reason to throw it away. The Bison/Flex parser does the same work and, it is true, the code is more readable. Although for this case, a hand-written parser was trivial too.
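To illustrate what "trivial" means here, the core of a hand-written version is little more than this (a simplified sketch, not the exact code from my patch; pg_strncasecmp comes from src/port):

#include <ctype.h>
#include <stdbool.h>
#include <string.h>

extern int  pg_strncasecmp(const char *s1, const char *s2, size_t n);

/*
 * Sketch: recognize the "include"/"exclude" keyword at the start of a
 * filter line, advancing *line past it. Returns false for comments,
 * empty lines and anything unrecognized.
 */
static bool
parse_command(const char **line, bool *is_include)
{
    const char *ptr = *line;

    while (isspace((unsigned char) *ptr))
        ptr++;

    if (pg_strncasecmp(ptr, "include", 7) == 0 &&
        !isalpha((unsigned char) ptr[7]))
    {
        *is_include = true;
        *line = ptr + 7;
        return true;
    }
    if (pg_strncasecmp(ptr, "exclude", 7) == 0 &&
        !isalpha((unsigned char) ptr[7]))
    {
        *is_include = false;
        *line = ptr + 7;
        return true;
    }
    return false;
}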
Regards
Pavel
--
Daniel Gustafsson https://vmware.com/
Op 12-09-2022 om 09:58 schreef Daniel Gustafsson:
>> On 9 Sep 2022, at 11:00, Andrew Dunstan <andrew@dunslane.net> wrote:
>>
>>> On Sep 9, 2022, at 5:53 PM, John Naylor <john.naylor@enterprisedb.com> wrote:
>>> [v4-0001-Add-include-exclude-filtering-via-file-in-pg_dump.patch]

I noticed that pg_restore --filter cannot, or at least not always, be used with the same filter-file that was used to produce a dump with pg_dump --filter.

Is that as designed? It seems a bit counterintuitive. It'd be nice if that could be fixed. Admittedly, the 'same' problem exists in pg_restore -t, which is also less than ideal.

(A messy bash demo below)

thanks,

Erik Rijkers

#! /bin/bash
db2='testdb2'
db3='testdb3'
db2='testdb_source'
db3='testdb_target'

sql_dropdb="drop database if exists $db2; drop database if exists $db3;"
sql_createdb="create database $db2; create database $db3;"

schema1=s1
table1=table1
t1=$schema1.$table1
schema2=s2
table2=table2
t2=$schema2.$table2

sql_schema_init="create schema if not exists $schema1; create schema if not exists $schema2;"
sql_test="select '$t1', n from $t1 order by n; select '$t2', n from $t2 order by n;"

function sqltest()
{
  for database_name in $db2 $db3; do
    port_used=$( echo "show port" | psql -qtAX -d $database_name )
    echo -n "-- $database_name ($port_used): "
    echo "$sql_test" | psql -qtAX -a -d $database_name | md5sum
  done
  echo
}

echo "setting up orig db $db2, target db $db3"
echo "$sql_dropdb"   | psql -qtAX
echo "$sql_createdb" | psql -qtAX

psql -X -d $db2 << SQL
$sql_schema_init
create table $t1 as select n from generate_series(1, (10^1)::int) as f(n);
create table $t2 as select n from generate_series(2, (10^2)::int) as f(n);
SQL

echo "
include table $t1
include table $t2
# include schema $s1
# include schema $s2
" > inputfile1.txt

# in filter; out plain
echo "-- pg_dump -F p -f plainfile1 --filter=inputfile1.txt -d $db2"
pg_dump -F p -f plainfile1 --filter=inputfile1.txt -d $db2

echo "$sql_schema_init" | psql -qX -d $db3

echo "-- pg_restore -d $db3 dumpfile1"
pg_restore -d $db3 dumpfile1
rc=$?
echo "-- pg_restore returned [$rc] -- pg_restore without --filter"
sqltest

# enable this to see it fail
if [[ 1 -eq 1 ]]
then
  # clean out
  echo "drop schema $schema1 cascade; drop schema $schema2 cascade;" | psql -qtAXad $db3

  echo "$sql_schema_init" | psql -qX -d $db3

  echo "-- pg_restore -d $db3 --filter=inputfile1.txt dumpfile1"
  pg_restore -d $db3 --filter=inputfile1.txt dumpfile1
  rc=$?
  echo "-- pg_restore returned [$rc] -- pg_restore with --filter"
  sqltest
fi
Op 12-09-2022 om 16:00 schreef Erik Rijkers:
> Op 12-09-2022 om 09:58 schreef Daniel Gustafsson:
>>> On 9 Sep 2022, at 11:00, Andrew Dunstan <andrew@dunslane.net> wrote:
>>>
>>>> On Sep 9, 2022, at 5:53 PM, John Naylor
>>>> <john.naylor@enterprisedb.com> wrote:
>
>>>> [v4-0001-Add-include-exclude-filtering-via-file-in-pg_dump.patch]
>
> I noticed that pg_restore --filter cannot, or at least not always, be
> used with the same filter-file that was used to produce a dump with
> pg_dump --filter.
>
> Is that as designed? It seems a bit counterintuitive. It'd be nice if
> that could be fixed. Admittedly, the 'same' problem exists in pg_restore -t,
> which is also less than ideal.
>
> (A messy bash demo below)

I hope the issue is still clear, even though in the bash I sent I messed up the dumpfile name (i.e., pg_dump creates a different dump name than what is given to pg_restore; they should use the same dump name, obviously).

>
> thanks,
>
> Erik Rijkers
On Mon, Sep 12, 2022 at 8:10 PM Pavel Stehule <pavel.stehule@gmail.com> wrote:
>
> po 12. 9. 2022 v 9:59 odesílatel Daniel Gustafsson <daniel@yesql.se> napsal:
>> I don't think the capabilities of the tool are all that interesting compared to the
>> long term maintainability and readability of the source code.

With make distprep and maintainer-clean, separate makefile and MSVC build logic a short time before converting to Meson, I'm not sure that even the short term maintainability here is a good trade off for what we're getting.

> The Bison/Flex parser does the same work and, it is true, the code is more readable. Although for this case, a hand-written parser was trivial too.

If the hand-written version is trivial, then we should prefer it.

--
John Naylor
EDB: http://www.enterprisedb.com
út 13. 9. 2022 v 10:46 odesílatel John Naylor <john.naylor@enterprisedb.com> napsal:
On Mon, Sep 12, 2022 at 8:10 PM Pavel Stehule <pavel.stehule@gmail.com> wrote:
>
> po 12. 9. 2022 v 9:59 odesílatel Daniel Gustafsson <daniel@yesql.se> napsal:
>> I don't the capabilities of the tool is all that interesting compared to the
>> long term maintainability and readability of the source code.
With make distprep and maintainer-clean, separate makefile and MSVC
build logic a short time before converting to Meson, I'm not sure that
even the short term maintainability here is a good trade off for what
we're getting.
> The Bison/Flex parser does the same work and, it is true, the code is more readable. Although for this case, a hand-written parser was trivial too.
If the hand-written version is trivial, then we should prefer it.
Please, can you check and compare both versions? My view is subjective.
Regards
Pavel
--
John Naylor
EDB: http://www.enterprisedb.com
Hi, On 2022-09-12 09:58:37 +0200, Daniel Gustafsson wrote: > > On 9 Sep 2022, at 11:00, Andrew Dunstan <andrew@dunslane.net> wrote: > > > >> On Sep 9, 2022, at 5:53 PM, John Naylor <john.naylor@enterprisedb.com> wrote: > >> > >> Note that the grammar has shift-reduce conflicts. > > > Looks like the last rule for Filters should not be there. > > Correct, fixed in the attached. Due to the merge of the meson build, this patch now needs to adjust the relevant meson.build. This is the cause of the failures at: https://cirrus-ci.com/build/5788292678418432 See e.g. src/bin/pgbench/meson.build Greetings, Andres Freund
Hi, On 2022-09-12 09:58:37 +0200, Daniel Gustafsson wrote: > Correct, fixed in the attached. Updated patch adding meson compatibility attached. Greetings, Andres Freund
Attachment
Hi, On 2022-10-01 23:56:59 -0700, Andres Freund wrote: > On 2022-09-12 09:58:37 +0200, Daniel Gustafsson wrote: > > Correct, fixed in the attached. > > Updated patch adding meson compatibility attached. Err, forgot to amend one hunk :( Greetings, Andres Freund
Attachment
Hi,

On 2022-10-02 00:19:59 -0700, Andres Freund wrote:
> On 2022-10-01 23:56:59 -0700, Andres Freund wrote:
> > On 2022-09-12 09:58:37 +0200, Daniel Gustafsson wrote:
> > > Correct, fixed in the attached.
> >
> > Updated patch adding meson compatibility attached.
>
> Err, forgot to amend one hunk :(

That fixed it on all platforms but windows, due to copy-pasto. I really should have stopped earlier yesterday...

> +/*-------------------------------------------------------------------------
> + *
> + * filter.h
> + *    Common header file for the parser of filter file
> + *
> + * Portions Copyright (c) 1996-2022, PostgreSQL Global Development Group
> + * Portions Copyright (c) 1994, Regents of the University of California
> + *
> + * src/bin/pg_dump/filter.h
> + *
> + *-------------------------------------------------------------------------
> + */
> +#ifndef FILTER_H
> +#define FILTER_H
> +#include "c.h"

c.h (and postgres.h, postgres_fe.h) shouldn't be included in headers.

This is a common enough mistake that I'm wondering if we could automate warning about it somehow.

Greetings,

Andres Freund
Attachment
> On 2 Oct 2022, at 18:04, Andres Freund <andres@anarazel.de> wrote:
> On 2022-10-02 00:19:59 -0700, Andres Freund wrote:
>> On 2022-10-01 23:56:59 -0700, Andres Freund wrote:
>>> On 2022-09-12 09:58:37 +0200, Daniel Gustafsson wrote:
>>>> Correct, fixed in the attached.
>>>
>>> Updated patch adding meson compatibility attached.
>>
>> Err, forgot to amend one hunk :(
>
> That fixed it on all platforms but windows, due to copy-pasto. I really should
> have stopped earlier yesterday...

Thanks for updating the patch!

The parser in the original submission was -1'd by me, and the current version proposed as an alternative. This was subsequently -1'd as well but no updated patch with a rewritten parser has been posted. So this is now stalled again.

Having been around in 12 commitfests without a committer feeling confident about pushing this I plan to mark it returned with feedback, and if a new parser materializes it can be re-added instead of being dragged along.

> c.h (and postgres.h, postgres_fe.h) shouldn't be included in headers.
>
> This is a common enough mistake that I'm wondering if we could automate
> warning about it somehow.

Maybe we can add a simple git grep invocation in the CompilerWarnings CI job to catch this in the CFBot? If something like the below sketch matches then we can throw an error. (Only for illustration; all three files need to be checked.)

git grep "\"c\.h" -- *.h :^src/include/postgres*.h;

--
Daniel Gustafsson https://vmware.com/
Daniel Gustafsson <daniel@yesql.se> writes: > On 2 Oct 2022, at 18:04, Andres Freund <andres@anarazel.de> wrote: >> c.h (and postgres.h, postgres_fe.h) shouldn't be included in headers. >> This is a common enough mistake that I'm wondering if we could automate >> warning about it somehow. > Maybe we can add a simple git grep invocation in the CompilerWarnings CI job to > catch this in the CFBot? I'd be inclined to teach headerscheck or cpluspluscheck about it. regards, tom lane
On 2022-10-02 22:52:33 +0200, Daniel Gustafsson wrote:
> The parser in the original submission was -1'd by me, and the current version
> proposed as an alternative. This was subsequently -1'd as well but no updated
> patch with a rewritten parser has been posted. So this is now stalled again.
>
> Having been around in 12 commitfests without a committer feeling confident
> about pushing this I plan to mark it returned with feedback, and if a new
> parser materializes it can be re-added instead of being dragged along.

Makes sense to me.
ne 2. 10. 2022 v 22:52 odesílatel Daniel Gustafsson <daniel@yesql.se> napsal:
> On 2 Oct 2022, at 18:04, Andres Freund <andres@anarazel.de> wrote:
> On 2022-10-02 00:19:59 -0700, Andres Freund wrote:
>> On 2022-10-01 23:56:59 -0700, Andres Freund wrote:
>>> On 2022-09-12 09:58:37 +0200, Daniel Gustafsson wrote:
>>>> Correct, fixed in the attached.
>>>
>>> Updated patch adding meson compatibility attached.
>>
>> Err, forgot to amend one hunk :(
>
> That fixed it on all platforms but windows, due to copy-pasto. I really should
> have stopped earlier yesterday...
Thanks for updating the patch!
The parser in the original submission was -1'd by me, and the current version
proposed as an alternative. This was subsequently -1'd as well but no updated
patch with a rewritten parser has been posted. So this is now stalled again.
You started rewriting it, but you didn't finish it.
Unfortunately, there is no clear consensus on using a Bison parser for this purpose. I understand that the complexity of this language is low, so the benefit of using a Bison grammar is low too. Personally, I have no problem using Bison for this purpose. For this case, I think we are comparing two similarly long ways, but unfortunately, customers that have a problem with long command lines still have this problem.
Can we go forward? Daniel is strongly against a hand-written parser. Is there somebody strongly against a Bison-based parser? There is no other way.
I am able to complete Daniel's patch, if there are no objections.
Regards
Pavel
Having been around in 12 commitfests without a committer feeling confident
about pushing this I plan to mark it returned with feedback, and if a new
parser materializes it can be re-added instead of being dragged along.
> c.h (and postgres.h, postgres_fe.h) shouldn't be included in headers.
>
> This is a common enough mistake that I'm wondering if we could automate
> warning about it somehow.
Maybe we can add a simple git grep invocation in the CompilerWarnings CI job to
catch this in the CFBot? If something like the below sketch matches then we
can throw an error. (only for illustration, all three files needs to checked).
git grep "\"c\.h" -- *.h :^src/include/postgres*.h;
--
Daniel Gustafsson https://vmware.com/
Hi,

On Mon, Oct 03, 2022 at 06:00:12AM +0200, Pavel Stehule wrote:
> ne 2. 10. 2022 v 22:52 odesílatel Daniel Gustafsson <daniel@yesql.se>
> napsal:
>
> > > On 2 Oct 2022, at 18:04, Andres Freund <andres@anarazel.de> wrote:
> > > On 2022-10-02 00:19:59 -0700, Andres Freund wrote:
> > >> On 2022-10-01 23:56:59 -0700, Andres Freund wrote:
> > >>> On 2022-09-12 09:58:37 +0200, Daniel Gustafsson wrote:
> > >>>> Correct, fixed in the attached.
> > >>>
> > >>> Updated patch adding meson compatibility attached.
> > >>
> > >> Err, forgot to amend one hunk :(
> > >
> > > That fixed it on all platforms but windows, due to copy-pasto. I really
> > > should have stopped earlier yesterday...
> >
> > Thanks for updating the patch!
> >
> > The parser in the original submission was -1'd by me, and the current
> > version proposed as an alternative. This was subsequently -1'd as well
> > but no updated patch with a rewritten parser has been posted. So this
> > is now stalled again.
>
> You started rewriting it, but you didn't finish it.
>
> Unfortunately, there is no clear consensus on using a Bison parser for
> this purpose. I understand that the complexity of this language is low,
> so the benefit of using a Bison grammar is low too. Personally, I have
> no problem using Bison for this purpose. For this case, I think we are
> comparing two similarly long ways, but unfortunately, customers that
> have a problem with long command lines still have this problem.
>
> Can we go forward? Daniel is strongly against a hand-written parser. Is
> there somebody strongly against a Bison-based parser? There is no other
> way.

I don't have a strong opinion either, but it seems that 2 people argued against a bison parser (vs only 1 arguing for), and given that the current habit is to rely on hand-written parsers for simple cases (e.g. jsonapi.c / pg_parse_json()), it seems that we should go back to Pavel's original parser.

I only had a quick look but it indeed seems trivial, it just maybe needs a bit of refactoring to avoid some code duplication (getFiltersFromFile is duplicated, and getDatabaseExcludeFiltersFromFile could be removed if getFiltersFromFile knew about the 2 patterns).
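For instance something with this shape (only a sketch, all names here are invented):

#include <stdbool.h>

/* invented for this sketch */
typedef enum
{
    FILTER_TABLE,
    FILTER_SCHEMA,
    FILTER_DATABASE
} FilterObjectType;

typedef void (*filter_item_handler) (bool is_include,
                                     FilterObjectType objtype,
                                     const char *pattern,
                                     void *arg);

static void
read_filters_from_file(const char *filename,
                       filter_item_handler handler,
                       void *arg)
{
    /*
     * open the file, parse it line by line, and hand every
     * (is_include, objtype, pattern) triplet to handler(..., arg);
     * each caller then just appends to its own lists
     */
}

Each of the three binaries would then only provide a small handler that appends to its own option lists.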
Hi
I am sending a version with a hand-written parser and meson support.
po 3. 10. 2022 v 6:34 odesílatel Julien Rouhaud <rjuju123@gmail.com> napsal:
Hi,
> You started rewriting it, but you didn't finish it.
>
> Unfortunately, there is no clear consensus on using a Bison parser for
> this purpose. I understand that the complexity of this language is low,
> so the benefit of using a Bison grammar is low too. Personally, I have
> no problem using Bison for this purpose. For this case, I think we are
> comparing two similarly long ways, but unfortunately, customers that
> have a problem with long command lines still have this problem.
>
> Can we go forward? Daniel is strongly against a hand-written parser. Is there
> somebody strongly against a Bison-based parser? There is no other way.
I don't have a strong opinion either, but it seems that 2 people argued against
a bison parser (vs only 1 arguing for) and the fact that the current habit is
to rely on hand written parsers for simple cases (e.g. jsonapi.c /
pg_parse_json()), it seems that we should go back to Pavel's original parser.
I only had a quick look but it indeed seems trivial, it just maybe need a bit
of refactoring to avoid some code duplication (getFiltersFromFile is
duplicated, and getDatabaseExcludeFiltersFromFile could be removed if
getFiltersFromFile knew about the 2 patterns).
I checked this code again, and I don't think such refactoring is easy. getFiltersFromFile is not duplicated; it is just probably badly named.
These routines are used from pg_dump, pg_dumpall and pg_restore. There are significant differences in the supported objects and in the types used for the returned lists (DumpOptions, SimpleStringList, and RestoreOptions). If I had one routine, then I would need to implement some mechanism for specifying the supported objects, and a special type that could be used as a proxy between caller and parser to hold the lists of parsed values. To make the names less confusing I renamed them to read_dump_filters, read_dumpall_filters and read_restore_filters.
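The three wrappers end up with roughly these signatures (a sketch; the exact parameter lists in the patch may differ):

/* pg_dump: fills the include/exclude lists in DumpOptions */
static void read_dump_filters(const char *filename, DumpOptions *dopt);

/* pg_dumpall: only database include/exclude patterns make sense here */
static void read_dumpall_filters(const char *filename,
                                 SimpleStringList *pattern);

/* pg_restore: fills the lists in RestoreOptions */
static void read_restore_filters(const char *filename, RestoreOptions *opts);

Each wrapper owns the knowledge of which object types its application supports and where the parsed patterns go.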
Regards
Pavel
Attachment
On Fri, Oct 07, 2022 at 07:26:08AM +0200, Pavel Stehule wrote: > > I checked this code again, and I don't think some refactoring is easy. > getFiltersFromFile is not duplicated. It is just probably badly named. > > These routines are used from pg_dump, pg_dumpall and pg_restore. There are > significant differences in supported objects and in types used for returned > lists (dumpOptions, SimpleStringList, and RestoreOptions). If I have one > routine, then I need to implement some mechanism for specification of > supported objects, and a special type that can be used as a proxy between > caller and parser to hold lists of parsed values. To be names less > confusing I renamed them to read_dump_filters, read_dumpall_filters and > read_restore_filters Ah right, I missed the different argument types. Now that the functions have improved names it looks way clearer, and it seems just fine!
pá 7. 10. 2022 v 16:03 odesílatel Julien Rouhaud <rjuju123@gmail.com> napsal:
On Fri, Oct 07, 2022 at 07:26:08AM +0200, Pavel Stehule wrote:
>
> I checked this code again, and I don't think some refactoring is easy.
> getFiltersFromFile is not duplicated. It is just probably badly named.
>
> These routines are used from pg_dump, pg_dumpall and pg_restore. There are
> significant differences in supported objects and in types used for returned
> lists (dumpOptions, SimpleStringList, and RestoreOptions). If I have one
> routine, then I need to implement some mechanism for specification of
> supported objects, and a special type that can be used as a proxy between
> caller and parser to hold lists of parsed values. To be names less
> confusing I renamed them to read_dump_filters, read_dumpall_filters and
> read_restore_filters
Ah right, I missed the different argument types. Now that the functions have
improved names it looks way clearer, and it seems just fine!
Thank you for check
Pavel
Hi, On 2022-10-07 07:26:08 +0200, Pavel Stehule wrote: > I am sending version with handy written parser and meson support Given this is a new approach it seems inaccurate to have the CF entry marked ready-for-committer. I've updated it to needs-review. Greetings, Andres Freund
On Thu, Oct 13, 2022 at 11:46:34AM -0700, Andres Freund wrote:
> Hi,
>
> On 2022-10-07 07:26:08 +0200, Pavel Stehule wrote:
> > I am sending a version with a hand-written parser and meson support.
>
> Given this is a new approach it seems inaccurate to have the CF entry marked
> ready-for-committer. I've updated it to needs-review.

I just had a quick look at the rest of the patch.

For the parser, it seems that filter_get_pattern is reimplementing an identifier parsing function but isn't entirely correct. It can correctly parse quoted non-qualified identifiers and non-quoted qualified identifiers, but not quoted and qualified ones. For instance:

$ echo 'include table nsp.tbl' | pg_dump --filter - >/dev/null
$ echo $?
0
$ echo 'include table "TBL"' | pg_dump --filter - >/dev/null
$ echo $?
0
$ echo 'include table "NSP"."TBL"' | pg_dump --filter - >/dev/null
pg_dump: error: invalid format of filter on line 1: unexpected extra data after pattern

This should also be covered in the regression tests.

I'm wondering if psql's parse_identifier() could be exported and reused here rather than creating yet another version.

Nitpicking: the comments need some improvements:

+ /*
+  * Simple routines - just don't repeat same code
+  *
+  * Returns true, when filter's file is opened
+  */
+ bool
+ filter_init(FilterStateData *fstate, const char *filename)

also, is there any reason why this function doesn't call exit_nicely in case of error rather than letting each caller do it without any other cleanup?

+ /*
+  * Release allocated sources for filter
+  */
+ void
+ filter_free_sources(FilterStateData *fstate)

I'm assuming "resources" not "sources"?

+ /*
+  * log_format_error - Emit error message
+  *
+  * This is mostly a convenience routine to avoid duplicating file closing code
+  * in multiple callsites.
+  */
+ void
+ log_invalid_filter_format(FilterStateData *fstate, char *message)

mismatch between comment and function name (same for filter_read_item)

+ static const char *
+ filter_object_type_name(FilterObjectType fot)

No description.

/*
 * Helper routine to reduce duplicated code
 */
void
log_unsupported_filter_object_type(FilterStateData *fstate,
                                   const char *appname,
                                   FilterObjectType fot)

Need more helpful comment.
Hi
út 18. 10. 2022 v 11:33 odesílatel Julien Rouhaud <rjuju123@gmail.com> napsal:
On Thu, Oct 13, 2022 at 11:46:34AM -0700, Andres Freund wrote:
> Hi,
>
> On 2022-10-07 07:26:08 +0200, Pavel Stehule wrote:
> > I am sending version with handy written parser and meson support
>
> Given this is a new approach it seems inaccurate to have the CF entry marked
> ready-for-committer. I've updated it to needs-review.
I just had a quick look at the rest of the patch.
For the parser, it seems that filter_get_pattern is reimplementing an
identifier parsing function but isn't entirely correct. It can correctly parse
quoted non-qualified identifiers and non-quoted qualified identifiers, but not
quoted and qualified ones. For instance:
$ echo 'include table nsp.tbl' | pg_dump --filter - >/dev/null
$ echo $?
0
$ echo 'include table "TBL"' | pg_dump --filter - >/dev/null
$ echo $?
0
$ echo 'include table "NSP"."TBL"' | pg_dump --filter - >/dev/null
pg_dump: error: invalid format of filter on line 1: unexpected extra data after pattern
fixed
This should also be covered in the regression tests.
done
I'm wondering if psql's parse_identifier() could be exported and reused here
rather than creating yet another version.
I looked there, and I don't think this parser is usable for this purpose. It is very sensitive to white space and doesn't support multi-line input. It is designed to support readline tab completion; it is designed for simplicity, not correctness.
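What the filter parser has to handle instead looks roughly like this (a simplified sketch, not the code from the patch; PQExpBuffer is the usual one from pqexpbuffer.h):

#include <stdbool.h>

#include "pqexpbuffer.h"

/*
 * Sketch: copy one double-quoted identifier from *line into buf.
 * A quote inside the identifier is escaped by doubling it ("").
 */
static bool
copy_quoted_identifier(const char **line, PQExpBuffer buf)
{
    const char *ptr = *line;

    if (*ptr++ != '"')
        return false;

    for (;;)
    {
        if (*ptr == '\0')
            return false;       /* here the real code fetches another line */
        if (*ptr == '"')
        {
            if (ptr[1] == '"')
            {
                appendPQExpBufferChar(buf, '"');    /* escaped quote */
                ptr += 2;
                continue;
            }
            ptr++;              /* closing quote */
            break;
        }
        appendPQExpBufferChar(buf, *ptr++);
    }

    *line = ptr;
    return true;
}

The real code additionally reads more lines from the filter file when the closing quote has not been seen yet, which is how multi-line identifiers work.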
Nitpicking: the comments needs some improvements:
+ /*
+ * Simple routines - just don't repeat same code
+ *
+ * Returns true, when filter's file is opened
+ */
+ bool
+ filter_init(FilterStateData *fstate, const char *filename)
done
also, is there any reason why this function doesn't call exit_nicely in case of
error rather than letting each caller do it without any other cleanup?
It is commented a few lines up:
/*
* Following routines are called from pg_dump, pg_dumpall and pg_restore.
* Unfortunatelly, implementation of exit_nicely in pg_dump and pg_restore
* is different from implementation of this rutine in pg_dumpall. So instead
* direct calling exit_nicely we have to return some error flag (in this
* case NULL), and exit_nicelly will be executed from caller's routine.
*/
+ /*
+ * Release allocated sources for filter
+ */
+ void
+ filter_free_sources(FilterStateData *fstate)
I'm assuming "ressources" not "sources"?
changed
+ /*
+ * log_format_error - Emit error message
+ *
+ * This is mostly a convenience routine to avoid duplicating file closing code
+ * in multiple callsites.
+ */
+ void
+ log_invalid_filter_format(FilterStateData *fstate, char *message)
mismatch between comment and function name (same for filter_read_item)
fixed
+ static const char *
+ filter_object_type_name(FilterObjectType fot)
No description.
fixed
/*
* Helper routine to reduce duplicated code
*/
void
log_unsupported_filter_object_type(FilterStateData *fstate,
const char *appname,
FilterObjectType fot)
Need more helpful comment.
fixed
Thank you for comments
attached updated patch
Regards
Pavel
Attachment
Hi,

On Wed, Oct 26, 2022 at 06:26:26AM +0200, Pavel Stehule wrote:
>
> út 18. 10. 2022 v 11:33 odesílatel Julien Rouhaud <rjuju123@gmail.com>
> napsal:
>
> > I'm wondering if psql's parse_identifier() could be exported and reused
> > here rather than creating yet another version.
>
> I looked there, and I don't think this parser is usable for this purpose.
> It is very sensitive to white space and doesn't support multi-line input.
> It is designed to support readline tab completion; it is designed for
> simplicity, not correctness.

Ah, sorry I should have checked more thoroughly. I guess it's ok to have another identifier parser for the include file then, as this new one wouldn't really fit the tab-completion use case.

> > also, is there any reason why this function doesn't call exit_nicely in
> > case of error rather than letting each caller do it without any other
> > cleanup?
>
> It is commented a few lines up:
>
> /*
>  * Following routines are called from pg_dump, pg_dumpall and pg_restore.
>  * Unfortunatelly, implementation of exit_nicely in pg_dump and pg_restore
>  * is different from implementation of this rutine in pg_dumpall. So instead
>  * direct calling exit_nicely we have to return some error flag (in this
>  * case NULL), and exit_nicelly will be executed from caller's routine.
>  */

Oh right, I totally missed it, sorry about that!

About the new version, I didn't find any problem with the feature itself so it's a good thing!

I still have a few comments about the patch. First, about the behavior:

- is that ok to have just "data" pattern instead of "table_data" or something
  like that, since it's supposed to match the --exclude-table-data option?

- the error messages are sometimes not super helpful. For instance:

  $ echo "include data t1" | pg_dump --filter -
  pg_dump: error: invalid format of filter on line 1: include filter is not allowed for this type of object

  It would be nice if the error message mentioned "data" rather than a generic
  "this type of object". Also, maybe we should quote "include" to outline that
  we found this keyword?

About the patch itself:

filter.c:

+#include "postgres_fe.h"
+
+#include "filter.h"
+
+#include "common/logging.h"

the filter.h inclusion should be done with the rest of the includes, in alphabetical order.

+#define is_keyword_str(cstr, str, bytes) \
+	((strlen(cstr) == bytes) && (pg_strncasecmp(cstr, str, bytes) == 0))

nit: our guideline is to protect macro arguments with parentheses. Some arguments can be evaluated multiple times but I don't think it's worth adding a comment for that.

+ * Unfortunatelly, implementation of exit_nicely in pg_dump and pg_restore
+ * is different from implementation of this rutine in pg_dumpall. So instead
+ * direct calling exit_nicely we have to return some error flag (in this

typos: s/Unfortunatelly/Unfortunately/ and s/rutine/routine/
Also, it would probably be better to say "instead of directly calling..."

+static const char *
+filter_object_type_name(FilterObjectType fot)
+{
+	switch (fot)
+	{
+		case FILTER_OBJECT_TYPE_NONE:
+			return "comment or empty line";
+[...]
+	}
+
+	return "unknown object type";
+}

I'm wondering if we should add a pg_unreachable() there, some compilers might complain otherwise. See CreateDestReceiver() for instance for a similar pattern.

+ * Emit error message "invalid format of filter file ..."
+ *
+ * This is mostly a convenience routine to avoid duplicating file closing code
+ * in multiple callsites.
+ */
+void
+log_invalid_filter_format(FilterStateData *fstate, char *message)

nit: invalid format *in* filter file...?

+void
+log_unsupported_filter_object_type(FilterStateData *fstate,
+								   const char *appname,
+								   FilterObjectType fot)
+{
+	PQExpBuffer str = createPQExpBuffer();
+
+	printfPQExpBuffer(str,
+					  "The application \"%s\" doesn't support filter for object type \"%s\".",

nit: there shouldn't be uppercase in error messages, especially since this will be appended to another message by log_invalid_filter_format. I would just drop "The application" entirely for brevity.

+/*
+ * Release allocated resources for filter
+ */
+void
+filter_free(FilterStateData *fstate)

nit: Release allocated resources for *the given* filter?

+ * Search for keywords (limited to ascii alphabetic characters) in
+ * the passed in line buffer. Returns NULL, when the buffer is empty or first
+ * char is not alpha. The length of the found keyword is returned in the size
+ * parameter.
+ */
+static const char *
+filter_get_keyword(const char **line, int *size)
+{
+	[...]
+	if (isascii(*ptr) && isalpha(*ptr))
+	{
+		result = ptr++;
+
+		while (isascii(*ptr) && (isalpha(*ptr) || *ptr == '_'))
+			ptr++;

Is there any reason to test isascii()? isalpha() should already cover that and should be cheaper to test anyway.

Also nit: "Returns NULL when the buffer..." (unnecessary comma), and the '_' char is also allowed.

+filter_read_item(FilterStateData *fstate,
+				 bool *is_include,
+				 char **objname,
+				 FilterObjectType *objtype)
+{
+	Assert(!fstate->is_error);
+
+	if (pg_get_line_buf(fstate->fp, &fstate->linebuff))
+	{
+		char	   *str = fstate->linebuff.data;
+		const char *keyword;
+		int			size;
+
+		fstate->lineno++;
+
+		(void) pg_strip_crlf(str);
+
+		/* Skip initial white spaces */
+		while (isspace(*str))
+			str++;
+[...]
+		keyword = filter_get_keyword((const char **) &str, &size);

Is there any interest with the initial pg_strip_crlf? AFAICT all the rest of the code will ignore such characters using isspace() so it wouldn't change anything.

Dropping both pg_strip_crlf() would allow you to declare str as const rather than doing it in function calls. It would require adding const qualifiers in a few other places, but it seems like an improvement, as for instance right now filter_get_pattern is free to rewrite the str (because it's also calling pg_strip_crlf, but there's no guarantee that it doesn't do anything else).

+/*
+ * filter_get_pattern - Read an object identifier pattern from the buffer
+ *
+ * Parses an object identifier pattern from the passed in buffer and sets
+ * objname to a string with object identifier pattern. Returns pointer to the
+ * first character after the pattern. Returns NULL on error.
+ */
+static char *
+filter_get_pattern(FilterStateData *fstate,

nit: suggestion to reword the comment, maybe something like

/*
 * filter_get_pattern - Identify an object identifier pattern
 *
 * Try to parse an object identifier pattern from the passed buffer. If one is
 * found, it sets objname to a string with the object identifier pattern and
 * returns a pointer to the first byte after the found pattern. Otherwise NULL
 * is returned.
 */

+bool
+filter_read_item(FilterStateData *fstate,

Another suggestion for comment rewrite:

/*-------------------
 * filter_read_item - Read command/type/pattern triplet from a filter file
 *
 * This will parse one filter item from the filter file, and while it is a
 * row based format a pattern may span more than one line due to how object
 * names can be constructed. The expected format of the filter file is:
 *
 * <command> <object_type> <pattern>
 *
 * command can be "include" or "exclude"
 * object_type can be one of: "table", "schema", "foreign_data", "data",
 * "database", "function", "trigger" or "index"
 * pattern can be any possibly-quoted and possibly-qualified identifier. It
 * follows the same rules as other object include and exclude functions so it
 * can also use wildcards.
 *
 * Returns true when one filter item was successfully read and parsed. When
 * object name contains \n chars, then more than one line from input file can
 * be processed. Returns false when the filter file reaches EOF. In case of
 * error, the function will emit an appropriate error message before returning
 * false.
 */

Note also that your original comment said:

+ * In case of
+ * errors, the function wont return but will exit with an appropriate error
+ * message.

But AFAICS that's not the case: it will indeed log an appropriate error message but will return false. I'm assuming that the comment was outdated as the calling code handles it just fine, so I just modified the comment.

filter.h:

+#ifndef FILTER_H
+#define FILTER_H
+#include "c.h"

It's definitely not ok to include c.h in frontend code. But AFAICS just removing it doesn't cause any problem. Note also that there should be an empty line after the #define FILTER_H per usual coding style.
čt 3. 11. 2022 v 5:09 odesílatel Julien Rouhaud <rjuju123@gmail.com> napsal:
Hi,
On Wed, Oct 26, 2022 at 06:26:26AM +0200, Pavel Stehule wrote:
>
> út 18. 10. 2022 v 11:33 odesílatel Julien Rouhaud <rjuju123@gmail.com>
> napsal:
>
> >
> > I'm wondering if psql's parse_identifier() could be exported and reused
> > here
> > rather than creating yet another version.
> >
>
> I looked there, and I don't think this parser is usable for this purpose.
> It is very sensitive to white space and doesn't support multi-line input.
> It is designed to support readline tab completion; it is designed for
> simplicity, not correctness.
Ah, sorry I should have checked more thoroughly. I guess it's ok to have
another identifier parser for the include file then, as this new one wouldn't
really fit the tab-completion use case.
> > also, is there any reason why this function doesn't call exit_nicely in
> > case of
> > error rather than letting each caller do it without any other cleanup?
> >
>
> It is commented few lines up
>
> /*
> * Following routines are called from pg_dump, pg_dumpall and pg_restore.
> * Unfortunatelly, implementation of exit_nicely in pg_dump and pg_restore
> * is different from implementation of this rutine in pg_dumpall. So instead
> * direct calling exit_nicely we have to return some error flag (in this
> * case NULL), and exit_nicelly will be executed from caller's routine.
> */
Oh right, I totally missed it sorry about that!
About the new version, I didn't find any problem with the feature itself so
it's a good thing!
I still have a few comments about the patch. First, about the behavior:
- is that ok to have just "data" pattern instead of "table_data" or something
like that, since it's supposed to match --exclude-table-data option?
done
- the error messages are sometimes not super helpful. For instance:
$ echo "include data t1" | pg_dump --filter -
pg_dump: error: invalid format of filter on line 1: include filter is not allowed for this type of object
It would be nice if the error message mentioned "data" rather than a generic
"this type of object". Also, maybe we should quote "include" to outline that
we found this keyword?
done
About the patch itself:
filter.c:
+#include "postgres_fe.h"
+
+#include "filter.h"
+
+#include "common/logging.h"
the filter.h inclusion should be done with the rest of the includes, in
alphabetical order.
done
+#define is_keyword_str(cstr, str, bytes) \
+ ((strlen(cstr) == bytes) && (pg_strncasecmp(cstr, str, bytes) == 0))
nit: our guideline is to protect macro arguments with parentheses. Some
arguments can be evaluated multiple times but I don't think it's worth adding a
comment for that.
done
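i.e. the macro now reads (a sketch of the parenthesized version):

#define is_keyword_str(cstr, str, bytes) \
	((strlen((cstr)) == (bytes)) && (pg_strncasecmp((cstr), (str), (bytes)) == 0))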
+ * Unfortunatelly, implementation of exit_nicely in pg_dump and pg_restore
+ * is different from implementation of this rutine in pg_dumpall. So instead
+ * direct calling exit_nicely we have to return some error flag (in this
typos: s/Unfortunatelly/Unfortunately/ and s/rutine/routine/
Also, it would probably be better to say "instead of directly calling..."
done
+static const char *
+filter_object_type_name(FilterObjectType fot)
+{
+ switch (fot)
+ {
+ case FILTER_OBJECT_TYPE_NONE:
+ return "comment or empty line";
+[...]
+ }
+
+ return "unknown object type";
+}
I'm wondering if we should add a pg_unreachable() there, some compilers might
complain otherwise. See CreateDestReceiver() for instance for a similar pattern.
done
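It ends up following the CreateDestReceiver() pattern, roughly (a sketch, only the first case shown):

static const char *
filter_object_type_name(FilterObjectType fot)
{
	switch (fot)
	{
		case FILTER_OBJECT_TYPE_NONE:
			return "comment or empty line";
			/* ... one case per supported object type ... */
	}

	/* should never get here */
	pg_unreachable();
}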
+ * Emit error message "invalid format of filter file ..."
+ *
+ * This is mostly a convenience routine to avoid duplicating file closing code
+ * in multiple callsites.
+ */
+void
+log_invalid_filter_format(FilterStateData *fstate, char *message)
nit: invalid format *in* filter file...?
changed
+void
+log_unsupported_filter_object_type(FilterStateData *fstate,
+ const char *appname,
+ FilterObjectType fot)
+{
+ PQExpBuffer str = createPQExpBuffer();
+
+ printfPQExpBuffer(str,
+ "The application \"%s\" doesn't support filter for object type \"%s\".",
nit: there shouldn't be uppercase in error messages, especially since this will
be appended to another message by log_invalid_filter_format. I would just
drop "The application" entirely for brevity.
changed
+/*
+ * Release allocated resources for filter
+ */
+void
+filter_free(FilterStateData *fstate)
nit: Release allocated resources for *the given* filter?
changed
+ * Search for keywords (limited to ascii alphabetic characters) in
+ * the passed in line buffer. Returns NULL, when the buffer is empty or first
+ * char is not alpha. The length of the found keyword is returned in the size
+ * parameter.
+ */
+static const char *
+filter_get_keyword(const char **line, int *size)
+{
+ [...]
+ if (isascii(*ptr) && isalpha(*ptr))
+ {
+ result = ptr++;
+
+ while (isascii(*ptr) && (isalpha(*ptr) || *ptr == '_'))
+ ptr++;
Is there any reason to test isascii()? isalpha() should already cover that and
should be cheaper to test anyway.
changed. I wanted to limit keyword characters to basic ASCII letters, but the benefit is probably not strong and the real effect can be messy, so I removed the isascii test.
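The scanning loop is then simply (a sketch; the unsigned char cast is the usual requirement of the <ctype.h> functions):

	const char *result = NULL;

	if (isalpha((unsigned char) *ptr))
	{
		result = ptr++;

		while (isalpha((unsigned char) *ptr) || *ptr == '_')
			ptr++;
	}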
Also nit: "Returns NULL when the buffer..." (unnecessary comma), and the '_'
char is also allowed.
done
+filter_read_item(FilterStateData *fstate,
+ bool *is_include,
+ char **objname,
+ FilterObjectType *objtype)
+{
+ Assert(!fstate->is_error);
+
+ if (pg_get_line_buf(fstate->fp, &fstate->linebuff))
+ {
+ char *str = fstate->linebuff.data;
+ const char *keyword;
+ int size;
+
+ fstate->lineno++;
+
+ (void) pg_strip_crlf(str);
+
+ /* Skip initial white spaces */
+ while (isspace(*str))
+ str++;
+[...]
+ keyword = filter_get_keyword((const char **) &str, &size);
Is there any interest with the initial pg_strip_crlf? AFAICT all the rest of
the code will ignore such characters using isspace() so it wouldn't change
anything.
I think reading multi-line identifiers is a little bit easier this way, because I don't need to check for trailing \n and \r.
When I read multi-line identifiers, I cannot ignore white spaces.
Dropping both pg_strip_crlf() would allow you to declare str as const rather
than doing it in function calls. It would require to add const qualifiers in a
few other places, but it seems like an improvement, as for instance right now
filter_get_pattern is free to rewrite the str (because it's also calling
pg_strip_crlf, but there's no guarantee that it doesn't do anything else).
+/*
+ * filter_get_pattern - Read an object identifier pattern from the buffer
+ *
+ * Parses an object identifier pattern from the passed in buffer and sets
+ * objname to a string with object identifier pattern. Returns pointer to the
+ * first character after the pattern. Returns NULL on error.
+ */
+static char *
+filter_get_pattern(FilterStateData *fstate,
nit: suggestion to reword the comment, maybe something like
/*
* filter_get_pattern - Identify an object identifier pattern
*
* Try to parse an object identifier pattern from the passed buffer. If one is
* found, it sets objname to a string with the object identifier pattern and
* returns a pointer to the first byte after the found pattern. Otherwise NULL
* is returned.
*/
replaced
+bool
+filter_read_item(FilterStateData *fstate,
Another suggestion for comment rewrite:
/*-------------------
* filter_read_item - Read command/type/pattern triplet from a filter file
*
* This will parse one filter item from the filter file, and while it is a
* row based format a pattern may span more than one line due to how object
* names can be constructed. The expected format of the filter file is:
*
* <command> <object_type> <pattern>
*
* command can be "include" or "exclude"
* object_type can be one of: "table", "schema", "foreign_data", "data",
* "database", "function", "trigger" or "index"
* pattern can be any possibly-quoted and possibly-qualified identifier. It
* follows the same rules as other object include and exclude functions so it
* can also use wildcards.
*
* Returns true when one filter item was successfully read and parsed. When
* object name contains \n chars, then more than one line from input file can
* be processed. Returns false when the filter file reaches EOF. In case of
* error, the function will emit an appropriate error message before returning
* false.
*/
replaced, thank you for the text
Note also that your original comment said:
+ * In case of
+ * errors, the function wont return but will exit with an appropriate error
+ * message.
But AFAICS that's not the case: it will indeed log an appropriate error message
but will return false. I'm assuming that the comment was outdated as the
calling code handles it just fine, so I just modified the comment.
yes
filter.h:
+#ifndef FILTER_H
+#define FILTER_H
+#include "c.h"
It's definitely not ok to include .ch in frontend code. But AFAICS just
removing it doesn't cause any problem. Note also that there should be an empty
line after the #define FILTER_H per usual coding style.
fixed - it looks like it was some garbage
updated patch attached
big thanks for these comments and tips
Regards
Pavel
Attachment
On Thu, Nov 03, 2022 at 10:22:15PM +0100, Pavel Stehule wrote:
> čt 3. 11. 2022 v 5:09 odesílatel Julien Rouhaud <rjuju123@gmail.com> napsal:
> >
> > Is there any interest with the initial pg_strip_crlf? AFAICT all the rest
> > of the code will ignore such characters using isspace() so it wouldn't
> > change anything.
>
> I think reading multi-line identifiers is a little bit easier this way,
> because I don't need to check for trailing \n and \r.
> When I read multi-line identifiers, I cannot ignore white spaces.

Ok. I don't have a strong objection to it.

> updated patch attached
>
> big thanks for these comments and tips

Thanks for the updated patch! As far as I'm concerned the patch is in good shape, passes the CI and I don't have anything more to say, so I'm marking it as Ready for Committer!
On Fri, Nov 04, 2022 at 07:59:01PM +0800, Julien Rouhaud wrote:
> On Thu, Nov 03, 2022 at 10:22:15PM +0100, Pavel Stehule wrote:
> > čt 3. 11. 2022 v 5:09 odesílatel Julien Rouhaud <rjuju123@gmail.com> napsal:
> > updated patch attached
> >
> > big thanks for these comments and tips
>
> Thanks for the updated patch! As far as I'm concerned the patch is in good
> shape, passes the CI and I don't have anything more to say, so I'm marking it
> as Ready for Committer!

+1

I started looking to see if it's possible to simplify the patch at all, but nothing to show yet.

But one thing I noticed is that "optarg" looks wrong here:

	simple_string_list_append(&opts->triggerNames, optarg);

--
Justin
On Fri, Nov 04, 2022 at 07:19:27AM -0500, Justin Pryzby wrote: > On Fri, Nov 04, 2022 at 07:59:01PM +0800, Julien Rouhaud wrote: > > On Thu, Nov 03, 2022 at 10:22:15PM +0100, Pavel Stehule wrote: > > > čt 3. 11. 2022 v 5:09 odesílatel Julien Rouhaud <rjuju123@gmail.com> napsal: > > > updated patch attached > > > > > > big thanks for these comments and tips > > > > Thanks for the updated patch! As far as I'm concerned the patch is in a good > > shape, passes the CI and I don't have anything more to say so I'm marking it as > > Ready for Committer! > > +1 > > I started looking to see if it's possible to simplify the patch at all, > but nothing to show yet. > > But one thing I noticed is that "optarg" looks wrong here: > > simple_string_list_append(&opts->triggerNames, optarg); Ah indeed, good catch! Maybe there should be an explicit test for every (include|exclude) / objtype combination? It would be a bit verbose (and possibly hard to maintain).
pá 4. 11. 2022 v 14:28 odesílatel Julien Rouhaud <rjuju123@gmail.com> napsal:
On Fri, Nov 04, 2022 at 07:19:27AM -0500, Justin Pryzby wrote:
> On Fri, Nov 04, 2022 at 07:59:01PM +0800, Julien Rouhaud wrote:
> > On Thu, Nov 03, 2022 at 10:22:15PM +0100, Pavel Stehule wrote:
> > > čt 3. 11. 2022 v 5:09 odesílatel Julien Rouhaud <rjuju123@gmail.com> napsal:
> > > updated patch attached
> > >
> > > big thanks for these comments and tips
> >
> > Thanks for the updated patch! As far as I'm concerned the patch is in a good
> > shape, passes the CI and I don't have anything more to say so I'm marking it as
> > Ready for Committer!
>
> +1
>
> I started looking to see if it's possible to simplify the patch at all,
> but nothing to show yet.
>
> But one thing I noticed is that "optarg" looks wrong here:
>
> simple_string_list_append(&opts->triggerNames, optarg);
Ah indeed, good catch! Maybe there should be an explicit test for every
(include|exclude) / objtype combination? It would be a bit verbose (and
possibly hard to maintain).
I'll do it
On Fri, Nov 04, 2022 at 02:37:05PM +0100, Pavel Stehule wrote: > pá 4. 11. 2022 v 14:28 odesílatel Julien Rouhaud <rjuju123@gmail.com> > napsal: > > > > > > > But one thing I noticed is that "optarg" looks wrong here: > > > > > > simple_string_list_append(&opts->triggerNames, optarg); > > > > Ah indeed, good catch! Maybe there should be an explicit test for every > > (include|exclude) / objtype combination? It would be a bit verbose (and > > possibly hard to maintain). > > > > I'll do it Thanks a lot Pavel! I switched the CF entry back to "Waiting on Author" in the meantime.
Hi
pá 4. 11. 2022 v 14:28 odesílatel Julien Rouhaud <rjuju123@gmail.com> napsal:
On Fri, Nov 04, 2022 at 07:19:27AM -0500, Justin Pryzby wrote:
> On Fri, Nov 04, 2022 at 07:59:01PM +0800, Julien Rouhaud wrote:
> > On Thu, Nov 03, 2022 at 10:22:15PM +0100, Pavel Stehule wrote:
> > > čt 3. 11. 2022 v 5:09 odesílatel Julien Rouhaud <rjuju123@gmail.com> napsal:
> > > updated patch attached
> > >
> > > big thanks for these comments and tips
> >
> > Thanks for the updated patch! As far as I'm concerned the patch is in a good
> > shape, passes the CI and I don't have anything more to say so I'm marking it as
> > Ready for Committer!
>
> +1
>
> I started looking to see if it's possible to simplify the patch at all,
> but nothing to show yet.
>
> But one thing I noticed is that "optarg" looks wrong here:
>
> simple_string_list_append(&opts->triggerNames, optarg);
Ah indeed, good catch! Maybe there should be an explicit test for every
(include|exclude) / objtype combination? It would be a bit verbose (and
possibly hard to maintain).
yes - pg_restore is not well covered by tests, fixed
I found another issue. pg_restore requires the full signature of the function and is pretty sensitive to white space. I made a mistake when I partially parsed patterns like SQL identifiers. It can work for simple cases, but when I parse the function's signature it stops working. So I rewrote the pattern parsing part. Now, I just read the input string and try to reduce spaces. Multiline identifiers are still supported. Compared to the previous method of pattern parsing, I needed to change just one regression test - now I am not able to detect garbage after a pattern :-/. It is possible to enter types like "double precision" or "timestamp with time zone" without needing to check them on the server side.
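The space reduction itself is simple; per input character it is roughly this (a sketch, not the exact patch code, and quoted identifiers are of course copied verbatim, which this sketch leaves out): spaces around '.', '(' and ')' are dropped, a comma always comes out as ", ", and other white-space runs collapse to a single space.

#include <ctype.h>
#include <stdbool.h>

#include "pqexpbuffer.h"

/*
 * Sketch: append one input character to the normalized pattern, so that
 * e.g. "foo( int ,bar.baz )" comes out as "foo(int, bar.baz)".
 */
static void
append_reduced_char(PQExpBuffer pattern, char c, bool *pending_space)
{
    if (isspace((unsigned char) c))
    {
        *pending_space = true;
        return;
    }

    if (*pending_space && pattern->len > 0)
    {
        char        prev = pattern->data[pattern->len - 1];

        /* drop spaces next to '.', '(' and ')', never double a space */
        if (prev != '.' && prev != '(' && prev != ' ' &&
            c != '.' && c != '(' && c != ')' && c != ',')
            appendPQExpBufferChar(pattern, ' ');
    }
    *pending_space = false;

    appendPQExpBufferChar(pattern, c);
    if (c == ',')
        appendPQExpBufferChar(pattern, ' ');    /* canonical ", " separator */
}

So e.g. "myfunc( int ,  text )" is stored as "myfunc(int, text)", which is the form pg_restore's matching code expects.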
While writing the regression tests I found some issues with pg_restore's filtering options (not related to this patch):
* function filtering doesn't support schemas - when the name of a function is specified with a schema, the function is not found
* a function has to be specified with an argument type list - the separator has to be exactly the string ", ". Without the space, or with one space more, the filtering doesn't work (the new implementation of pattern parsing reduces this white-space sensitivity). This is not a bug, but it is not well documented.
* trigger filtering is probably broken (on the pg_restore side). The name should be entered in the form "tablename triggername".
Regards
Pavel
Attachment
Hi,

On Sat, Nov 05, 2022 at 08:54:57PM +0100, Pavel Stehule wrote:
>
> pá 4. 11. 2022 v 14:28 odesílatel Julien Rouhaud <rjuju123@gmail.com>
> napsal:
>
> > > But one thing I noticed is that "optarg" looks wrong here:
> > >
> > > simple_string_list_append(&opts->triggerNames, optarg);
> >
> > Ah indeed, good catch! Maybe there should be an explicit test for every
> > (include|exclude) / objtype combination? It would be a bit verbose (and
> > possibly hard to maintain).
>
> yes - pg_restore is not well covered by tests, fixed

Argh, indeed. It's a good thing to have expanded the regression tests :)

> I found another issue. pg_restore requires the full signature of the
> function and is pretty sensitive to white space. I made a
> mistake when I partially parsed patterns like SQL identifiers. It can work
> for simple cases, but when I parse the function's signature it stops
> working. So I rewrote the pattern parsing part. Now, I just read the input
> string and try to reduce spaces. Multiline identifiers are still
> supported. Compared to the previous method of pattern parsing, I needed to
> change just one regression test - now I am not able to detect garbage after
> a pattern :-/.

I'm not sure it's really problematic. It looks POLA-violation compatible with regular pg_dump options, for instance:

$ echo "include table t1()" | pg_dump --filter - | ag CREATE
CREATE TABLE public.t1 (
$ pg_dump -t "t1()" | ag CREATE
CREATE TABLE public.t1 (
$ echo "include table t1()blabla" | pg_dump --filter - | ag CREATE
pg_dump: error: no matching tables were found
$ pg_dump -t "t1()blabla" | ag CREATE
pg_dump: error: no matching tables were found

I don't think the file parsing code should try to be smart about checking the found patterns.

> * function filtering doesn't support schemas - when the name of a function
> is specified with a schema, the function is not found

Ah I didn't know that. Indeed it only expects a non-qualified identifier, and would restore any function that matches the name (and arguments), possibly multiple ones if there are variants in different schemas. That's unrelated to this patch though.

> * a function has to be specified with an argument type list - the
> separator has to be exactly the string ", ". Without the space, or with one
> space more, the filtering doesn't work (the new implementation of pattern
> parsing reduces this white-space sensitivity). This is not a bug, but it is
> not well documented.

Agreed.

> attached updated patch

It looks overall good to me! I just have a few minor nitpicking complaints:

- you removed the pg_strip_crlf() calls and declared everything as "const char *", so there's no need to explicitly cast the filter_get_keyword() arguments anymore

Note also that the code now relies on the fact that there are some non-zero bytes after a pattern to know that no errors happened. It's not a problem as you should find an EOF marker anyway if CRLF were stripped.

+ * Following routines are called from pg_dump, pg_dumpall and pg_restore.
+ * Unfortunately, implementation of exit_nicely in pg_dump and pg_restore
+ * is different from implementation of this routine in pg_dumpall. So instead
+ * of directly calling exit_nicely we have to return some error flag (in this
+ * case NULL), and exit_nicelly will be executed from caller's routine.

Slight improvement:

[...] Unfortunately, the implementation of exit_nicely in pg_dump and pg_restore is different from the one in pg_dumpall, so instead of...

+ * read_pattern - reads an pattern from input. The pattern can be mix of
+ * single line or multi line subpatterns. Single line subpattern starts first
+ * non white space char, and ending last non space char on line or by char
+ * '#'. The white spaces inside are removed (around char ".()"), or reformated
+ * around char ',' or reduced (the multiple spaces are replaced by one).
+ * Multiline subpattern starts by double quote and ending by this char too.
+ * The escape rules are same like for SQL quoted literal.
+ *
+ * Routine signalizes error by returning NULL. Otherwise returns pointer
+ * to next char after last processed char in input string.

typo: reads "a" pattern from input...

I don't fully understand the part about subpatterns, but is that necessary to describe it? Simply saying that any valid and possibly-quoted identifier can be parsed should make it clear that identifiers containing \n characters should work too. Maybe also just mention that whitespaces are removed and special care is taken to output routines in exactly the same way calling code will expect it (that is, a comma-and-single-space type delimiter).
On Fri, Nov 11, 2022 at 9:11 Julien Rouhaud <rjuju123@gmail.com> wrote:
Hi,
On Sat, Nov 05, 2022 at 08:54:57PM +0100, Pavel Stehule wrote:
>
> pá 4. 11. 2022 v 14:28 odesílatel Julien Rouhaud <rjuju123@gmail.com>
> napsal:
>
> > > But one thing I noticed is that "optarg" looks wrong here:
> > >
> > > simple_string_list_append(&opts->triggerNames, optarg);
> >
> > Ah indeed, good catch! Maybe there should be an explicit test for every
> > (include|exclude) / objtype combination? It would be a bit verbose (and
> > possibly hard to maintain).
> >
>
> yes - pg_restore is not well covered by tests, fixed
>
> I found another issue. The pg_restore requires a full signature of the
> function and it is pretty sensitive on white spaces (pg_restore).
Argh, indeed. It's a good thing to have expanded the regression tests :)
> I made a
> mistake when I partially parsed patterns like SQL identifiers. It can work
> for simple cases, but when I parse the function's signature it stops
> working. So I rewrote the parsing pattern part. Now, I just read an input
> string and I try to reduce spaces. Still multiline identifiers are
> supported. Against the previous method of pattern parsing, I needed to
> change just one regress test - now I am not able to detect garbage after
> pattern :-/.
I'm not sure it's really problematic. It looks POLA-violation compatible with
regular pg_dump options, for instance:
$ echo "include table t1()" | pg_dump --filter - | ag CREATE
CREATE TABLE public.t1 (
$ pg_dump -t "t1()" | ag CREATE
CREATE TABLE public.t1 (
$ echo "include table t1()blabla" | pg_dump --filter - | ag CREATE
pg_dump: error: no matching tables were found
$ pg_dump -t "t1()blabla" | ag CREATE
pg_dump: error: no matching tables were found
I don't think the file parsing code should try to be smart about checking the
found patterns.
> * function's filtering doesn't support schema - when the name of function
> is specified with schema, then the function is not found
Ah, I didn't know that. Indeed, it only expects a non-qualified identifier, and
would restore any function that matches the name (and arguments), possibly
multiple ones if there are variants in different schemas. That's unrelated to
this patch though.
> * the function has to be specified with an argument type list - the
> separator has to be exactly ", " string. Without space or with one space
> more, the filtering doesn't work (new implementation of pattern parsing
> reduces white spaces sensitivity). This is not a bug, but it is not well
> documented.
Agreed.
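(Editorial aside: the canonical spelling pg_dump and pg_restore expect for a
routine signature uses a comma and a single space between argument types. A
hypothetical example of a matching filter line, assuming a function
foo(integer, text) exists:

include function foo(integer, text)

The rework described above normalizes whitespace in the filter file toward
that canonical form, so small deviations in the file no longer break the
match.)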
> attached updated patch
It looks overall good to me! I just have a few minor nitpicking complaints:
- you removed the pg_strip_clrf() calls and declared everything as "const char
*", so there's no need to explicitly cast the filter_get_keyword() arguments
anymore
removed
Note also that the code now relies on the fact that there are some non-zero
bytes after a pattern to know that no errors happened. It's not a problem as
you should find an EOF marker anyway if CLRF were stripped.
I am not sure if I understand this note well?
+ * Following routines are called from pg_dump, pg_dumpall and pg_restore.
+ * Unfortunately, implementation of exit_nicely in pg_dump and pg_restore
+ * is different from implementation of this routine in pg_dumpall. So instead
+ * of directly calling exit_nicely we have to return some error flag (in this
+ * case NULL), and exit_nicelly will be executed from caller's routine.
Slight improvement:
[...]
Unfortunately, the implementation of exit_nicely in pg_dump and pg_restore is
different from the one in pg_dumpall, so instead of...
+ * read_pattern - reads an pattern from input. The pattern can be mix of
+ * single line or multi line subpatterns. Single line subpattern starts first
+ * non white space char, and ending last non space char on line or by char
+ * '#'. The white spaces inside are removed (around char ".()"), or reformated
+ * around char ',' or reduced (the multiple spaces are replaced by one).
+ * Multiline subpattern starts by double quote and ending by this char too.
+ * The escape rules are same like for SQL quoted literal.
+ *
+ * Routine signalizes error by returning NULL. Otherwise returns pointer
+ * to next char after last processed char in input string.
typo: reads "a" pattern from input...
fixed
I don't fully understand the part about subpatterns, but is that necessary to
describe it? Simply saying that any valid and possibly-quoted identifier can
be parsed should make it clear that identifiers containing \n characters should
work too. Maybe also just mention that whitespaces are removed and special
care is taken to output routines in exactly the same way calling code will
expect it (that is comma-and-single-space type delimiter).
In this case I hit the limits of my English language skills.
I rewrote this comment, but it needs more care. Please, can you look at it?
Hi
and updated patch
Regards
Pavel
Attachment
On Sat, Nov 12, 2022 at 09:35:59PM +0100, Pavel Stehule wrote:

Thanks for the updated patch. Apart from the function comment it looks good
to me.

Justin, did you have any other comment on the patch?

> > I don't fully understand the part about subpatterns, but is that
> > necessary to describe it? Simply saying that any valid and
> > possibly-quoted identifier can be parsed should make it clear that
> > identifiers containing \n characters should work too. Maybe also just
> > mention that whitespaces are removed and special care is taken to
> > output routines in exactly the same way calling code will expect it
> > (that is comma-and-single-space type delimiter).
>
> In this case I hit the limits of my English language skills.
>
> I rewrote this comment, but it needs more care. Please, can you look at it?

I'm also not a native English speaker so I'm far from writing perfect
comments myself :)

Maybe something like

/*
 * read_pattern - reads an object pattern from input
 *
 * This function will parse any valid identifier (quoted or not, qualified or
 * not), which can also include the full signature for routines.
 * Note that this function takes special care to sanitize the detected
 * identifier (removing extraneous whitespaces or other unnecessary
 * characters). This is necessary as most backup/restore filtering functions
 * only recognize identifiers if they are written exactly the way they are
 * regenerated.
 * Returns a pointer to the next character after the found identifier, or
 * NULL on error.
 */
On Sun, Nov 13, 2022 at 9:58 Julien Rouhaud <rjuju123@gmail.com> wrote:
On Sat, Nov 12, 2022 at 09:35:59PM +0100, Pavel Stehule wrote:
Thanks for the updated patch. Apart from the function comment it looks good to
me.
Justin, did you have any other comment on the patch?
> > I don't fully understand the part about subpatterns, but is that necessary
> > to
> > describe it? Simply saying that any valid and possibly-quoted identifier
> > can
> > be parsed should make it clear that identifiers containing \n characters
> > should
> > work too. Maybe also just mention that whitespaces are removed and special
> > care is taken to output routines in exactly the same way calling code will
> > expect it (that is comma-and-single-space type delimiter).
> >
>
> In this case I hit the limits of my English language skills.
>
> I rewrote this comment, but it needs more care. Please, can you look at it?
I'm also not a native English speaker so I'm far from writing perfect comments
myself :)
far better than mine :)
Thank you very much
updated patch attached
Regards
Pavel
Maybe something like
/*
* read_pattern - reads an object pattern from input
*
* This function will parse any valid identifier (quoted or not, qualified or
* not), which can also include the full signature for routines.
* Note that this function takes special care to sanitize the detected
* identifier (removing extraneous whitespaces or other unnecessary
* characters). This is necessary as most backup/restore filtering functions
* only recognize identifiers if they are written exactly the way they are
* regenerated.
* Returns a pointer to the next character after the found identifier, or NULL on
* error.
*/
Attachment
Hi,

On Sun, Nov 13, 2022 at 08:32:47PM +0100, Pavel Stehule wrote:
>
> updated patch attached

Thanks!

Some enhancement could probably be done by a native english speaker, but
apart from that it looks good to me, so hearing no other complaints I'm
marking the CF entry as Ready for Committer!
On Tue, Nov 22, 2022 at 6:26 Julien Rouhaud <rjuju123@gmail.com> wrote:
Hi,
On Sun, Nov 13, 2022 at 08:32:47PM +0100, Pavel Stehule wrote:
>
> updated patch attached
Thanks!
Some enhancement could probably be done by a native english speaker, but apart
from that it looks good to me, so hearing no other complaints I'm marking the
CF entry as Ready for Committer!
Thank you very much for check and help
Regards
Pavel
Hi,

On 2022-11-13 20:32:47 +0100, Pavel Stehule wrote:
> updated patch attached

It fails with address sanitizer that's now part of CI:
https://cirrus-ci.com/task/6031397744279552?logs=test_world#L2659

[06:33:11.271] # ==31965==ERROR: AddressSanitizer: heap-buffer-overflow on address 0x619000000480 at pc 0x559f1ac40822 bp 0x7ffea83e1ad0 sp 0x7ffea83e1ac8
[06:33:11.271] # READ of size 1 at 0x619000000480 thread T0
[06:33:11.271] #     #0 0x559f1ac40821 in read_pattern /tmp/cirrus-ci-build/src/bin/pg_dump/filter.c:302
[06:33:11.271] #     #1 0x559f1ac40e4d in filter_read_item /tmp/cirrus-ci-build/src/bin/pg_dump/filter.c:459
[06:33:11.271] #     #2 0x559f1abe6fa5 in read_dump_filters /tmp/cirrus-ci-build/src/bin/pg_dump/pg_dump.c:18229
[06:33:11.271] #     #3 0x559f1ac2bb1b in main /tmp/cirrus-ci-build/src/bin/pg_dump/pg_dump.c:630
[06:33:11.271] #     #4 0x7fd91fabfd09 in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x23d09)
[06:33:11.271] #     #5 0x559f1abe5d29 in _start (/tmp/cirrus-ci-build/tmp_install/usr/local/pgsql/bin/pg_dump+0x39d29)
[06:33:11.271] #
[06:33:11.271] # 0x619000000480 is located 0 bytes to the right of 1024-byte region [0x619000000080,0x619000000480)
[06:33:11.271] # allocated by thread T0 here:
[06:33:11.271] #     #0 0x7fd91fe14e8f in __interceptor_malloc ../../../../src/libsanitizer/asan/asan_malloc_linux.cpp:145
[06:33:11.271] #     #1 0x559f1ac69f35 in pg_malloc_internal /tmp/cirrus-ci-build/src/common/fe_memutils.c:30
[06:33:11.271] #     #2 0x559f1ac69f35 in palloc /tmp/cirrus-ci-build/src/common/fe_memutils.c:117
[06:33:11.271] #
[06:33:11.271] # SUMMARY: AddressSanitizer: heap-buffer-overflow /tmp/cirrus-ci-build/src/bin/pg_dump/filter.c:302 in read_pattern

Greetings,

Andres Freund
On Tue, Nov 22, 2022 at 8:39 Andres Freund <andres@anarazel.de> wrote:
Hi,
On 2022-11-13 20:32:47 +0100, Pavel Stehule wrote:
> updated patch attached
It fails with address sanitizer that's now part of CI:
https://cirrus-ci.com/task/6031397744279552?logs=test_world#L2659
[06:33:11.271] # ==31965==ERROR: AddressSanitizer: heap-buffer-overflow on address 0x619000000480 at pc 0x559f1ac40822 bp 0x7ffea83e1ad0 sp 0x7ffea83e1ac8
[06:33:11.271] # READ of size 1 at 0x619000000480 thread T0
[06:33:11.271] # #0 0x559f1ac40821 in read_pattern /tmp/cirrus-ci-build/src/bin/pg_dump/filter.c:302
[06:33:11.271] # #1 0x559f1ac40e4d in filter_read_item /tmp/cirrus-ci-build/src/bin/pg_dump/filter.c:459
[06:33:11.271] # #2 0x559f1abe6fa5 in read_dump_filters /tmp/cirrus-ci-build/src/bin/pg_dump/pg_dump.c:18229
[06:33:11.271] # #3 0x559f1ac2bb1b in main /tmp/cirrus-ci-build/src/bin/pg_dump/pg_dump.c:630
[06:33:11.271] # #4 0x7fd91fabfd09 in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x23d09)
[06:33:11.271] # #5 0x559f1abe5d29 in _start (/tmp/cirrus-ci-build/tmp_install/usr/local/pgsql/bin/pg_dump+0x39d29)
[06:33:11.271] #
[06:33:11.271] # 0x619000000480 is located 0 bytes to the right of 1024-byte region [0x619000000080,0x619000000480)
[06:33:11.271] # allocated by thread T0 here:
[06:33:11.271] # #0 0x7fd91fe14e8f in __interceptor_malloc ../../../../src/libsanitizer/asan/asan_malloc_linux.cpp:145
[06:33:11.271] # #1 0x559f1ac69f35 in pg_malloc_internal /tmp/cirrus-ci-build/src/common/fe_memutils.c:30
[06:33:11.271] # #2 0x559f1ac69f35 in palloc /tmp/cirrus-ci-build/src/common/fe_memutils.c:117
[06:33:11.271] #
[06:33:11.271] # SUMMARY: AddressSanitizer: heap-buffer-overflow /tmp/cirrus-ci-build/src/bin/pg_dump/filter.c:302 in read_pattern
I'll check it
Greetings,
Andres Freund
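(Editorial aside: the report above shows a read of one byte past the end of a
1024-byte buffer inside read_pattern(). The defensive idiom for this class of
bug is to test for the terminator before advancing the scan pointer. A minimal
sketch in the same spirit -- hypothetical names, not the actual filter.c code:

#include <ctype.h>

/*
 * Advance over the current token, stopping at end-of-string or whitespace.
 * Checking *str against '\0' first guarantees the loop never reads past
 * the terminating NUL, no matter what the caller hands us.
 */
static const char *
skip_token(const char *str)
{
	while (*str != '\0' && !isspace((unsigned char) *str))
		str++;

	return str;			/* points at the NUL or the first whitespace */
}
)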
Hi
On Tue, Nov 22, 2022 at 8:39 Andres Freund <andres@anarazel.de> wrote:
Hi,
On 2022-11-13 20:32:47 +0100, Pavel Stehule wrote:
> updated patch attached
It fails with address sanitizer that's now part of CI:
https://cirrus-ci.com/task/6031397744279552?logs=test_world#L2659
[06:33:11.271] # ==31965==ERROR: AddressSanitizer: heap-buffer-overflow on address 0x619000000480 at pc 0x559f1ac40822 bp 0x7ffea83e1ad0 sp 0x7ffea83e1ac8
[06:33:11.271] # READ of size 1 at 0x619000000480 thread T0
[06:33:11.271] # #0 0x559f1ac40821 in read_pattern /tmp/cirrus-ci-build/src/bin/pg_dump/filter.c:302
[06:33:11.271] # #1 0x559f1ac40e4d in filter_read_item /tmp/cirrus-ci-build/src/bin/pg_dump/filter.c:459
[06:33:11.271] # #2 0x559f1abe6fa5 in read_dump_filters /tmp/cirrus-ci-build/src/bin/pg_dump/pg_dump.c:18229
[06:33:11.271] # #3 0x559f1ac2bb1b in main /tmp/cirrus-ci-build/src/bin/pg_dump/pg_dump.c:630
[06:33:11.271] # #4 0x7fd91fabfd09 in __libc_start_main (/lib/x86_64-linux-gnu/libc.so.6+0x23d09)
[06:33:11.271] # #5 0x559f1abe5d29 in _start (/tmp/cirrus-ci-build/tmp_install/usr/local/pgsql/bin/pg_dump+0x39d29)
[06:33:11.271] #
[06:33:11.271] # 0x619000000480 is located 0 bytes to the right of 1024-byte region [0x619000000080,0x619000000480)
[06:33:11.271] # allocated by thread T0 here:
[06:33:11.271] # #0 0x7fd91fe14e8f in __interceptor_malloc ../../../../src/libsanitizer/asan/asan_malloc_linux.cpp:145
[06:33:11.271] # #1 0x559f1ac69f35 in pg_malloc_internal /tmp/cirrus-ci-build/src/common/fe_memutils.c:30
[06:33:11.271] # #2 0x559f1ac69f35 in palloc /tmp/cirrus-ci-build/src/common/fe_memutils.c:117
[06:33:11.271] #
[06:33:11.271] # SUMMARY: AddressSanitizer: heap-buffer-overflow /tmp/cirrus-ci-build/src/bin/pg_dump/filter.c:302 in read_pattern
should be fixed in the attached patch

I also found and fixed a small memory leak of 24 bytes per filter row (a
PQExpBufferData).
Regards
Pavel
Greetings,
Andres Freund
Attachment
So.... This patch has been through a lot of commitfests. And it really
doesn't seem that hard to resolve -- Pavel has seemingly been willing
to go along whichever way the wind has been blowing but honestly it
kind of seems like he's just gotten drive-by suggestions and he's put
a lot of work into trying to satisfy them.

He implemented --include-tables-from-file=... etc. Then he implemented
a hand-written parser for a DSL to select objects, then he implemented
a bison parser, then he went back to the hand-written parser.

Can we get some consensus on whether the DSL looks right and whether the
hand-written parser is sensible? And if so, can a committer step up to
actually review and commit the patch? The last review said it might need
a native English speaker to tweak some wording but otherwise looked good.

--
Gregory Stark
As Commitfest Manager
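(Editorial aside: for readers joining here, the DSL in question is a
line-oriented filter file read via --filter; a sketch based on the examples
upthread, with made-up object names and file name:

include table public.t1
exclude table_data public.t1_log
include table_and_children measurement

$ pg_dump --filter=objects.txt mydb

Each line is an include/exclude keyword, an object type, and a pattern.)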
> On 6 Mar 2023, at 21:45, Gregory Stark (as CFM) <stark.cfm@gmail.com> wrote:
>
> So.... This patch has been through a lot of commitfests. And it really
> doesn't seem that hard to resolve -- Pavel has seemingly been willing
> to go along whichever way the wind has been blowing but honestly it
> kind of seems like he's just gotten drive-by suggestions and he's put
> a lot of work into trying to satisfy them.

Agreed.

> He implemented --include-tables-from-file=... etc. Then he implemented
> a hand-written parser for a DSL to select objects, then he implemented
> a bison parser, then he went back to the hand-written parser.

Well, kind of. I was trying to take the patch to the finishing line but was
uncomfortable with the hand written parser so I implemented a parser in Bison
to replace it with. Not that hand-written parsers are bad per se (or that my
bison parser was perfect), but reading quoted identifiers across line
boundaries tends to require a fair amount of handwritten code. Pavel did not
object to this version, but it was objected to by two other committers.

At this point [0] I stepped down from trying to finish it as the approach I
was comfortable with didn't gain traction (which is totally fine).

Downthread from this the patch got a lot of reviews from Julien with the old
parser back in place.

> Can we get some consensus on whether the DSL looks right

I would consider this pretty settled.

> and whether the hand-written parser is sensible.

This is the part where a committer who wants to pursue the hand-written
parser needs to step up. With the amount of review received it's hopefully
pretty close.

--
Daniel Gustafsson

[0] 098531E1-FBA9-4B7D-884E-0A4363EEE6DF@yesql.se
Hi,

On Mon, Mar 06, 2023 at 10:20:32PM +0100, Daniel Gustafsson wrote:
> > On 6 Mar 2023, at 21:45, Gregory Stark (as CFM) <stark.cfm@gmail.com> wrote:
> >
> > So.... This patch has been through a lot of commitfests. And it really
> > doesn't seem that hard to resolve -- Pavel has seemingly been willing
> > to go along whichever way the wind has been blowing but honestly it
> > kind of seems like he's just gotten drive-by suggestions and he's put
> > a lot of work into trying to satisfy them.
>
> Agreed.

Indeed, I'm not sure I would have had that much patience.

> > He implemented --include-tables-from-file=... etc. Then he implemented
> > a hand-written parser for a DSL to select objects, then he implemented
> > a bison parser, then he went back to the hand-written parser.
>
> Well, kind of. I was trying to take the patch to the finishing line but was
> uncomfortable with the hand written parser so I implemented a parser in Bison
> to replace it with. Not that hand-written parsers are bad per se (or that my
> bison parser was perfect), but reading quoted identifiers across line
> boundaries tends to require a fair amount of handwritten code. Pavel did not
> object to this version, but it was objected to by two other committers.
>
> At this point [0] I stepped down from trying to finish it as the approach I
> was comfortable with didn't gain traction (which is totally fine).
>
> Downthread from this the patch got a lot of reviews from Julien with the old
> parser back in place.

Yeah, and the current state seems quite good to me.

> > Can we get some consensus on whether the DSL looks right
>
> I would consider this pretty settled.

Agreed.
On Tue, Mar 7, 2023 at 3:47 Julien Rouhaud <rjuju123@gmail.com> wrote:
Hi,
On Mon, Mar 06, 2023 at 10:20:32PM +0100, Daniel Gustafsson wrote:
> > On 6 Mar 2023, at 21:45, Gregory Stark (as CFM) <stark.cfm@gmail.com> wrote:
> >
> > So.... This patch has been through a lot of commitfests. And it really
> > doesn't seem that hard to resolve -- Pavel has seemingly been willing
> > to go along whichever way the wind has been blowing but honestly it
> > kind of seems like he's just gotten drive-by suggestions and he's put
> > a lot of work into trying to satisfy them.
>
> Agreed.
Indeed, I'm not sure I would have had that much patience.
> > He implemented --include-tables-from-file=... etc. Then he implemented
> > a hand-written parser for a DSL to select objects, then he implemented
> > a bison parser, then he went back to the hand-written parser.
>
> Well, kind of. I was trying to take the patch to the finishing line but was
> uncomfortable with the hand written parser so I implemented a parser in Bison
> to replace it with. Not that hand-written parsers are bad per se (or that my
> bison parser was perfect), but reading quoted identifiers across line
> boundaries tends to require a fair amount of handwritten code. Pavel did not
> object to this version, but it was objected to by two other committers.
>
> At this point [0] I stepped down from trying to finish it as the approach I
> was comfortable with didn't gain traction (which is totally fine).
>
> Downthread from this the patch got a lot of reviews from Julien with the old
> parser back in place.
Yeah, and the current state seems quite good to me.
> > Can we get some consensus on whether the DSL looks right
>
> I would consider this pretty settled.
Agreed.
rebase + enhancing about related option from a563c24
Regards
Pavel
Attachment
Hi
fresh rebase
regards
Pavel
Attachment
On Thu, Mar 16, 2023 at 01:05:41PM +0100, Pavel Stehule wrote:
> rebase + enhancing about related option from a563c24

Thanks.

It looks like this doesn't currently handle extensions, which were added at
6568cef26e.

> +          <literal>table_and_children</literal>: tables, works like
> +          <option>-t</option>/<option>--table</option>, except that
> +          it also includes any partitions or inheritance child
> +          tables of the table(s) matching the
> +          <replaceable class="parameter">pattern</replaceable>.

Why doesn't this just say "works like --table-and-children"?

I think as you wrote log_invalid_filter_format(), the messages wouldn't be
translated, because they're printed via %s. One option is to call _() on
the message.

> +ok($dump !=~ qr/^CREATE TABLE public\.bootab/m, "exclude dumped children table");

!=~ is being interpreted as a numeric "!=" and throwing warnings.
It should be a !~ b, right?
It'd be nice if perl warnings during the tests were less easy to miss.

> + * char is not alpha. The char '_' is allowed too (exclude first position).

Why is it treated specially? Could it be treated the same as alpha?

> + log_invalid_filter_format(&fstate,
> +     "\"include\" table data filter is not allowed");
> + log_invalid_filter_format(&fstate,
> +     "\"include\" table data and children filter is not allowed");

For these, it might be better to write the literal option:

> +     "include filter for \"table_data_and_children\" is not allowed");

because the option is a literal and shouldn't be translated. And it's
probably better to write that using %s, like:

> +     "include filter for \"%s\" is not allowed");

That makes shorter and fewer strings.

Find attached a bunch of other corrections as 0002.txt

I also dug up what I'd started in November, trying to reduce the code
duplication between pg_restore/dump/all. This isn't done, but I might never
finish it, so I'll at least show what I have in case you think it's a good
idea. This passes tests on CI, except for autoconf, due to using
exit_nicely() differently.

--
Justin
Attachment
On Sun, Mar 19, 2023 at 15:01 Justin Pryzby <pryzby@telsasoft.com> wrote:
On Thu, Mar 16, 2023 at 01:05:41PM +0100, Pavel Stehule wrote:
> rebase + enhancing about related option from a563c24
Thanks.
It looks like this doesn't currently handle extensions, which were added
at 6568cef26e.
> + <literal>table_and_children</literal>: tables, works like
> + <option>-t</option>/<option>--table</option>, except that
> + it also includes any partitions or inheritance child
> + tables of the table(s) matching the
> + <replaceable class="parameter">pattern</replaceable>.
Why doesn't this just say "works like --table-and-children" ?
I think as you wrote log_invalid_filter_format(), the messages wouldn't
be translated, because they're printed via %s. One option is to call
_() on the message.
> +ok($dump !=~ qr/^CREATE TABLE public\.bootab/m, "exclude dumped children table");
!=~ is being interpreted as a numeric "!=" and throwing warnings.
It should be a !~ b, right?
It'd be nice if perl warnings during the tests were less easy to miss.
> + * char is not alpha. The char '_' is allowed too (exclude first position).
Why is it treated specially? Could it be treated the same as alpha?
> + log_invalid_filter_format(&fstate,
> + "\"include\" table data filter is not allowed");
> + log_invalid_filter_format(&fstate,
> + "\"include\" table data and children filter is not allowed");
For these, it might be better to write the literal option:
> + "include filter for \"table_data_and_children\" is not allowed");
Because the option is a literal and shouldn't be translated.
And it's probably better to write that using %s, like:
> + "include filter for \"%s\" is not allowed");
That makes shorter and fewer strings.
Find attached a bunch of other corrections as 0002.txt
Thank you very much - I'll recheck the mentioned points tomorrow.
I also dug up what I'd started in November, trying to reduce the code
duplication between pg_restore/dump/all. This isn't done, but I might
never finish it, so I'll at least show what I have in case you think
it's a good idea. This passes tests on CI, except for autoconf, due to
using exit_nicely() differently.
Your implementation reduced 60 lines, but the interface and code are more
complex. I cannot say which is significantly better. Personally, in this
case, I prefer my variant, because I think it is a little bit more readable,
and possible modifications can be simpler. But this is just my opinion, and
I have no problem accepting other opinions. I could imagine defining some
configuration array like getopt, but it looks like over-engineering.
Regards
Pavel
--
Justin
On Sun, Mar 19, 2023 at 15:01 Justin Pryzby <pryzby@telsasoft.com> wrote:
On Thu, Mar 16, 2023 at 01:05:41PM +0100, Pavel Stehule wrote:
> rebase + enhancing about related option from a563c24
Thanks.
It looks like this doesn't currently handle extensions, which were added
at 6568cef26e.
> + <literal>table_and_children</literal>: tables, works like
> + <option>-t</option>/<option>--table</option>, except that
> + it also includes any partitions or inheritance child
> + tables of the table(s) matching the
> + <replaceable class="parameter">pattern</replaceable>.
Why doesn't this just say "works like --table-and-children" ?
changed
I think as you wrote log_invalid_filter_format(), the messages wouldn't
be translated, because they're printed via %s. One option is to call
_() on the message.
fixed
> +ok($dump !=~ qr/^CREATE TABLE public\.bootab/m, "exclude dumped children table");
!=~ is being interpreted as a numeric "!=" and throwing warnings.
It should be a !~ b, right?
It'd be nice if perl warnings during the tests were less easy to miss.
should be fixed by you
> + * char is not alpha. The char '_' is allowed too (exclude first position).
Why is it treated specially? Could it be treated the same as alpha?
It is the usual behaviour in Postgres for keywords. The important part is the
complete sentence: "Returns NULL when the buffer is empty or the first char is
not alpha."
In this case the implementation has no big impact on behaviour - you would
probably get the message "unknown keyword" instead of "missing keyword". But I
would implement behaviour consistent with other places. My opinion in this
case is not extra strong - we can define the form of keywords however we want;
this way is just consistent with other parsers in Postgres.
> + log_invalid_filter_format(&fstate,
> + "\"include\" table data filter is not allowed");
> + log_invalid_filter_format(&fstate,
> + "\"include\" table data and children filter is not allowed");
For these, it might be better to write the literal option:
> + "include filter for \"table_data_and_children\" is not allowed");
Because the option is a literal and shouldn't be translated.
And it's probably better to write that using %s, like:
> + "include filter for \"%s\" is not allowed");
done
That makes shorter and fewer strings.
Find attached a bunch of other corrections as 0002.txt
merged
Regards
Pavel
Attachment
On Mon, Mar 20, 2023 at 08:01:13AM +0100, Pavel Stehule wrote:
> On Sun, Mar 19, 2023 at 15:01 Justin Pryzby <pryzby@telsasoft.com> wrote:
>
> > On Thu, Mar 16, 2023 at 01:05:41PM +0100, Pavel Stehule wrote:
> > > rebase + enhancing about related option from a563c24
> >
> > Thanks.
> >
> > It looks like this doesn't currently handle extensions, which were added
> > at 6568cef26e.

What about this part? Should extension filters be supported?

I think the comment that I'd patched that lists all the filter types should
be minimized, rather than duplicating the list of all the possible filters
that's already in the user-facing documentation.

One new typo: childrent
On Tue, Mar 21, 2023 at 16:32 Justin Pryzby <pryzby@telsasoft.com> wrote:
On Mon, Mar 20, 2023 at 08:01:13AM +0100, Pavel Stehule wrote:
> ne 19. 3. 2023 v 15:01 odesílatel Justin Pryzby <pryzby@telsasoft.com> napsal:
>
> > On Thu, Mar 16, 2023 at 01:05:41PM +0100, Pavel Stehule wrote:
> > > rebase + enhancing about related option from a563c24
> >
> > Thanks.
> >
> > It looks like this doesn't currently handle extensions, which were added
> > at 6568cef26e.
What about this part ? Should extension filters be supported ?
I missed this, yes, it should be supported.
I think the comment that I'd patched that lists all the filter types
should be minimized, rather than duplicating the list of all the
possible filters that's already in the user-facing documentation.
One new typo: childrent
On Tue, Mar 21, 2023 at 16:32 Justin Pryzby <pryzby@telsasoft.com> wrote:
On Mon, Mar 20, 2023 at 08:01:13AM +0100, Pavel Stehule wrote:
> ne 19. 3. 2023 v 15:01 odesílatel Justin Pryzby <pryzby@telsasoft.com> napsal:
>
> > On Thu, Mar 16, 2023 at 01:05:41PM +0100, Pavel Stehule wrote:
> > > rebase + enhancing about related option from a563c24
> >
> > Thanks.
> >
> > It looks like this doesn't currently handle extensions, which were added
> > at 6568cef26e.
What about this part ? Should extension filters be supported ?
should be fixed
I think the comment that I'd patched that lists all the filter types
should be minimized, rather than duplicating the list of all the
possible filters that's already in the user-facing documentation.
I modified this comment. Please check.
One new typo: childrent
fixed
Regards
Pavel
Attachment
Hi
only rebase
Regards
Pavel
Attachment
Hi
On Mon, Sep 11, 2023 at 6:34 Pavel Stehule <pavel.stehule@gmail.com> wrote:

> Hi
>
> only rebase
Unfortunately this rebase was not correct. I am sorry.
fixed version
Regards
Pavel
Attachment
Hi
On Mon, Sep 11, 2023 at 6:57 Pavel Stehule <pavel.stehule@gmail.com> wrote:

> Hi
>
> On Mon, Sep 11, 2023 at 6:34 Pavel Stehule <pavel.stehule@gmail.com> wrote:
>
> > Hi
> >
> > only rebase
>
> Unfortunately this rebase was not correct. I am sorry.
>
> fixed version
and fixed a forgotten "break" in a switch
Regards
Pavel
Attachment
I went and had another look at this. The patch has been around for 18
commitfests and is widely considered to add a good feature, so it seems
about time to reach closure.

As I've mentioned in the past I'm not a big fan of the parser, but the
thread has overruled me on that. Another thing I think is a bit
overcomplicated is the layered error handling for printing log messages,
and the bubbling up of errors to get around not being able to call
exit_nicely. In the attached version I've boiled down the error logging
into a single new function pg_log_filter_error() which takes a variable
format string. This removes a fair bit of the extra calls and makes
logging easier.

I've also added a function pointer to the FilterStateData for passing the
exit function via filter_init. This allows the filtering code to exit
gracefully regardless of which application is using it.

Finally, I've also reimplemented the logic for checking the parsed tokens
into switch statements without defaults, in order to get the compiler
warning on a missed case. It's easy to miss adding code to handle a state,
especially when adding new ones, and this should help highlight that.

Overall, this does shave a bit off the patch in size, for what IMHO is
better readability and maintainability. (I've also made a pgindent pass
over it of course.)

What are your thoughts on this version? It's not in a committable state as
it needs a bit more comments here and there and a triple-check that nothing
was missed in changing this, but I prefer to get your thoughts before
spending the extra time.

--
Daniel Gustafsson
Attachment
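(Editorial aside: a sketch of the exit-callback arrangement described above;
field and function names are illustrative, not necessarily what the committed
filter.h uses:

#include <stdio.h>

typedef void (*exit_function) (int status);

typedef struct
{
	const char *filename;		/* filter file being read */
	FILE	   *fp;				/* open stream for the filter file */
	int			lineno;			/* current line, for error messages */
	exit_function exit_nicely;	/* caller-supplied exit routine */
} FilterStateData;

/*
 * Each client passes its own exit routine, so the shared filtering code
 * needs no link-time knowledge of how pg_dump, pg_dumpall or pg_restore
 * implement exit_nicely.
 */
extern void filter_init(FilterStateData *fstate, const char *filename,
						exit_function f_exit);
)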
Hi
What are your thoughts on this version? It's not in a committable state as it
needs a bit more comments here and there and a triple-check that nothing was
missed in changing this, but I prefer to get your thoughts before spending the
extra time.
I think using a pointer to the exit function is an elegant solution. I checked
the code and found only one issue. I fixed this warning:
[13:57:22.578] time make -s -j${BUILD_JOBS} world-bin
[13:58:20.858] filter.c: In function ‘pg_log_filter_error’:
[13:58:20.858] filter.c:161:2: error: function ‘pg_log_filter_error’ might be a candidate for ‘gnu_printf’ format attribute [-Werror=suggest-attribute=format]
[13:58:20.858] 161 | vsnprintf(buf, sizeof(buf), fmt, argp);
[13:58:20.858] | ^~~~~~~~~
[13:58:20.858] cc1: all warnings being treated as errors
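(Editorial aside: the usual cure for this class of warning in the PostgreSQL
tree is to annotate the prototype with pg_attribute_printf() so the compiler
can type-check the format arguments; a sketch, with the signature assumed
from the warning above:

extern void pg_log_filter_error(FilterStateData *fstate, const char *fmt, ...)
			pg_attribute_printf(2, 3);

Here "2, 3" says argument 2 is the format string and the variadic arguments
begin at position 3.)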
and a probable copy/paste bug:
diff --git a/src/bin/pg_dump/pg_restore.c b/src/bin/pg_dump/pg_restore.c
index f647bde28d..ab2abedf5f 100644
--- a/src/bin/pg_dump/pg_restore.c
+++ b/src/bin/pg_dump/pg_restore.c
@@ -535,7 +535,7 @@ read_restore_filters(const char *filename, RestoreOptions *opts)
case FILTER_OBJECT_TYPE_EXTENSION:
case FILTER_OBJECT_TYPE_FOREIGN_DATA:
pg_log_filter_error(&fstate, _("%s filter for \"%s\" is not allowed."),
- "exclude",
+ "include",
filter_object_type_name(objtype));
exit_nicely(1);
Regards
Pavel
--
Daniel Gustafsson
Attachment
Hi
On Sun, Nov 12, 2023 at 14:17 Pavel Stehule <pavel.stehule@gmail.com> wrote:

> Hi
>
> > What are your thoughts on this version? It's not in a committable state as it
> > needs a bit more comments here and there and a triple-check that nothing was
> > missed in changing this, but I prefer to get your thoughts before spending the
> > extra time.
>
> I think using a pointer to the exit function is an elegant solution. I checked
> the code and found only one issue. I fixed this warning:
>
> [13:57:22.578] time make -s -j${BUILD_JOBS} world-bin
> [13:58:20.858] filter.c: In function ‘pg_log_filter_error’:
> [13:58:20.858] filter.c:161:2: error: function ‘pg_log_filter_error’ might be a candidate for ‘gnu_printf’ format attribute [-Werror=suggest-attribute=format]
> [13:58:20.858]   161 |  vsnprintf(buf, sizeof(buf), fmt, argp);
> [13:58:20.858]       |  ^~~~~~~~~
> [13:58:20.858] cc1: all warnings being treated as errors
>
> and a probable copy/paste bug:
>
> diff --git a/src/bin/pg_dump/pg_restore.c b/src/bin/pg_dump/pg_restore.c
> index f647bde28d..ab2abedf5f 100644
> --- a/src/bin/pg_dump/pg_restore.c
> +++ b/src/bin/pg_dump/pg_restore.c
> @@ -535,7 +535,7 @@ read_restore_filters(const char *filename, RestoreOptions *opts)
>              case FILTER_OBJECT_TYPE_EXTENSION:
>              case FILTER_OBJECT_TYPE_FOREIGN_DATA:
>                  pg_log_filter_error(&fstate, _("%s filter for \"%s\" is not allowed."),
> -                                    "exclude",
> +                                    "include",
>                                      filter_object_type_name(objtype));
>                  exit_nicely(1);
>
> Regards
>
> Pavel
next update - fixed a used-but-uninitialized "is_include" variable when the
filter is of FILTER_OBJECT_TYPE_NONE, fixing this crash:

# Running: pg_ctl -w -D /tmp/cirrus-ci-build/build-32/testrun/pg_dump/005_pg_dump_filterfile/data/t_005_pg_dump_filterfile_main_data/pgdata -l /tmp/cirrus-ci-build/build-32/testrun/pg_dump/005_pg_dump_filterfile/log/005_pg_dump_filterfile_main.log -o --cluster-name=main start
waiting for server to start.... done
server started
# Postmaster PID for node "main" is 71352
# Running: pg_dump -p 65454 -f /tmp/cirrus-ci-build/build-32/testrun/pg_dump/005_pg_dump_filterfile/data/t_005_pg_dump_filterfile_main_data/backup/plain.sql --filter=/tmp/cirrus-ci-build/build-32/testrun/pg_dump/005_pg_dump_filterfile/data/tmp_test_0mO3/inputfile.txt postgres
../src/bin/pg_dump/pg_dump.c:18800:7: runtime error: load of value 86, which is not a valid value for type '_Bool'
==71579==Using libbacktrace symbolizer.
    #0 0x566302cd in read_dump_filters ../src/bin/pg_dump/pg_dump.c:18800
    #1 0x56663429 in main ../src/bin/pg_dump/pg_dump.c:670
    #2 0xf7694e45 in __libc_start_main (/lib/i386-linux-gnu/libc.so.6+0x1ae45)
    #3 0x56624d50 in _start (/tmp/cirrus-ci-build/build-32/tmp_install/usr/local/pgsql/bin/pg_dump+0x1ad50)

Regards
Pavel
--
Daniel Gustafsson
Attachment
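(Editorial aside: UBSan's "load of value 86, which is not a valid value for
type '_Bool'" above is the signature of reading a bool that was never
assigned. The defensive pattern is to give such out-flags a defined default
before any code path can read them; a minimal self-contained sketch with
hypothetical names, not the actual pg_dump.c code:

#include <stdbool.h>
#include <stdio.h>

/* hypothetical stand-in for the real filter-item reader */
static bool
read_next_filter_item(FILE *fp, bool *is_include)
{
	(void) fp;
	(void) is_include;
	return false;			/* pretend we hit EOF immediately */
}

int
main(void)
{
	bool		is_include = false;	/* defined default, not stack garbage */

	/* the early-return path can no longer leak an undefined bool */
	if (!read_next_filter_item(stdin, &is_include))
		return 0;

	return is_include ? 1 : 0;
}
)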
> On 13 Nov 2023, at 14:15, Pavel Stehule <pavel.stehule@gmail.com> wrote:
>
> Hi
>
> On Sun, Nov 12, 2023 at 14:17 Pavel Stehule <pavel.stehule@gmail.com> wrote:
>
> > What are your thoughts on this version? It's not in a committable state as it
> > needs a bit more comments here and there and a triple-check that nothing was
> > missed in changing this, but I prefer to get your thoughts before spending the
> > extra time.
>
> I think using a pointer to the exit function is an elegant solution. I checked
> the code and found only one issue. I fixed this warning:
>
> [13:57:22.578] time make -s -j${BUILD_JOBS} world-bin
> [13:58:20.858] filter.c: In function ‘pg_log_filter_error’:
> [13:58:20.858] filter.c:161:2: error: function ‘pg_log_filter_error’ might be a candidate for ‘gnu_printf’ format attribute [-Werror=suggest-attribute=format]
> [13:58:20.858]   161 |  vsnprintf(buf, sizeof(buf), fmt, argp);
> [13:58:20.858]       |  ^~~~~~~~~
> [13:58:20.858] cc1: all warnings being treated as errors
>
> and a probable copy/paste bug:
>
> diff --git a/src/bin/pg_dump/pg_restore.c b/src/bin/pg_dump/pg_restore.c
> index f647bde28d..ab2abedf5f 100644
> --- a/src/bin/pg_dump/pg_restore.c
> +++ b/src/bin/pg_dump/pg_restore.c
> @@ -535,7 +535,7 @@ read_restore_filters(const char *filename, RestoreOptions *opts)
>              case FILTER_OBJECT_TYPE_EXTENSION:
>              case FILTER_OBJECT_TYPE_FOREIGN_DATA:
>                  pg_log_filter_error(&fstate, _("%s filter for \"%s\" is not allowed."),
> -                                    "exclude",
> +                                    "include",
>                                      filter_object_type_name(objtype));
>                  exit_nicely(1);
>
> next update - fixed a used-but-uninitialized "is_include" variable when the
> filter is of FILTER_OBJECT_TYPE_NONE

Thanks, the posted patchset was indeed a bit of a sketch, thanks for fixing up
these. I'll go over it again too to clean it up and try to make it into
something committable.

I was pondering replacing the is_include handling with returning an enum for
the operation, to keep things more future proof in case we add more operations
(and also a bit less magic IMHO).

--
Daniel Gustafsson
On Mon, Nov 13, 2023 at 14:39 Daniel Gustafsson <daniel@yesql.se> wrote:
> On 13 Nov 2023, at 14:15, Pavel Stehule <pavel.stehule@gmail.com> wrote:
>
> Hi
>
> On Sun, Nov 12, 2023 at 14:17 Pavel Stehule <pavel.stehule@gmail.com> wrote:
> Hi
>
>
> What are your thoughts on this version? It's not in a committable state as it
> needs a bit more comments here and there and a triple-check that nothing was
> missed in changing this, but I prefer to get your thoughts before spending the
> extra time.
>
> I think using a pointer to the exit function is an elegant solution. I checked the code and found only one issue. I fixed this warning:
>
> [13:57:22.578] time make -s -j${BUILD_JOBS} world-bin
> [13:58:20.858] filter.c: In function ‘pg_log_filter_error’:
> [13:58:20.858] filter.c:161:2: error: function ‘pg_log_filter_error’ might be a candidate for ‘gnu_printf’ format attribute [-Werror=suggest-attribute=format]
> [13:58:20.858] 161 | vsnprintf(buf, sizeof(buf), fmt, argp);
> [13:58:20.858] | ^~~~~~~~~
> [13:58:20.858] cc1: all warnings being treated as errors
>
> and a probable copy/paste bug:
>
> diff --git a/src/bin/pg_dump/pg_restore.c b/src/bin/pg_dump/pg_restore.c
> index f647bde28d..ab2abedf5f 100644
> --- a/src/bin/pg_dump/pg_restore.c
> +++ b/src/bin/pg_dump/pg_restore.c
> @@ -535,7 +535,7 @@ read_restore_filters(const char *filename, RestoreOptions *opts)
> case FILTER_OBJECT_TYPE_EXTENSION:
> case FILTER_OBJECT_TYPE_FOREIGN_DATA:
> pg_log_filter_error(&fstate, _("%s filter for \"%s\" is not allowed."),
> - "exclude",
> + "include",
> filter_object_type_name(objtype));
> exit_nicely(1);
>
> Regards
>
> Pavel
>
> next update - fixed a used-but-uninitialized "is_include" variable when the filter is of FILTER_OBJECT_TYPE_NONE
Thanks, the posted patchset was indeed a bit of a sketch, thanks for fixing up
these. I'll go over it again too to clean it up and try to make it into something
committable.
I was pondering replacing the is_include handling with returning an enum for
the operation, to keep things more future proof in case we add more operations
(and also a bit less magic IMHO).
+1
Pavel
--
Daniel Gustafsson
Hi
I was pondering replacing the is_include handling with returning an enum for
the operation, to keep things more future proof in case we add more operations
(and also a bit less magic IMHO).

+1
I did it.
Regards
Pavel
--
Daniel Gustafsson
Attachment
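(Editorial aside: a sketch of the enum that replaces the is_include boolean;
the names mirror the FILTER_OBJECT_TYPE_* convention visible upthread but are
illustrative rather than a quote of the committed code:

typedef enum
{
	FILTER_COMMAND_TYPE_NONE,		/* nothing parsed yet, or an error */
	FILTER_COMMAND_TYPE_INCLUDE,	/* "include" keyword */
	FILTER_COMMAND_TYPE_EXCLUDE		/* "exclude" keyword */
} FilterCommandType;

Unlike a bool, switching over this enum without a default: label makes the
compiler warn whenever a new operation is added and a case is left
unhandled.)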
> On 20 Nov 2023, at 06:20, Pavel Stehule <pavel.stehule@gmail.com> wrote:
>
> > I was pondering replacing the is_include handling with returning an enum for
> > the operation, to keep things more future proof in case we add more operations
> > (and also a bit less magic IMHO).
>
> +1
>
> I did it.

Nice, I think it's an improvement.

+          <literal>extension</literal>: data on foreign servers, works like
+          <option>--extension</option>. This keyword can only be
+          used with the <literal>include</literal> keyword.

This seems like a copy-pasteo, fixed in the attached.

I've spent some time polishing this version of the patch, among other things
trying to make the docs and --help screen consistent across the tools. I've
added the diff as a txt file to this email (to keep the CFbot from applying
it); it's mainly reformatting a few comments and making things consistent.

The attached is pretty close to a committable patch IMO, review is welcome on
both the patch and commit message. I tried to identify all reviewers over the
past 3+ years but I might have missed someone.

--
Daniel Gustafsson
Attachment
On 11/21/23 at 22:10, Daniel Gustafsson wrote:
>> On 20 Nov 2023, at 06:20, Pavel Stehule <pavel.stehule@gmail.com> wrote:
>
> The attached is pretty close to a committable patch IMO, review is welcome on
> both the patch and commit message. I tried to identify all reviewers over the
> past 3+ years but I might have missed someone.

I've tested this, albeit mostly in the initial iterations (*shrug* but a
mention is nice)

Erik Rijkers

>
> --
> Daniel Gustafsson
>
> On 22 Nov 2023, at 05:27, Erik Rijkers <er@xs4all.nl> wrote:
>
> On 11/21/23 at 22:10, Daniel Gustafsson wrote:
>>> On 20 Nov 2023, at 06:20, Pavel Stehule <pavel.stehule@gmail.com> wrote:
>>
>> The attached is pretty close to a committable patch IMO, review is welcome on
>> both the patch and commit message. I tried to identify all reviewers over the
>> past 3+ years but I might have missed someone.

I took another look at this, found some more polish that was needed, added
another testcase, and ended up pushing it.

> I've tested this, albeit mostly in the initial iterations (*shrug* but a
> mention is nice)

As I mentioned above, it's easy to miss someone when reviewing three years'
worth of emails; no-one was intentionally left out. I went back and looked
and added you as a reviewer. Thanks for letting me know.

--
Daniel Gustafsson
Hi
On Wed, Nov 29, 2023 at 15:44 Daniel Gustafsson <daniel@yesql.se> wrote:
> On 22 Nov 2023, at 05:27, Erik Rijkers <er@xs4all.nl> wrote:
>
> On 11/21/23 at 22:10, Daniel Gustafsson wrote:
>>> On 20 Nov 2023, at 06:20, Pavel Stehule <pavel.stehule@gmail.com> wrote:
>
>> The attached is pretty close to a committable patch IMO, review is welcome on
>> both the patch and commit message. I tried to identify all reviewers over the
>> past 3+ years but I might have missed someone.
I took another look at this, found some more polish that was needed, added
another testcase and ended up pushing it.
> I've tested this, albeit mostly in the initial iterations (*shrug* but a mention is nice)
As I mentioned above it's easy to miss when reviewing three years worth of
emails, no-one was intentionally left out. I went back and looked and added
you as a reviewer. Thanks for letting me know.
Thank you very much
Regards
Pavel
--
Daniel Gustafsson
Daniel Gustafsson <daniel@yesql.se> writes:
> I took another look at this, found some more polish that was needed, added
> another testcase and ended up pushing it.

mamba is unhappy because this uses <ctype.h> functions without casting their
arguments to unsigned char:

https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mamba&dt=2023-11-30%2002%3A53%3A25

(I had not realized that we still had buildfarm animals that would complain
about this ... but I'm glad we do, because it's a hazard. POSIX is quite
clear that the behavior is undefined for signed chars.)

			regards, tom lane
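(Editorial aside: the idiom Tom is referring to, as used throughout the
PostgreSQL sources; a minimal illustration:

#include <ctype.h>
#include <stdbool.h>

/*
 * The cast to unsigned char is required: passing a negative value (a
 * high-bit byte on platforms where plain char is signed) to a <ctype.h>
 * classifier is undefined behavior per POSIX.
 */
static bool
byte_is_space(char c)
{
	return isspace((unsigned char) c) != 0;
}
)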
Hi
On Thu, Nov 30, 2023 at 4:40 Tom Lane <tgl@sss.pgh.pa.us> wrote:
Daniel Gustafsson <daniel@yesql.se> writes:
> I took another look at this, found some more polish that was needed, added
> another testcase and ended up pushing it.
mamba is unhappy because this uses <ctype.h> functions without
casting their arguments to unsigned char:
https://buildfarm.postgresql.org/cgi-bin/show_log.pl?nm=mamba&dt=2023-11-30%2002%3A53%3A25
(I had not realized that we still had buildfarm animals that would
complain about this ... but I'm glad we do, because it's a hazard.
POSIX is quite clear that the behavior is undefined for signed chars.)
here is a patch
Regards
Pavel
regards, tom lane
Attachment
> On 30 Nov 2023, at 07:13, Pavel Stehule <pavel.stehule@gmail.com> wrote:
>
> On Thu, Nov 30, 2023 at 4:40 Tom Lane <tgl@sss.pgh.pa.us> wrote:
>
> > Daniel Gustafsson <daniel@yesql.se> writes:
> > > I took another look at this, found some more polish that was needed, added
> > > another testcase and ended up pushing it.
> >
> > mamba is unhappy because this uses <ctype.h> functions without
> > casting their arguments to unsigned char:

Thanks for the heads-up.

> here is a patch

I agree with this fix, and have applied it.

--
Daniel Gustafsson
On Thu, Nov 30, 2023 at 14:05 Daniel Gustafsson <daniel@yesql.se> wrote:
> On 30 Nov 2023, at 07:13, Pavel Stehule <pavel.stehule@gmail.com> wrote:
> On Thu, Nov 30, 2023 at 4:40 Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Daniel Gustafsson <daniel@yesql.se> writes:
> > I took another look at this, found some more polish that was needed, added
> > another testcase and ended up pushing it.
>
> mamba is unhappy because this uses <ctype.h> functions without
> casting their arguments to unsigned char:
Thanks for the heads-up.
> here is a patch
I agree with this fix, and have applied it.
Thank you
Pavel
--
Daniel Gustafsson