Hi Fujii-san,
Thanks for working on this.
On Sat, Aug 2, 2025 at 5:48 PM Fujii Masao <masao.fujii@gmail.com> wrote:
>
> Hi,
>
> It looks like pg_dump --filter can mistakenly treat invalid object types
> in the filter file as valid ones. For example, the invalid type "table-data"
> (probably a typo for "table_data") is incorrectly recognized as "table",
> and pg_dump runs without error when it should fail.
>
> --------------------------------------------
> $ cat filter.txt
> exclude table-data one
>
> $ pg_dump --filter filter.txt
> --
> -- PostgreSQL database dump
> --
> ...
>
> $ echo $?
> 0
> --------------------------------------------
>
> This happens because pg_dump (filter_get_keyword() in pg_dump/filter.c)
> identifies tokens as sequences of ASCII alphabetic characters, treating
> non-alphabetic characters (like hyphens) as token boundaries. As a result,
> "table-data" is parsed as "table".
>
> To fix this, I've attached the patch that updates pg_dump --filter so that
> it treats tokens as strings of non-space characters separated by spaces
> or line endings, ensuring invalid types like "table-data" are correctly
> rejected. Thought?
>
> With the patch:
> --------------------------------------------
> $ cat filter.txt
> exclude table-data one
>
> $ pg_dump --filter filter.txt
> pg_dump: error: invalid format in filter read from file "filter.txt"
> on line 1: unsupported filter object type: "table-data"
> --------------------------------------------
I tested the patch and it LGTM. I noticed two small nits:
1) Comment wording
The loop now calls isspace((unsigned char)*ptr), so a token ends at
any whitespace character (tab, newline, etc.), not just at ASCII space
(0x20). Could we revise the comment from
“strings of non-space characters bounded by space characters”
to something like
“strings of non-space characters bounded by whitespace”
to match the behavior?
2) Variable name
const char *keyword = filter_get_token(&str, &size);
keyword = filter_get_token(&str, &size);
After the patch, filter_get_token() no longer returns a keyword
(letters-only identifier); it now returns any non-whitespace token.
Renaming the variable from keyword to token (or similar) might make
the intent clearer.
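For reference, here is a rough standalone sketch of the two tokenizing
rules discussed above. It is not the code from filter.c or from the patch:
the names get_alpha_keyword(), get_nonspace_token() and token_is() are
made up for illustration, and isalpha() stands in for the ASCII-letter
check (equivalent only in the "C" locale).

```c
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical stand-in for the old rule: skip leading whitespace, then
 * take the longest run of alphabetic characters. Any other character
 * (such as '-') ends the token. Returns NULL if the run is empty.
 */
static const char *
get_alpha_keyword(const char **line, int *size)
{
    const char *ptr = *line;
    const char *start;

    while (isspace((unsigned char) *ptr))
        ptr++;
    start = ptr;
    while (isalpha((unsigned char) *ptr))
        ptr++;

    *size = (int) (ptr - start);
    *line = ptr;                /* caller resumes right after the token */
    return (*size > 0) ? start : NULL;
}

/*
 * Hypothetical stand-in for the new rule: skip leading whitespace, then
 * take the longest run of non-whitespace characters. Only whitespace or
 * end of string ends the token. Returns NULL if the run is empty.
 */
static const char *
get_nonspace_token(const char **line, int *size)
{
    const char *ptr = *line;
    const char *start;

    while (isspace((unsigned char) *ptr))
        ptr++;
    start = ptr;
    while (*ptr != '\0' && !isspace((unsigned char) *ptr))
        ptr++;

    *size = (int) (ptr - start);
    *line = ptr;
    return (*size > 0) ? start : NULL;
}

/* Compare a (pointer, length) token against a NUL-terminated string. */
static int
token_is(const char *tok, int size, const char *expected)
{
    return tok != NULL &&
        (size_t) size == strlen(expected) &&
        strncmp(tok, expected, (size_t) size) == 0;
}
```

With the line "exclude table-data one", the second call to
get_alpha_keyword() yields "table" and stops at the hyphen, which is how
the typo slips through as a valid type. The second call to
get_nonspace_token() yields "table-data" as a whole, so a later lookup
against the list of supported object types can reject it.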
Best,
Xuneng