Thread: hashing bpchar for nondeterministic collations is broken

hashing bpchar for nondeterministic collations is broken

From
Jeff Davis
Date:
CREATE COLLATION ctest_nondet(
  provider = icu, locale = '', deterministic = false);

=# select hashbpcharextended('a' collate ctest_nondet, 3);
 hashbpcharextended
---------------------
 2792701603979490504
(1 row)

=# select hashbpcharextended('a  ' collate ctest_nondet, 3);
  hashbpcharextended
----------------------
 -4885217598372536483
(1 row)


I don't see any major consequences, because in both hash indexes and
partitioning it appears that the trailing spaces on the key are already
gone before hashbpcharextended is called. If someone does see bigger
consequences here, please let me know.

Seems to be an oversight. Patch attached.

Also, does someone have an opinion on backporting this? I'm inclined
to.


--
Jeff Davis
PostgreSQL Contributor Team - AWS



Attachment

Re: hashing bpchar for nondeterministic collations is broken

From
Richard Guo
Date:

On Fri, Dec 2, 2022 at 4:11 AM Jeff Davis <pgsql@j-davis.com> wrote:
CREATE COLLATION ctest_nondet(
  provider = icu, locale = '', deterministic = false);

=# select hashbpcharextended('a' collate ctest_nondet, 3);
 hashbpcharextended 
---------------------
 2792701603979490504
(1 row)

=# select hashbpcharextended('a  ' collate ctest_nondet, 3);
  hashbpcharextended 
----------------------
 -4885217598372536483
(1 row)
 
Good catch.  This really shouldn't happen for bpchar as the trailing
blanks are supposed to be ignored.  I can see hashbpchar is doing it
with the correct "true" length, so maybe this is a copy-and-pasteo from
hashtextextended?
 
Also, does someone have an opinion on backporting this? I'm inclined
to.
 
I'm not familiar with the backporting policy, but I also think we need
to as this fixes an obvious oversight.

Thanks
Richard

Re: hashing bpchar for nondeterministic collations is broken

From
Jeff Davis
Date:
On Thu, 2022-12-01 at 12:11 -0800, Jeff Davis wrote:
> I don't see any major consequences

Yikes, it looks like this is a problem for BPCHAR (without typmod
specified):

create table p_bpchar(t bpchar collate ctest_nondet, i int)
  partition by hash(t);
create table p0_bpchar partition of p_bpchar
  for values with (modulus 4, remainder 0);
create table p1_bpchar partition of p_bpchar
  for values with (modulus 4, remainder 1);
create table p2_bpchar partition of p_bpchar
  for values with (modulus 4, remainder 2);
create table p3_bpchar partition of p_bpchar
  for values with (modulus 4, remainder 3);

insert into p_bpchar values
  ('a',        0),
  ('a ',       1),
  ('a  ',      2),
  ('a   ',     3),
  ('a    ',    4),
  ('a     ',   5),
  ('a      ',  6),
  ('a       ', 7);

select count(*) from p0_bpchar; -- 2
select count(*) from p1_bpchar; -- 2
select count(*) from p2_bpchar; -- 3
select count(*) from p3_bpchar; -- 1

It seems like CHAR is not a problem, even though BPCHAR is documented
as an alias, because the planner treats BPCHAR->CHAR as a length
coercion, which trims trailing spaces.

And we just documented BPCHAR in v16 (0937f6d172), so the problem is
about to be worse. I suppose as of v15 we could argue that BPCHAR is
just an internal detail and that people shouldn't be creating columns
of that type?


--
Jeff Davis
PostgreSQL Contributor Team - AWS





Re: hashing bpchar for nondeterministic collations is broken

From
Jeff Davis
Date:
On Fri, 2022-12-02 at 10:30 +0800, Richard Guo wrote:
>
> I'm not familiar with the backporting policy, but I also think we
> need
> to as this fixes an obvious oversight.


The standard I use is that backpatching something should be safe enough
that users feel safer updating to a minor release (containing ~100 such
fixes) than *not* updating to a minor release.

The only demonstrable problem that I'm aware of[1] relies on an
undocumented type (undocumented before 16, that is) combined with a
non-deterministic collation and hash partitioning. The intersection of
those users is likely to be the empty set.

But assuming someone does use that combination of features in 15.1, and
we release 15.2 with my fix in it, then the consequences could be
severe: a pg_dump with default options on their bpchar-
nondeterministic-collation-hash-partitioned table in 15.1 would be
unrestorable in 15.2. They'd need to take a new pg_dump from 15.1 with
"--load-via-partition-root" and then restore it in 15.2. Obviously that
would be bad in any case, but it's more tolerable going between servers
of different major versions.

So, given that this fix is unlikely to help many people in 15 and
earlier, and that there's a chance that the fix is really painful for
someone; I'm electing not to backpatch it.


[1]
https://www.postgresql.org/message-id/7692740d4736e79032a5dac689cf2e304c03fa78.camel@j-davis.com



Re: hashing bpchar for nondeterministic collations is broken

From
Tom Lane
Date:
Jeff Davis <pgsql@j-davis.com> writes:
> But assuming someone does use that combination of features in 15.1, and
> we release 15.2 with my fix in it, then the consequences could be
> severe: a pg_dump with default options on their bpchar-
> nondeterministic-collation-hash-partitioned table in 15.1 would be
> unrestorable in 15.2. They'd need to take a new pg_dump from 15.1 with
> "--load-via-partition-root" and then restore it in 15.2. Obviously that
> would be bad in any case, but it's more tolerable going between servers
> of different major versions.

Yeah.  Also, do we have this issue with hash indexes?  If so
reindexing might be necessary, which again would be more
palatable in a major release.

> So, given that this fix is unlikely to help many people in 15 and
> earlier, and that there's a chance that the fix is really painful for
> someone; I'm electing not to backpatch it.

+1

            regards, tom lane



Re: hashing bpchar for nondeterministic collations is broken

From
Jeff Davis
Date:
On Fri, 2022-12-02 at 16:49 -0500, Tom Lane wrote:
> Yeah.  Also, do we have this issue with hash indexes?

No, it appears that hash indexes only call hashbpchar, which doesn't
have the bug.

Extended hash functions seem to only be used in hash partitioning, and
other extended hash functions.

Regards,
    Jeff Davis