Thread: BUG #16207: localization functions upper() and lower() does not work for text returned by convert_from()

BUG #16207: localization functions upper() and lower() does not work for text returned by convert_from()

From

PG Bug reporting form

Date:

15 January 2020, 15:41:41

The following bug has been logged on the website:

Bug reference:      16207
Logged by:          toha.bakanovsky
Email address:      im-scooter@yandex.ru
PostgreSQL version: 12.0
Operating system:   Ubuntu 16.04.11 64-bit
Description:

Localization functions upper() and lower() does not work for text returned
by convert_from()  for cyrillic symbols.
select upper(convert_from(convert_to('абв', 'utf8'), 'utf8')) =
upper('абв'); -- return false
select upper(convert_from('абв'::bytea, 'utf8')), upper('абв'); -- return
false

hexdecimal presentation of   text is identical 
select convert_to(convert_from('абв'::bytea, 'utf8'), 'utf8') =
convert_to('абв', 'utf-8'); -- returns true

SHOW lc_ctype; -- -- "en_US.UTF-8"

On Postgresql 10.0 it works fine.

Re: BUG #16207: localization functions upper() and lower() does not work for text returned by convert_from()

From

Tom Lane

Date:

15 January 2020, 16:30:54

PG Bug reporting form <noreply@postgresql.org> writes:
> Localization functions upper() and lower() does not work for text returned
> by convert_from()  for cyrillic symbols.
> select upper(convert_from(convert_to('абв', 'utf8'), 'utf8')) =
> upper('абв'); -- return false

That happens because upper() depends on the collation that the parser
determines for its input expression.  When you write just upper('абв')
you get the database's default collation attached to the literal, since
it's implicitly typed as text.  However the convert_from() invocation
ends up with collation "C", in which non-ASCII characters aren't
going to be changed by upper().  You can fix it by assigning another
collation explicitly, e.g.

# select upper(convert_from(convert_to('абв', 'utf8'), 'utf8'));
 upper
--------
 абв
(1 row)

# select upper(convert_from(convert_to('абв', 'utf8'), 'utf8') collate "default");
 upper
--------
 АБВ
(1 row)

> On Postgresql 10.0 it works fine.

This case sort of accidentally works in pre-v12 versions, because
the convert_from call has no collatable inputs so it gets a default
collation assignment anyway.  However, the function's collation-name
argument is of type name, and in v12 that type has collation "C"
rather than no collation, so the convert_from result ends up with
collation "C" not the database's default.

I had not realized till just now that that change would have any
side-effects on the behavior of convert_from (or more accurately,
on the behavior of expressions involving convert_from as an
input to a collation-sensitive function).  I'm not sure whether
to call it a bug or not --- it's not really obvious that
convert_from() ought to be regarded as producing some particular
collation rather than another one.  In any case, there seems very
little that we could do about it in the v12 release series, since
those catalog entries are fixed now.

Poking around, it seems that all of these functions would
be similarly affected:

# select oid::regprocedure from pg_proc where 'name'::regtype = any(proargtypes) and prorettype = 'text'::regtype;
             oid
-----------------------------
 text(name)
 obj_description(oid,name)
 shobj_description(oid,name)
 convert_from(bytea,name)
 to_ascii(text,name)
(5 rows)

            regards, tom lane