Thread: BUG #16512: Character considered as a number by regex but can not convert to numeric

BUG #16512: Character considered as a number by regex but can not convert to numeric

From
PG Bug reporting form
Date:
The following bug has been logged on the website:

Bug reference:      16512
Logged by:          tester375
Email address:      oytf@drivetagdev.com
PostgreSQL version: 12.3
Operating system:   Windows 10
Description:

When installed Postgresql 12.3 for testing purposes we have discovered
following possible bug:
Character "8" (different from standard number 8) is treated as a number by
regular expression
  ~'\d+' 
but can not be converted to numeric by
 ::numeric

How to reproduce the bug:
Command:
SELECT '8' ~ '\d+'
returns "false" for PostgreSQL 10.4 on both Windows 10 and CentOS Linux 7
returns "false" for PostgreSQL 11.5 on CentOS Linux 7
returns "true" for PostgreSQL 12.3 on Windows 10 - but in this case,
character still can not be converted to numeric.

This behavior causes obvious problems when using data type conversion
like:
CASE WHEN column1 ~ '\d+' THEN column1::numeric AS column2
Error: '8' can not be cast to numeric


PG Bug reporting form <noreply@postgresql.org> writes:
> When installed Postgresql 12.3 for testing purposes we have discovered
> following possible bug:
> Character "8" (different from standard number 8) is treated as a number by
> regular expression
>   ~'\d+' 
> but can not be converted to numeric by
>  ::numeric

I see no reason to consider this a bug.

The set of characters matched by \d is, as documented, dependent on
your collation choice or locale settings (LC_CTYPE in particular).
We are not going to try to guess at what random Unicode glyphs might
be included in a particular locale, much less make numeric_in take
them all.

If you want to know whether a string is acceptable to numeric_in
I'd suggest writing something based around [0-9] not \d.

            regards, tom lane



Behavior has been tested on both Postgresql 10 and Postgresql 12 with same
LC_CTYPE set. Actualy it was the same Windows computer.

The problem might be fixed by using [0-9\.]+ but it does not explain the
different behaviour among Postgres versions.



--
Sent from: https://www.postgresql-archive.org/PostgreSQL-bugs-f2117394.html



tester357 <tester357.postgres@email.cz> writes:
> Behavior has been tested on both Postgresql 10 and Postgresql 12 with same
> LC_CTYPE set. Actualy it was the same Windows computer.
> The problem might be fixed by using [0-9\.]+ but it does not explain the
> different behaviour among Postgres versions.

True.  If the older version were pre-v10 the explanation would be that it
lacked support for regex character-class checks on characters above U+7FF.
But AFAICS, we should act the same in v10 and later versions: we just ask
the platform's iswdigit() whether the character is a digit.  I have to
conclude there's something inconsistent in iswdigit().

I notice that POSIX seems to mandate that the [:digit:] class be
exactly '0' through '9': the "Locale" chapter says

digit
    Define the characters to be classified as numeric digits.

    In the POSIX locale, only:
    0 1 2 3 4 5 6 7 8 9
    shall be included.

    In a locale definition file, only the digits <zero>, <one>, <two>,
    <three>, <four>, <five>, <six>, <seven>, <eight>, and <nine> shall
    be specified, and in contiguous ascending sequence by numerical
    value. The digits <zero> to <nine> of the portable character set
    are automatically included in this class.

The Linux man page for iswdigit() says the same, so it's no surprise
that the test case returns "false" on Linux.  Sadly, it's also little
surprise if Microsoft thinks they're not bound by the POSIX spec.

In general, I'm not finding anything particularly investigation-worthy
in this report.  If somebody with an appropriate system wanted to chase
down exactly why the v10 vs v12 difference exists, that might be
marginally interesting, but I doubt we'd consider it a bug.
Platform-dependent results are expected here.

            regards, tom lane