Thread: Inconsistency between TO_CHAR() and TO_NUMBER()

Inconsistency between TO_CHAR() and TO_NUMBER()

From

Patryk Kordylewski

Date:

06 May 2013, 10:15:06

Hi,

i have the following test-case:

http://psql.privatepaste.com/b3b431851a

We want to convert a TO_CHAR() formated numeric back to numeric using
TO_NUMBER() with the same format string. This does not work with a
german locale.

By removing the FM prefix and using the exact amount of digits in the
format string for TO_NUMBER() it starts to work. Switching the locale
back to en_US does work too.

Is this a bug or am i missing something?

Thanks!

Best regards,
Patryk

Re: Inconsistency between TO_CHAR() and TO_NUMBER()

From

Tom Lane

Date:

06 May 2013, 14:26:29

Patryk Kordylewski <pk@fooby.de> writes:
> Hi,
> i have the following test-case:

> http://psql.privatepaste.com/b3b431851a

Just for the record: this is a completely unacceptable method of
submitting a bug report.  After a month from now, when that paste has
expired, nobody looking at the PG archives will have any way to find out
what you were talking about.  Even without the archival consideration,
you've added an extra step to fetch the test case for anyone reading
your mail.

Having said that, though, this does look wrong ...

            regards, tom lane

Re: Inconsistency between TO_CHAR() and TO_NUMBER()

From

Patryk Kordylewski

Date:

06 May 2013, 14:49:53

Am 06.05.2013 16:26, schrieb Tom Lane:
> Patryk Kordylewski <pk@fooby.de> writes:
>> Hi,
>> i have the following test-case:
>
>> http://psql.privatepaste.com/b3b431851a
>
> Just for the record: this is a completely unacceptable method of
> submitting a bug report.  After a month from now, when that paste has
> expired, nobody looking at the PG archives will have any way to find out
> what you were talking about.  Even without the archival consideration,
> you've added an extra step to fetch the test case for anyone reading
> your mail.
>
> Having said that, though, this does look wrong ...
>
>             regards, tom lane

Hmpf, you're right. I'm sorry. Here is the test-case:

PostgreSQL 9.2.4 (also tested on 9.1 and 8.2)

### WORKS

SET lc_numeric TO 'en_US.UTF-8';
SET

SELECT
   TO_CHAR(123456789.123, 'FM99G999G999G999G999G999G999D000'),
   TO_NUMBER(TO_CHAR(123456789.123, 'FM99G999G999G999G999G999G999D000'),
'FM99G999G999G999G999G999G999D000');
      to_char     |   to_number
-----------------+---------------
  123,456,789.123 | 123456789.123
(1 row)

### DOES NOT WORK

SET lc_numeric TO 'de_DE.UTF-8';
SET

SELECT
   TO_CHAR(123456789.123, 'FM99G999G999G999G999G999G999D000'),
   TO_NUMBER(TO_CHAR(123456789.123, 'FM99G999G999G999G999G999G999D000'),
'FM99G999G999G999G999G999G999D000');
      to_char     | to_number
-----------------+-----------
  123.456.789,123 |   123.456
(1 row)


### FORMAT STRING WITHOUT FM PREFIX AND EXACT NUMBER OF DIGITS WORKS

SELECT
   TO_CHAR(123456789.123, 'FM99G999G999G999G999G999G999D000'),
   TO_NUMBER(TO_CHAR(123456789.123, 'FM99G999G999G999G999G999G999D000'),
'999G999G999D000');
      to_char     |   to_number
-----------------+---------------
  123.456.789,123 | 123456789.123
(1 row)

Thank you.

Best regards,
Patryk

Re: Inconsistency between TO_CHAR() and TO_NUMBER()

From

Tom Lane

Date:

10 May 2013, 15:49:27

Patryk Kordylewski <pk@fooby.de> writes:
> SET lc_numeric TO 'de_DE.UTF-8';
> SET

> SELECT
>    TO_CHAR(123456789.123, 'FM99G999G999G999G999G999G999D000'),
>    TO_NUMBER(TO_CHAR(123456789.123, 'FM99G999G999G999G999G999G999D000'),
> 'FM99G999G999G999G999G999G999D000');
>       to_char     | to_number
> -----------------+-----------
>   123.456.789,123 |   123.456
> (1 row)

I looked into this, and find that the reason it misbehaves is that
NUM_numpart_from_char() will treat a '.' as being a decimal point
*without any regard to locale considerations*.  So even if we have
a locale-dependent format string and a locale that says '.' is a
thousands separator, it does the wrong thing.

It's a bit surprising nobody's complained of this before.

I propose the attached patch.  I'm slightly worried though about whether
this might break any existing applications that are (incorrectly)
depending on a D format specifier being able to match '.' regardless of
locale.  Perhaps we should only apply this to HEAD and not back-patch?

            regards, tom lane

diff --git a/src/backend/utils/adt/formatting.c b/src/backend/utils/adt/formatting.c
index db5dfca51d477d3e9b33b8d2c264495b3b2ec433..81e3329ef60ce4f835fedba50208b8d0f4b19d63 100644
*** a/src/backend/utils/adt/formatting.c
--- b/src/backend/utils/adt/formatting.c
*************** NUM_numpart_from_char(NUMProc *Np, int i
*** 4131,4137 ****
  #endif

      /*
!      * read digit
       */
      if (isdigit((unsigned char) *Np->inout_p))
      {
--- 4131,4137 ----
  #endif

      /*
!      * read digit or decimal point
       */
      if (isdigit((unsigned char) *Np->inout_p))
      {
*************** NUM_numpart_from_char(NUMProc *Np, int i
*** 4151,4190 ****
  #ifdef DEBUG_TO_FROM_CHAR
          elog(DEBUG_elog_output, "Read digit (%c)", *Np->inout_p);
  #endif
-
-         /*
-          * read decimal point
-          */
      }
      else if (IS_DECIMAL(Np->Num) && Np->read_dec == FALSE)
      {
  #ifdef DEBUG_TO_FROM_CHAR
!         elog(DEBUG_elog_output, "Try read decimal point (%c)", *Np->inout_p);
  #endif
!         if (*Np->inout_p == '.')
          {
              *Np->number_p = '.';
              Np->number_p++;
              Np->read_dec = TRUE;
              isread = TRUE;
          }
-         else
-         {
-             int            x = strlen(Np->decimal);
-
- #ifdef DEBUG_TO_FROM_CHAR
-             elog(DEBUG_elog_output, "Try read locale point (%c)",
-                  *Np->inout_p);
- #endif
-             if (x && AMOUNT_TEST(x) && strncmp(Np->inout_p, Np->decimal, x) == 0)
-             {
-                 Np->inout_p += x - 1;
-                 *Np->number_p = '.';
-                 Np->number_p++;
-                 Np->read_dec = TRUE;
-                 isread = TRUE;
-             }
-         }
      }

      if (OVERLOAD_TEST)
--- 4151,4178 ----
  #ifdef DEBUG_TO_FROM_CHAR
          elog(DEBUG_elog_output, "Read digit (%c)", *Np->inout_p);
  #endif
      }
      else if (IS_DECIMAL(Np->Num) && Np->read_dec == FALSE)
      {
+         /*
+          * We need not test IS_LDECIMAL(Np->Num) explicitly here, because
+          * Np->decimal is always just "." if we don't have a D format token.
+          * So we just unconditionally match to Np->decimal.
+          */
+         int            x = strlen(Np->decimal);
+
  #ifdef DEBUG_TO_FROM_CHAR
!         elog(DEBUG_elog_output, "Try read decimal point (%c)",
!              *Np->inout_p);
  #endif
!         if (x && AMOUNT_TEST(x) && strncmp(Np->inout_p, Np->decimal, x) == 0)
          {
+             Np->inout_p += x - 1;
              *Np->number_p = '.';
              Np->number_p++;
              Np->read_dec = TRUE;
              isread = TRUE;
          }
      }

      if (OVERLOAD_TEST)

Re: Inconsistency between TO_CHAR() and TO_NUMBER()

From

Thomas Kellerer

Date:

10 May 2013, 16:10:13

Tom Lane wrote on 10.05.2013 17:49:
> I looked into this, and find that the reason it misbehaves is that
> NUM_numpart_from_char() will treat a '.' as being a decimal point
> *without any regard to locale considerations*.  So even if we have
> a locale-dependent format string and a locale that says '.' is a
> thousands separator, it does the wrong thing.
>
> It's a bit surprising nobody's complained of this before.
>
> I propose the attached patch.  I'm slightly worried though about whether
> this might break any existing applications that are (incorrectly)
> depending on a D format specifier being able to match '.' regardless of
> locale.  Perhaps we should only apply this to HEAD and not back-patch?

The manual claims that 'D' is locale dependent (whereas '.' is not), so
_theoretically_ a back patch would make sense I guess.

Re: Inconsistency between TO_CHAR() and TO_NUMBER()

From

Euler Taveira

Date:

10 May 2013, 22:17:50

On 10-05-2013 13:09, Thomas Kellerer wrote:
> Tom Lane wrote on 10.05.2013 17:49:
>> I looked into this, and find that the reason it misbehaves is that
>> NUM_numpart_from_char() will treat a '.' as being a decimal point
>> *without any regard to locale considerations*.  So even if we have
>> a locale-dependent format string and a locale that says '.' is a
>> thousands separator, it does the wrong thing.
>>
>> It's a bit surprising nobody's complained of this before.
>>
>> I propose the attached patch.  I'm slightly worried though about whether
>> this might break any existing applications that are (incorrectly)
>> depending on a D format specifier being able to match '.' regardless of
>> locale.  Perhaps we should only apply this to HEAD and not back-patch?
>
+1 only in HEAD. That's because (a) it doesn't crash, (b) it doesn't
always produce the "wrong" answer (only in some specific situation) and
(c) it has been like that for years without a complain. For those
reasons, it is better to continue with this "wrong" behavior in back
branches than prevent important security updates to be applied (without
applying a patch to preserve the "wrong" answer). This argument is only
valid for legacy closed-source apps but seems to have more weight than
the bug scenario.

> The manual claims that 'D' is locale dependent (whereas '.' is not), so
> _theoretically_ a back patch would make sense I guess.
>
I would consider a documentation bug in back branches because fix it
means break apps.


--
   Euler Taveira de Oliveira - Timbira       http://www.timbira.com.br/
   PostgreSQL: Consultoria, Desenvolvimento, Suporte 24x7 e Treinamento

Re: Inconsistency between TO_CHAR() and TO_NUMBER()

From

Heikki Linnakangas

Date:

13 May 2013, 10:45:35

On 11.05.2013 01:17, Euler Taveira wrote:
> On 10-05-2013 13:09, Thomas Kellerer wrote:
>> Tom Lane wrote on 10.05.2013 17:49:
>>> I looked into this, and find that the reason it misbehaves is that
>>> NUM_numpart_from_char() will treat a '.' as being a decimal point
>>> *without any regard to locale considerations*.  So even if we have
>>> a locale-dependent format string and a locale that says '.' is a
>>> thousands separator, it does the wrong thing.
>>>
>>> It's a bit surprising nobody's complained of this before.
>>>
>>> I propose the attached patch.  I'm slightly worried though about whether
>>> this might break any existing applications that are (incorrectly)
>>> depending on a D format specifier being able to match '.' regardless of
>>> locale.  Perhaps we should only apply this to HEAD and not back-patch?
>>
> +1 only in HEAD. That's because (a) it doesn't crash, (b) it doesn't
> always produce the "wrong" answer (only in some specific situation) and
> (c) it has been like that for years without a complain. For those
> reasons, it is better to continue with this "wrong" behavior in back
> branches than prevent important security updates to be applied (without
> applying a patch to preserve the "wrong" answer). This argument is only
> valid for legacy closed-source apps but seems to have more weight than
> the bug scenario.

+1 for HEAD-only. The Finnish language and locale uses comma (,) as the
decimal separator, and it's a real pain in the ass. And if something
goes wrong there, it can be *really* subtle. I once had to debug an
application where all prices were suddenly rounded down to the nearest
euro. And it only happened on some servers (those with locale set to
Finnish). It was not a PostgreSQL application, but it turned out to be a
bug in the JDBC driver of another DBMS.

Would it be possible to be lenient, and also accept . as the decimal
separator, when there is no ambiguity? Ie. when . is not the thousands
separator.

- Heikki

Re: Inconsistency between TO_CHAR() and TO_NUMBER()

From

Tom Lane

Date:

13 May 2013, 14:09:25

Heikki Linnakangas <hlinnakangas@vmware.com> writes:
> Would it be possible to be lenient, and also accept . as the decimal
> separator, when there is no ambiguity? Ie. when . is not the thousands
> separator.

I originally coded it that way, but concluded that it was probably a
waste of code space.  How many locales can you point to where neither
the decimal point nor thousands_sep is "."?  It certainly wouldn't be
enough to noticeably reduce the potential pain from this change, so
I decided that it'd be better to keep the behavior straightforward
and as-documented.

            regards, tom lane

Re: Inconsistency between TO_CHAR() and TO_NUMBER()

From

Heikki Linnakangas

Date:

13 May 2013, 14:56:28

On 13.05.2013 17:09, Tom Lane wrote:
> Heikki Linnakangas<hlinnakangas@vmware.com>  writes:
>> Would it be possible to be lenient, and also accept . as the decimal
>> separator, when there is no ambiguity? Ie. when . is not the thousands
>> separator.
>
> I originally coded it that way, but concluded that it was probably a
> waste of code space.  How many locales can you point to where neither
> the decimal point nor thousands_sep is "."?

On my laptop, there are eight locales that use "," as the decimal
separator and " " as the thousands separator.

$ grep -l "^thousands_sep.*U00A0" /usr/share/i18n/locales/* | xargs grep
-l "^decimal_point.*U002C"
/usr/share/i18n/locales/cs_CZ
/usr/share/i18n/locales/et_EE
/usr/share/i18n/locales/fi_FI
/usr/share/i18n/locales/lv_LV
/usr/share/i18n/locales/nb_NO
/usr/share/i18n/locales/ru_RU
/usr/share/i18n/locales/sk_SK
/usr/share/i18n/locales/uk_UA

Out of these, ru_RU actually uses "." as the LC_MONETARY decimal point,
even though it uses "," as the LC_NUMERIC decimal point. I think that
strengthens the argument for accepting both. I don't speak Russian, but
if you pass a monetary value to TO_NUMBER in ru_RU locale, using "." as
the decimal separator, you probably would expect it to work.

According to
http://en.wikipedia.org/wiki/Decimal_separator#Examples_of_use, many
countries accept either "1 234 567,89" or "1.234.567,89" style, but
looking at the locale files installed on my system, the latter style is
the one actually used (e.g Germany).

- Heikki

Re: Inconsistency between TO_CHAR() and TO_NUMBER()

From

Bruce Momjian

Date:

17 May 2013, 00:23:22

On Fri, May 10, 2013 at 11:49:18AM -0400, Tom Lane wrote:
> Patryk Kordylewski <pk@fooby.de> writes:
> > SET lc_numeric TO 'de_DE.UTF-8';
> > SET
>
> > SELECT
> >    TO_CHAR(123456789.123, 'FM99G999G999G999G999G999G999D000'),
> >    TO_NUMBER(TO_CHAR(123456789.123, 'FM99G999G999G999G999G999G999D000'),
> > 'FM99G999G999G999G999G999G999D000');
> >       to_char     | to_number
> > -----------------+-----------
> >   123.456.789,123 |   123.456
> > (1 row)
>
> I looked into this, and find that the reason it misbehaves is that
> NUM_numpart_from_char() will treat a '.' as being a decimal point
> *without any regard to locale considerations*.  So even if we have
> a locale-dependent format string and a locale that says '.' is a
> thousands separator, it does the wrong thing.
>
> It's a bit surprising nobody's complained of this before.
>
> I propose the attached patch.  I'm slightly worried though about whether
> this might break any existing applications that are (incorrectly)
> depending on a D format specifier being able to match '.' regardless of
> locale.  Perhaps we should only apply this to HEAD and not back-patch?

I never trust to_char() changes for backpatching.  ;-)

--
  Bruce Momjian  <bruce@momjian.us>        http://momjian.us
  EnterpriseDB                             http://enterprisedb.com

  + It's impossible for everything to be true. +