Thread: Re: [PATCH] Add roman support for to_number function

Re: [PATCH] Add roman support for to_number function

From
Maciek Sakrejda
Date:
The following review has been posted through the commitfest application:
make installcheck-world:  tested, passed
Implements feature:       tested, failed
Spec compliant:           tested, failed
Documentation:            tested, passed

Tested again, and the patch looks good. It does not accept leading or trailing whitespace, which seems reasonable,
giventhe unclear behavior of to_number with other format strings. It also rejects less common Roman spellings like
"IIII".I don't feel strongly about that one way or the other, but perhaps a test codifying that behavior would be
usefulto make it clear it's intentional.
 

I'm marking it RfC.

The new status of this patch is: Ready for Committer

Re: [PATCH] Add roman support for to_number function

From
Maciek Sakrejda
Date:
Sorry, it looks like I failed to accurately log my review in the
review app due to the current broken layout issues [1]. The summary
should be:

make installcheck-world:  tested, passed
Implements feature:       tested, passed
Spec compliant:           not tested (not sure what the spec has to
say about this)
Documentation:            tested, passed

[1]:
https://www.postgresql.org/message-id/flat/CAD68Dp1GgTeBiA0YVWXpfPCMC7%3DnqBCLn6%2BhuOkWURcKRnFw5g%40mail.gmail.com



Re: [PATCH] Add roman support for to_number function

From
Tom Lane
Date:
Maciek Sakrejda <m.sakrejda@gmail.com> writes:
> Tested again, and the patch looks good. It does not accept leading or trailing whitespace, which seems reasonable,
giventhe unclear behavior of to_number with other format strings. It also rejects less common Roman spellings like
"IIII".I don't feel strongly about that one way or the other, but perhaps a test codifying that behavior would be
usefulto make it clear it's intentional. 

Yeah, I don't have a strong feeling about that either, but probably
being strict is better.  to_number has a big problem with "garbage in
garbage out" already, and being lax will make that worse.

A few notes from a quick read of the patch:

* roman_to_int() should have a header comment, notably explaining its
result convention.  I find it fairly surprising that "0" means an
error --- I realize that Roman notation can't represent zero, but
wouldn't it be better to use -1?

* Do *not* rely on toupper().  There are multiple reasons why not,
but in this context the worst one is that in Turkish locales it's
likely to translate "i" to "İ", on which you will fail.  I'd use
pg_ascii_toupper().

* I think roman_to_int() is under-commented internally too.
To start with, where did the magic "15" come from?  And why
should we have such a test anyway --- what if the format allows
for trailing stuff after the roman numeral?  (That is, I think
you need to fix this so that roman_to_int reports how much data
it ate, instead of assuming that it must read to the end of the
input.)  The logic for detecting invalid numeral combinations
feels a little opaque too.  Do you have a source for the rules
you're implementing, and if so could you cite it?

* This code will get run through pgindent before commit, so you
might want to revisit whether your comments will still look nice
afterwards.  There's not a lot of consistency in them about initial
cap versus lower case or trailing period versus not, too.

* roman_result could be declared where used, no?

* I'm not really convinced that we need a new errcode
ERRCODE_INVALID_ROMAN_NUMERAL rather than using a generic one like
ERRCODE_INVALID_TEXT_REPRESENTATION.  However, if we do do it
like that, why did you pick the out-of-sequence value 22P07?

* Might be good to have a few test cases demonstrating what happens
when there's other stuff combined with the RN format spec.  Notably,
even if RN itself won't eat whitespace, there had better be a way
to write the format string to allow that.

* Further to Aleksander's point about lack of test coverage for
the to_char direction, I see from
https://coverage.postgresql.org/src/backend/utils/adt/formatting.c.gcov.html
that almost none of the existing roman-number code paths are covered
today.  While it's not strictly within the charter of this patch
to improve that, maybe we should try to do so --- at the very
least it'd raise formatting.c's coverage score a few notches.

            regards, tom lane



Re: [PATCH] Add roman support for to_number function

From
Tom Lane
Date:
I wrote:
> * Further to Aleksander's point about lack of test coverage for
> the to_char direction, I see from
> https://coverage.postgresql.org/src/backend/utils/adt/formatting.c.gcov.html
> that almost none of the existing roman-number code paths are covered
> today.  While it's not strictly within the charter of this patch
> to improve that, maybe we should try to do so --- at the very
> least it'd raise formatting.c's coverage score a few notches.

For the archives' sake: I created a patch and a separate discussion
thread [1] for that.

            regards, tom lane

[1] https://www.postgresql.org/message-id/flat/2956175.1725831136@sss.pgh.pa.us



Re: [PATCH] Add roman support for to_number function

From
Hunaid Sohail
Date:
Hi,

On Sun, Sep 8, 2024 at 4:09 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
A few notes from a quick read of the patch:

* roman_to_int() should have a header comment, notably explaining its
result convention.  I find it fairly surprising that "0" means an
error --- I realize that Roman notation can't represent zero, but
wouldn't it be better to use -1?

Noted.
 

* Do *not* rely on toupper().  There are multiple reasons why not,
but in this context the worst one is that in Turkish locales it's
likely to translate "i" to "İ", on which you will fail.  I'd use
pg_ascii_toupper().

Noted.
 

* I think roman_to_int() is under-commented internally too.
To start with, where did the magic "15" come from?  And why
should we have such a test anyway --- what if the format allows
for trailing stuff after the roman numeral?  (That is, I think
you need to fix this so that roman_to_int reports how much data
it ate, instead of assuming that it must read to the end of the
input.) 

MMMDCCCLXXXVIII is the longest valid standard roman numeral (15 characters). I will add appropriate comment on length check.

I am not sure I am able to understand the latter part. Could you please elaborate it?
Are you referring to cases like these?
SELECT to_number('CXXIII foo', 'RN foo');
SELECT to_number('CXXIII foo', 'RN');
Please check my comments on Oracle's behaviour below.

 
The logic for detecting invalid numeral combinations
feels a little opaque too.  Do you have a source for the rules
you're implementing, and if so could you cite it?

There are many sources on the internet.
A few sources:

Note that we are following the standard form: https://en.wikipedia.org/wiki/Roman_numerals#Standard_form
 

* This code will get run through pgindent before commit, so you
might want to revisit whether your comments will still look nice
afterwards.  There's not a lot of consistency in them about initial
cap versus lower case or trailing period versus not, too.

Noted.
 

* roman_result could be declared where used, no?

Noted.
 

* I'm not really convinced that we need a new errcode
ERRCODE_INVALID_ROMAN_NUMERAL rather than using a generic one like
ERRCODE_INVALID_TEXT_REPRESENTATION.  However, if we do do it
like that, why did you pick the out-of-sequence value 22P07?

22P07 is not out-of-sequence. 22P01 to 22P06 are already used.
However, I do agree with you that we can use ERRCODE_INVALID_TEXT_REPRESENTATION. I will change it.
 

* Might be good to have a few test cases demonstrating what happens
when there's other stuff combined with the RN format spec.  Notably,
even if RN itself won't eat whitespace, there had better be a way
to write the format string to allow that.

The current patch (v3) simply ignores other formats with RN except when RN is used EEEE which returns error.
```
postgres=# SELECT to_number('CXXIII', 'foo RN');
 to_number
-----------
       123
(1 row)

postgres=# SELECT to_number('CXXIII', 'MIrn');
 to_number
-----------
       123
(1 row)

postgres=# SELECT to_number('CXXIII', 'EEEErn');
ERROR:  "EEEE" must be the last pattern used
```

However, I think that other formats except FM should be rejected when used with RN in NUMDesc_prepare function. Any opinions?

If we look into Oracle, they are strict in this regard and don't allow any other format with RN.
1. SELECT to_char(12, 'MIrn') from dual;
2. SELECT to_char(12, 'foo RN') from dual;
results in error:
ORA-01481: invalid number format model

I also found that there is no check when RN is used twice (both in to_char and to_number) and that results in unexpected output.

```
postgres=# SELECT to_number('CXXIII', 'RNrn');
 to_number
-----------
       123
(1 row)

postgres=# SELECT to_char(12, 'RNrn');
            to_char
--------------------------------
             XII            xii
(1 row)

postgres=# SELECT to_char(12, 'rnrn');
            to_char
--------------------------------
             xii            xii
(1 row)

postgres=# SELECT to_char(12, 'FMrnrn');
 to_char
---------
 xiixii
(1 row)
``` 


* Further to Aleksander's point about lack of test coverage for
the to_char direction, I see from
https://coverage.postgresql.org/src/backend/utils/adt/formatting.c.gcov.html
that almost none of the existing roman-number code paths are covered
today.  While it's not strictly within the charter of this patch
to improve that, maybe we should try to do so --- at the very
least it'd raise formatting.c's coverage score a few notches.


I see that you have provided a patch to increase test coverage of to_char roman format including some other formats. I will try to add more tests for the to_number roman format.


I will provide the next patch soon.

Regards,
Hunaid Sohail

Re: [PATCH] Add roman support for to_number function

From
Hunaid Sohail
Date:
Hi,

I have started working on the next version of the patch and have addressed the smaller issues identified by Tom. Before proceeding further, I would like to have opinions on some comments/questions in my previous email.

Regards,
Hunaid Sohail

On Mon, Sep 9, 2024 at 5:45 PM Hunaid Sohail <hunaidpgml@gmail.com> wrote:
Hi,

On Sun, Sep 8, 2024 at 4:09 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
A few notes from a quick read of the patch:

* roman_to_int() should have a header comment, notably explaining its
result convention.  I find it fairly surprising that "0" means an
error --- I realize that Roman notation can't represent zero, but
wouldn't it be better to use -1?

Noted.
 

* Do *not* rely on toupper().  There are multiple reasons why not,
but in this context the worst one is that in Turkish locales it's
likely to translate "i" to "İ", on which you will fail.  I'd use
pg_ascii_toupper().

Noted.
 

* I think roman_to_int() is under-commented internally too.
To start with, where did the magic "15" come from?  And why
should we have such a test anyway --- what if the format allows
for trailing stuff after the roman numeral?  (That is, I think
you need to fix this so that roman_to_int reports how much data
it ate, instead of assuming that it must read to the end of the
input.) 

MMMDCCCLXXXVIII is the longest valid standard roman numeral (15 characters). I will add appropriate comment on length check.

I am not sure I am able to understand the latter part. Could you please elaborate it?
Are you referring to cases like these?
SELECT to_number('CXXIII foo', 'RN foo');
SELECT to_number('CXXIII foo', 'RN');
Please check my comments on Oracle's behaviour below.

 
The logic for detecting invalid numeral combinations
feels a little opaque too.  Do you have a source for the rules
you're implementing, and if so could you cite it?

There are many sources on the internet.
A few sources:

Note that we are following the standard form: https://en.wikipedia.org/wiki/Roman_numerals#Standard_form
 

* This code will get run through pgindent before commit, so you
might want to revisit whether your comments will still look nice
afterwards.  There's not a lot of consistency in them about initial
cap versus lower case or trailing period versus not, too.

Noted.
 

* roman_result could be declared where used, no?

Noted.
 

* I'm not really convinced that we need a new errcode
ERRCODE_INVALID_ROMAN_NUMERAL rather than using a generic one like
ERRCODE_INVALID_TEXT_REPRESENTATION.  However, if we do do it
like that, why did you pick the out-of-sequence value 22P07?

22P07 is not out-of-sequence. 22P01 to 22P06 are already used.
However, I do agree with you that we can use ERRCODE_INVALID_TEXT_REPRESENTATION. I will change it.
 

* Might be good to have a few test cases demonstrating what happens
when there's other stuff combined with the RN format spec.  Notably,
even if RN itself won't eat whitespace, there had better be a way
to write the format string to allow that.

The current patch (v3) simply ignores other formats with RN except when RN is used EEEE which returns error.
```
postgres=# SELECT to_number('CXXIII', 'foo RN');
 to_number
-----------
       123
(1 row)

postgres=# SELECT to_number('CXXIII', 'MIrn');
 to_number
-----------
       123
(1 row)

postgres=# SELECT to_number('CXXIII', 'EEEErn');
ERROR:  "EEEE" must be the last pattern used
```

However, I think that other formats except FM should be rejected when used with RN in NUMDesc_prepare function. Any opinions?

If we look into Oracle, they are strict in this regard and don't allow any other format with RN.
1. SELECT to_char(12, 'MIrn') from dual;
2. SELECT to_char(12, 'foo RN') from dual;
results in error:
ORA-01481: invalid number format model

I also found that there is no check when RN is used twice (both in to_char and to_number) and that results in unexpected output.

```
postgres=# SELECT to_number('CXXIII', 'RNrn');
 to_number
-----------
       123
(1 row)

postgres=# SELECT to_char(12, 'RNrn');
            to_char
--------------------------------
             XII            xii
(1 row)

postgres=# SELECT to_char(12, 'rnrn');
            to_char
--------------------------------
             xii            xii
(1 row)

postgres=# SELECT to_char(12, 'FMrnrn');
 to_char
---------
 xiixii
(1 row)
``` 


* Further to Aleksander's point about lack of test coverage for
the to_char direction, I see from
https://coverage.postgresql.org/src/backend/utils/adt/formatting.c.gcov.html
that almost none of the existing roman-number code paths are covered
today.  While it's not strictly within the charter of this patch
to improve that, maybe we should try to do so --- at the very
least it'd raise formatting.c's coverage score a few notches.


I see that you have provided a patch to increase test coverage of to_char roman format including some other formats. I will try to add more tests for the to_number roman format.


I will provide the next patch soon.

Regards,
Hunaid Sohail