Thread: Re: [PATCH] Add roman support for to_number function
The following review has been posted through the commitfest application: make installcheck-world: tested, passed Implements feature: tested, failed Spec compliant: tested, failed Documentation: tested, passed Tested again, and the patch looks good. It does not accept leading or trailing whitespace, which seems reasonable, giventhe unclear behavior of to_number with other format strings. It also rejects less common Roman spellings like "IIII".I don't feel strongly about that one way or the other, but perhaps a test codifying that behavior would be usefulto make it clear it's intentional. I'm marking it RfC. The new status of this patch is: Ready for Committer
Sorry, it looks like I failed to accurately log my review in the review app due to the current broken layout issues [1]. The summary should be: make installcheck-world: tested, passed Implements feature: tested, passed Spec compliant: not tested (not sure what the spec has to say about this) Documentation: tested, passed [1]: https://www.postgresql.org/message-id/flat/CAD68Dp1GgTeBiA0YVWXpfPCMC7%3DnqBCLn6%2BhuOkWURcKRnFw5g%40mail.gmail.com
Maciek Sakrejda <m.sakrejda@gmail.com> writes: > Tested again, and the patch looks good. It does not accept leading or trailing whitespace, which seems reasonable, giventhe unclear behavior of to_number with other format strings. It also rejects less common Roman spellings like "IIII".I don't feel strongly about that one way or the other, but perhaps a test codifying that behavior would be usefulto make it clear it's intentional. Yeah, I don't have a strong feeling about that either, but probably being strict is better. to_number has a big problem with "garbage in garbage out" already, and being lax will make that worse. A few notes from a quick read of the patch: * roman_to_int() should have a header comment, notably explaining its result convention. I find it fairly surprising that "0" means an error --- I realize that Roman notation can't represent zero, but wouldn't it be better to use -1? * Do *not* rely on toupper(). There are multiple reasons why not, but in this context the worst one is that in Turkish locales it's likely to translate "i" to "İ", on which you will fail. I'd use pg_ascii_toupper(). * I think roman_to_int() is under-commented internally too. To start with, where did the magic "15" come from? And why should we have such a test anyway --- what if the format allows for trailing stuff after the roman numeral? (That is, I think you need to fix this so that roman_to_int reports how much data it ate, instead of assuming that it must read to the end of the input.) The logic for detecting invalid numeral combinations feels a little opaque too. Do you have a source for the rules you're implementing, and if so could you cite it? * This code will get run through pgindent before commit, so you might want to revisit whether your comments will still look nice afterwards. There's not a lot of consistency in them about initial cap versus lower case or trailing period versus not, too. * roman_result could be declared where used, no? * I'm not really convinced that we need a new errcode ERRCODE_INVALID_ROMAN_NUMERAL rather than using a generic one like ERRCODE_INVALID_TEXT_REPRESENTATION. However, if we do do it like that, why did you pick the out-of-sequence value 22P07? * Might be good to have a few test cases demonstrating what happens when there's other stuff combined with the RN format spec. Notably, even if RN itself won't eat whitespace, there had better be a way to write the format string to allow that. * Further to Aleksander's point about lack of test coverage for the to_char direction, I see from https://coverage.postgresql.org/src/backend/utils/adt/formatting.c.gcov.html that almost none of the existing roman-number code paths are covered today. While it's not strictly within the charter of this patch to improve that, maybe we should try to do so --- at the very least it'd raise formatting.c's coverage score a few notches. regards, tom lane
I wrote: > * Further to Aleksander's point about lack of test coverage for > the to_char direction, I see from > https://coverage.postgresql.org/src/backend/utils/adt/formatting.c.gcov.html > that almost none of the existing roman-number code paths are covered > today. While it's not strictly within the charter of this patch > to improve that, maybe we should try to do so --- at the very > least it'd raise formatting.c's coverage score a few notches. For the archives' sake: I created a patch and a separate discussion thread [1] for that. regards, tom lane [1] https://www.postgresql.org/message-id/flat/2956175.1725831136@sss.pgh.pa.us
Hi,
On Sun, Sep 8, 2024 at 4:09 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
A few notes from a quick read of the patch:
* roman_to_int() should have a header comment, notably explaining its
result convention. I find it fairly surprising that "0" means an
error --- I realize that Roman notation can't represent zero, but
wouldn't it be better to use -1?
Noted.
* Do *not* rely on toupper(). There are multiple reasons why not,
but in this context the worst one is that in Turkish locales it's
likely to translate "i" to "İ", on which you will fail. I'd use
pg_ascii_toupper().
Noted.
* I think roman_to_int() is under-commented internally too.
To start with, where did the magic "15" come from? And why
should we have such a test anyway --- what if the format allows
for trailing stuff after the roman numeral? (That is, I think
you need to fix this so that roman_to_int reports how much data
it ate, instead of assuming that it must read to the end of the
input.)
MMMDCCCLXXXVIII is the longest valid standard roman numeral (15 characters). I will add appropriate comment on length check.
I am not sure I am able to understand the latter part. Could you please elaborate it?
Are you referring to cases like these?
SELECT to_number('CXXIII foo', 'RN foo');
SELECT to_number('CXXIII foo', 'RN');
Please check my comments on Oracle's behaviour below.
Please check my comments on Oracle's behaviour below.
The logic for detecting invalid numeral combinations
feels a little opaque too. Do you have a source for the rules
you're implementing, and if so could you cite it?
There are many sources on the internet.
A few sources:
Note that we are following the standard form: https://en.wikipedia.org/wiki/Roman_numerals#Standard_form
* This code will get run through pgindent before commit, so you
might want to revisit whether your comments will still look nice
afterwards. There's not a lot of consistency in them about initial
cap versus lower case or trailing period versus not, too.
Noted.
* roman_result could be declared where used, no?
Noted.
* I'm not really convinced that we need a new errcode
ERRCODE_INVALID_ROMAN_NUMERAL rather than using a generic one like
ERRCODE_INVALID_TEXT_REPRESENTATION. However, if we do do it
like that, why did you pick the out-of-sequence value 22P07?
22P07 is not out-of-sequence. 22P01 to 22P06 are already used.
However, I do agree with you that we can use ERRCODE_INVALID_TEXT_REPRESENTATION. I will change it.
* Might be good to have a few test cases demonstrating what happens
when there's other stuff combined with the RN format spec. Notably,
even if RN itself won't eat whitespace, there had better be a way
to write the format string to allow that.
The current patch (v3) simply ignores other formats with RN except when RN is used EEEE which returns error.
```
postgres=# SELECT to_number('CXXIII', 'foo RN');
to_number
-----------
123
(1 row)
to_number
-----------
123
(1 row)
postgres=# SELECT to_number('CXXIII', 'MIrn');
to_number
-----------
123
(1 row)
postgres=# SELECT to_number('CXXIII', 'EEEErn');
ERROR: "EEEE" must be the last pattern used
```
However, I think that other formats except FM should be rejected when used with RN in NUMDesc_prepare function. Any opinions?
If we look into Oracle, they are strict in this regard and don't allow any other format with RN.
1. SELECT to_char(12, 'MIrn') from dual;
2. SELECT to_char(12, 'foo RN') from dual;
results in error:
ORA-01481: invalid number format model
I also found that there is no check when RN is used twice (both in to_char and to_number) and that results in unexpected output.
```
postgres=# SELECT to_number('CXXIII', 'RNrn');
to_number
-----------
123
(1 row)
postgres=# SELECT to_char(12, 'RNrn');
to_char
--------------------------------
XII xii
(1 row)
postgres=# SELECT to_char(12, 'rnrn');
to_char
--------------------------------
xii xii
(1 row)
postgres=# SELECT to_char(12, 'FMrnrn');
to_char
---------
xiixii
(1 row)
to_number
-----------
123
(1 row)
postgres=# SELECT to_char(12, 'RNrn');
to_char
--------------------------------
XII xii
(1 row)
postgres=# SELECT to_char(12, 'rnrn');
to_char
--------------------------------
xii xii
(1 row)
postgres=# SELECT to_char(12, 'FMrnrn');
to_char
---------
xiixii
(1 row)
```
* Further to Aleksander's point about lack of test coverage for
the to_char direction, I see from
https://coverage.postgresql.org/src/backend/utils/adt/formatting.c.gcov.html
that almost none of the existing roman-number code paths are covered
today. While it's not strictly within the charter of this patch
to improve that, maybe we should try to do so --- at the very
least it'd raise formatting.c's coverage score a few notches.
I see that you have provided a patch to increase test coverage of to_char roman format including some other formats. I will try to add more tests for the to_number roman format.
I will provide the next patch soon.
Regards,
Hunaid Sohail
Hi,
I have started working on the next version of the patch and have addressed the smaller issues identified by Tom. Before proceeding further, I would like to have opinions on some comments/questions in my previous email.
Regards,
Hunaid Sohail
Hunaid Sohail
On Mon, Sep 9, 2024 at 5:45 PM Hunaid Sohail <hunaidpgml@gmail.com> wrote:
Hi,On Sun, Sep 8, 2024 at 4:09 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:A few notes from a quick read of the patch:
* roman_to_int() should have a header comment, notably explaining its
result convention. I find it fairly surprising that "0" means an
error --- I realize that Roman notation can't represent zero, but
wouldn't it be better to use -1?Noted.
* Do *not* rely on toupper(). There are multiple reasons why not,
but in this context the worst one is that in Turkish locales it's
likely to translate "i" to "İ", on which you will fail. I'd use
pg_ascii_toupper().Noted.
* I think roman_to_int() is under-commented internally too.
To start with, where did the magic "15" come from? And why
should we have such a test anyway --- what if the format allows
for trailing stuff after the roman numeral? (That is, I think
you need to fix this so that roman_to_int reports how much data
it ate, instead of assuming that it must read to the end of the
input.)MMMDCCCLXXXVIII is the longest valid standard roman numeral (15 characters). I will add appropriate comment on length check.I am not sure I am able to understand the latter part. Could you please elaborate it?Are you referring to cases like these?SELECT to_number('CXXIII foo', 'RN foo');SELECT to_number('CXXIII foo', 'RN');
Please check my comments on Oracle's behaviour below.The logic for detecting invalid numeral combinations
feels a little opaque too. Do you have a source for the rules
you're implementing, and if so could you cite it?There are many sources on the internet.A few sources:Note that we are following the standard form: https://en.wikipedia.org/wiki/Roman_numerals#Standard_form
* This code will get run through pgindent before commit, so you
might want to revisit whether your comments will still look nice
afterwards. There's not a lot of consistency in them about initial
cap versus lower case or trailing period versus not, too.Noted.
* roman_result could be declared where used, no?Noted.
* I'm not really convinced that we need a new errcode
ERRCODE_INVALID_ROMAN_NUMERAL rather than using a generic one like
ERRCODE_INVALID_TEXT_REPRESENTATION. However, if we do do it
like that, why did you pick the out-of-sequence value 22P07?22P07 is not out-of-sequence. 22P01 to 22P06 are already used.However, I do agree with you that we can use ERRCODE_INVALID_TEXT_REPRESENTATION. I will change it.
* Might be good to have a few test cases demonstrating what happens
when there's other stuff combined with the RN format spec. Notably,
even if RN itself won't eat whitespace, there had better be a way
to write the format string to allow that.The current patch (v3) simply ignores other formats with RN except when RN is used EEEE which returns error.```postgres=# SELECT to_number('CXXIII', 'foo RN');
to_number
-----------
123
(1 row)
postgres=# SELECT to_number('CXXIII', 'MIrn');
to_number
-----------
123
(1 row)
postgres=# SELECT to_number('CXXIII', 'EEEErn');
ERROR: "EEEE" must be the last pattern used```However, I think that other formats except FM should be rejected when used with RN in NUMDesc_prepare function. Any opinions?If we look into Oracle, they are strict in this regard and don't allow any other format with RN.1. SELECT to_char(12, 'MIrn') from dual;2. SELECT to_char(12, 'foo RN') from dual;results in error:ORA-01481: invalid number format modelI also found that there is no check when RN is used twice (both in to_char and to_number) and that results in unexpected output.```postgres=# SELECT to_number('CXXIII', 'RNrn');
to_number
-----------
123
(1 row)
postgres=# SELECT to_char(12, 'RNrn');
to_char
--------------------------------
XII xii
(1 row)
postgres=# SELECT to_char(12, 'rnrn');
to_char
--------------------------------
xii xii
(1 row)
postgres=# SELECT to_char(12, 'FMrnrn');
to_char
---------
xiixii
(1 row)```
* Further to Aleksander's point about lack of test coverage for
the to_char direction, I see from
https://coverage.postgresql.org/src/backend/utils/adt/formatting.c.gcov.html
that almost none of the existing roman-number code paths are covered
today. While it's not strictly within the charter of this patch
to improve that, maybe we should try to do so --- at the very
least it'd raise formatting.c's coverage score a few notches.I see that you have provided a patch to increase test coverage of to_char roman format including some other formats. I will try to add more tests for the to_number roman format.I will provide the next patch soon.Regards,Hunaid Sohail
Hi, On 10/30/24 08:07, Hunaid Sohail wrote: > Hi, > > I have attached a rebased patch if someone wants to review in the next > CF. No changes as compared to v4. > > Regards, > Hunaid Sohail > Thanks for a nice patch. I did a quick review today, seems almost there, I only have a couple minor comments: 1) Template Patterns for Numeric Formatting Why the wording change? "input between 1 and 3999" sounds fine to me. 2) The docs say "standard numbers" but I'm not sure what that is? We don't use that term anywhere else, I think. Same for "standard roman numeral". 3) A couple comments need a bit of formatting. Multi-line comments should start with an empty line, some lines are a bit too long. 4) I was wondering if the regression tests check all interesting cases, and it seems none of the tests hit these two cases: if (vCount > 1 || lCount > 1 || dCount > 1) return -1; and if (!IS_VALID_SUB_COMB(currChar, nextChar)) return -1; I haven't tried constructing tests to hit those cases, though. Seems ready to go otherwise. regards -- Tomas Vondra
Hi,
Thanks for reviewing this patch.
On Mon, Dec 9, 2024 at 3:07 AM Tomas Vondra <tomas@vondra.me> wrote:
Thanks for a nice patch. I did a quick review today, seems almost there,
I only have a couple minor comments:
1) Template Patterns for Numeric Formatting
Why the wording change? "input between 1 and 3999" sounds fine to me.
input... seemed to refer to numeric input for to_char roman format. But, after roman support in to_number function shouldn't we modify it? It is the reason I changed it to "valid for numbers 1 to 3999".
2) The docs say "standard numbers" but I'm not sure what that is? We
don't use that term anywhere else, I think. Same for "standard roman
numeral".
I will change "standard numbers" to "numbers".
Note that we are following the standard form. There are other forms too. For example, 4 can be represented as IV (standard) or IIII (non standard).
3) A couple comments need a bit of formatting. Multi-line comments
should start with an empty line, some lines are a bit too long.
I will fix the multi-line formatting.
4) I was wondering if the regression tests check all interesting cases,
and it seems none of the tests hit these two cases:
if (vCount > 1 || lCount > 1 || dCount > 1)
return -1;
SELECT to_number('viv', 'RN') test covers this if statement for the subtraction part. But, I noticed that no test covers simple counts of V, L, D.
I will add SELECT to_number('VV', 'RN') for that.
and
if (!IS_VALID_SUB_COMB(currChar, nextChar))
return -1;
Noted. SELECT to_number('IL', 'RN') test can be added.
Regards,
Hunaid Sohail