Re: [PATCH] Add roman support for to_number function - Mailing list pgsql-hackers

From Hunaid Sohail
Subject Re: [PATCH] Add roman support for to_number function
Date
Msg-id CAMWA6yYNetOJRCEGPUrA9BENPM_Dk0BvxBJsvECKqaQtR7XkxA@mail.gmail.com
Whole thread Raw
In response to Re: [PATCH] Add roman support for to_number function  (Hunaid Sohail <hunaidpgml@gmail.com>)
List pgsql-hackers
Hi,

I have started working on the next version of the patch and have addressed the smaller issues identified by Tom. Before proceeding further, I would like to have opinions on some comments/questions in my previous email.

Regards,
Hunaid Sohail

On Mon, Sep 9, 2024 at 5:45 PM Hunaid Sohail <hunaidpgml@gmail.com> wrote:
Hi,

On Sun, Sep 8, 2024 at 4:09 AM Tom Lane <tgl@sss.pgh.pa.us> wrote:
A few notes from a quick read of the patch:

* roman_to_int() should have a header comment, notably explaining its
result convention.  I find it fairly surprising that "0" means an
error --- I realize that Roman notation can't represent zero, but
wouldn't it be better to use -1?

Noted.
 

* Do *not* rely on toupper().  There are multiple reasons why not,
but in this context the worst one is that in Turkish locales it's
likely to translate "i" to "İ", on which you will fail.  I'd use
pg_ascii_toupper().

Noted.
 

* I think roman_to_int() is under-commented internally too.
To start with, where did the magic "15" come from?  And why
should we have such a test anyway --- what if the format allows
for trailing stuff after the roman numeral?  (That is, I think
you need to fix this so that roman_to_int reports how much data
it ate, instead of assuming that it must read to the end of the
input.) 

MMMDCCCLXXXVIII is the longest valid standard roman numeral (15 characters). I will add appropriate comment on length check.

I am not sure I am able to understand the latter part. Could you please elaborate it?
Are you referring to cases like these?
SELECT to_number('CXXIII foo', 'RN foo');
SELECT to_number('CXXIII foo', 'RN');
Please check my comments on Oracle's behaviour below.

 
The logic for detecting invalid numeral combinations
feels a little opaque too.  Do you have a source for the rules
you're implementing, and if so could you cite it?

There are many sources on the internet.
A few sources:

Note that we are following the standard form: https://en.wikipedia.org/wiki/Roman_numerals#Standard_form
 

* This code will get run through pgindent before commit, so you
might want to revisit whether your comments will still look nice
afterwards.  There's not a lot of consistency in them about initial
cap versus lower case or trailing period versus not, too.

Noted.
 

* roman_result could be declared where used, no?

Noted.
 

* I'm not really convinced that we need a new errcode
ERRCODE_INVALID_ROMAN_NUMERAL rather than using a generic one like
ERRCODE_INVALID_TEXT_REPRESENTATION.  However, if we do do it
like that, why did you pick the out-of-sequence value 22P07?

22P07 is not out-of-sequence. 22P01 to 22P06 are already used.
However, I do agree with you that we can use ERRCODE_INVALID_TEXT_REPRESENTATION. I will change it.
 

* Might be good to have a few test cases demonstrating what happens
when there's other stuff combined with the RN format spec.  Notably,
even if RN itself won't eat whitespace, there had better be a way
to write the format string to allow that.

The current patch (v3) simply ignores other formats with RN except when RN is used EEEE which returns error.
```
postgres=# SELECT to_number('CXXIII', 'foo RN');
 to_number
-----------
       123
(1 row)

postgres=# SELECT to_number('CXXIII', 'MIrn');
 to_number
-----------
       123
(1 row)

postgres=# SELECT to_number('CXXIII', 'EEEErn');
ERROR:  "EEEE" must be the last pattern used
```

However, I think that other formats except FM should be rejected when used with RN in NUMDesc_prepare function. Any opinions?

If we look into Oracle, they are strict in this regard and don't allow any other format with RN.
1. SELECT to_char(12, 'MIrn') from dual;
2. SELECT to_char(12, 'foo RN') from dual;
results in error:
ORA-01481: invalid number format model

I also found that there is no check when RN is used twice (both in to_char and to_number) and that results in unexpected output.

```
postgres=# SELECT to_number('CXXIII', 'RNrn');
 to_number
-----------
       123
(1 row)

postgres=# SELECT to_char(12, 'RNrn');
            to_char
--------------------------------
             XII            xii
(1 row)

postgres=# SELECT to_char(12, 'rnrn');
            to_char
--------------------------------
             xii            xii
(1 row)

postgres=# SELECT to_char(12, 'FMrnrn');
 to_char
---------
 xiixii
(1 row)
``` 


* Further to Aleksander's point about lack of test coverage for
the to_char direction, I see from
https://coverage.postgresql.org/src/backend/utils/adt/formatting.c.gcov.html
that almost none of the existing roman-number code paths are covered
today.  While it's not strictly within the charter of this patch
to improve that, maybe we should try to do so --- at the very
least it'd raise formatting.c's coverage score a few notches.


I see that you have provided a patch to increase test coverage of to_char roman format including some other formats. I will try to add more tests for the to_number roman format.


I will provide the next patch soon.

Regards,
Hunaid Sohail

pgsql-hackers by date:

Previous
From: "David G. Johnston"
Date:
Subject: Re: Accept invalidation messages before the query starts inside a transaction
Next
From: David Rowley
Date:
Subject: Re: Trim the heap free memory