Thread: EBCDIC sorting as a use case for ICU rules

EBCDIC sorting as a use case for ICU rules

From

"Daniel Verite"

Date:

21 June 2023, 13:28:38

Hi,

In the "Order changes in PG16 since ICU introduction" discussion, one
sub-thread [1] was about having a credible use case for tailoring collations
with custom rules, a new feature in v16.

At a conference this week I was asked if ICU could be able to
sort like EBCDIC [2]. It turns out it has been already    asked on
-general a few years ago [3] with no satisfactory answer at the time ,
and that it can be implemented with rules in v16.

A collation like the following this seems to work (the rule simply enumerates
US-ASCII letters in the EBCDIC alphabet order, with adequate quoting)

CREATE COLLATION ebcdic (provider='icu', locale='und',
rules=$$&'

'<'.'<'<'<'('<'+'<\|<'&'<'!'<'$'<'*'<')'<';'<'-'<'/'<','<'%'<'_'<'>'<'?'<'`'<':'<'#'<'@'<\'<'='<'"'<a<b<c<d<e<f<g<h<i<j<k<l<m<n<o<p<q<r<'~'<s<t<u<v<w<x<y<z<'['<'^'<']'<'{'<A<B<C<D<E<F<G<H<I<'}'<J<K<L<M<N<O<P<Q<R<'\'<S<T<U<V<W<X<Y<Z<0<1<2<3<4<5<6<7<8<9$$);

This can be useful for people who migrate from mainframes to Postgres
and need their migration tests to produce the same sorted results as the
original system.
Since rules can be defined at the database level with the icu_rules option,
they don't even need to tweak their queries to add COLLATE clauses,
which surely is appreciable in that kind of project.

US-ASCII when sorted in EBCDIC order comes out like this:

.<(+|&!$*);-/,%_>?`:#@'="abcdefghijklmnopqr~stuvwxyz[^]{ABCDEFGHI}JKLMNOPQR\ST
UVWXYZ0123456789

Maybe this example could be added to the documentation except for
the problem that the rule is very long and dollar-quoting cannot be split
into several lines. Literals enclosed by single quotes can be split that
way, but would require escaping the single quotes in the rule, which
would lead to scary-looking over-quoted contents.

I'm open to suggestions on whether this EBCDIC example is worth being in the
doc in some form or putting this in the wiki would be good enough.



[1]
https://www.postgresql.org/message-id/flat/a28aba5fa6bf1abfff96e40b6d6acff8412edb15.camel%40j-davis.com

[2] https://en.wikipedia.org/wiki/EBCDIC

[3]
https://www.postgresql.org/message-id/flat/0A3221C70F24FB45833433255569204D1F84A7AD%40G01JPEXMBYT05


Best regards,
--
Daniel Vérité
https://postgresql.verite.pro/
Twitter: @DanielVerite

Re: EBCDIC sorting as a use case for ICU rules

From

Joe Conway

Date:

21 June 2023, 15:50:15

On 6/21/23 09:28, Daniel Verite wrote:
> In the "Order changes in PG16 since ICU introduction" discussion, one
> sub-thread [1] was about having a credible use case for tailoring collations
> with custom rules, a new feature in v16.
> 
> At a conference this week I was asked if ICU could be able to
> sort like EBCDIC [2]. It turns out it has been already    asked on
> -general a few years ago [3] with no satisfactory answer at the time ,
> and that it can be implemented with rules in v16.

Oh, very cool! I have seen the requirement for EBCDIC come up multiple 
times over the years.

<snip>

> Maybe this example could be added to the documentation except for
> the problem that the rule is very long and dollar-quoting cannot be split
> into several lines. Literals enclosed by single quotes can be split that
> way, but would require escaping the single quotes in the rule, which
> would lead to scary-looking over-quoted contents.
> 
> I'm open to suggestions on whether this EBCDIC example is worth being in the
> doc in some form or putting this in the wiki would be good enough.

I would definitely favor adding to the docs, but no idea how to deal 
with the length issue.


-- 
Joe Conway
PostgreSQL Contributors Team
RDS Open Source Databases
Amazon Web Services: https://aws.amazon.com

Re: EBCDIC sorting as a use case for ICU rules

From

Jeff Davis

Date:

21 June 2023, 16:14:32

On Wed, 2023-06-21 at 15:28 +0200, Daniel Verite wrote:
> At a conference this week I was asked if ICU could be able to
> sort like EBCDIC [2]. It turns out it has been already  asked on
> -general a few years ago [3] with no satisfactory answer at the time
> ,
> and that it can be implemented with rules in v16.

Interesting, thank you!

> This can be useful for people who migrate from mainframes to Postgres
> and need their migration tests to produce the same sorted results as
> the
> original system.
> Since rules can be defined at the database level with the icu_rules
> option,
> they don't even need to tweak their queries to add COLLATE clauses,
> which surely is appreciable in that kind of project.

I still had some technical concerns about the ICU rules feature,
unfortunately, and one option is to only allow it for the collation
objects and not the database level collation. How much would that hurt
this use case?

> I'm open to suggestions on whether this EBCDIC example is worth being
> in the
> doc in some form or putting this in the wiki would be good enough.

I like the idea of having a real example. Ideally, we could add some
explanation along the way about how the rule is constructed to match
EBCDIC, which would reduce the shock of a long rule like that.

I wonder why the rule syntax is such that it cannot be broken up? Would
it be incorrect for us to allow some whitespace in there?

Regards,
    Jeff Davis

Re: EBCDIC sorting as a use case for ICU rules

From

"Jonathan S. Katz"

Date:

21 June 2023, 17:13:01

On 6/21/23 12:14 PM, Jeff Davis wrote:
> On Wed, 2023-06-21 at 15:28 +0200, Daniel Verite wrote:
>> At a conference this week I was asked if ICU could be able to
>> sort like EBCDIC [2]. It turns out it has been already  asked on
>> -general a few years ago [3] with no satisfactory answer at the time
>> ,
>> and that it can be implemented with rules in v16.
> 
> Interesting, thank you!

+1 -- this is very helpful framing the problem, thank you!

>> This can be useful for people who migrate from mainframes to Postgres
>> and need their migration tests to produce the same sorted results as
>> the
>> original system.
>> Since rules can be defined at the database level with the icu_rules
>> option,
>> they don't even need to tweak their queries to add COLLATE clauses,
>> which surely is appreciable in that kind of project.
> 
> I still had some technical concerns about the ICU rules feature,
> unfortunately, and one option is to only allow it for the collation
> objects and not the database level collation. How much would that hurt
> this use case?
> 
> 
>> I'm open to suggestions on whether this EBCDIC example is worth being
>> in the
>> doc in some form or putting this in the wiki would be good enough.
> 
> I like the idea of having a real example. Ideally, we could add some
> explanation along the way about how the rule is constructed to match
> EBCDIC, which would reduce the shock of a long rule like that.
> 
> I wonder why the rule syntax is such that it cannot be broken up? Would
> it be incorrect for us to allow some whitespace in there?

I'll give the unhelpful comment of "yes, I agree we should have a real 
world example", especially one that seems relevant to helping more 
people adopt PostgreSQL.

Attachment

OpenPGP_signature

Re: EBCDIC sorting as a use case for ICU rules

From

"Daniel Verite"

Date:

30 June 2023, 11:08:45

    Jeff Davis wrote:

> I still had some technical concerns about the ICU rules feature,
> unfortunately, and one option is to only allow it for the collation
> objects and not the database level collation. How much would that hurt
> this use case?

For a regression test suite that should produce results with the custom
order, not being able to configure the sort rules at the db level means
that you'd have to change all the queries to add explicit COLLATE clauses.
I guess that could be quite annoying if the test suite is large.

About making a doc patch from this, I've came up with the attached,
which generates a CREATE COLLATION statement with rules from an
arbitrary strings that just lists characters in whichever order is desired.

In the case of EBCDIC and code page 37, it turns out that there are
several versions of "code page 37", with more or less additions of
characters outside the US-ASCII range. This is why I decided
to show code that generates the rules rather than an already generated
rule. Users may simply change the codepage_37 string in the code
to add or rearrange any characters.

Also the patch makes the relevant sections of "CREATE COLLATION" and
"CREATE DATABASE" point to "Collation Support" with the idea to
centralize the information on tailoring rules.

I'll add this to the next CF.

Best regards,
--
Daniel Vérité
https://postgresql.verite.pro/
Twitter: @DanielVerite

Attachment

collation-icu-rules.patch

Re: EBCDIC sorting as a use case for ICU rules

From

Peter Eisentraut

Date:

06 July 2023, 09:32:32

On 21.06.23 15:28, Daniel Verite wrote:
> A collation like the following this seems to work (the rule simply enumerates
> US-ASCII letters in the EBCDIC alphabet order, with adequate quoting)
> 
> CREATE COLLATION ebcdic (provider='icu', locale='und',
> rules=$$&'
>
'<'.'<'<'<'('<'+'<\|<'&'<'!'<'$'<'*'<')'<';'<'-'<'/'<','<'%'<'_'<'>'<'?'<'`'<':'<'#'<'@'<\'<'='<'"'<a<b<c<d<e<f<g<h<i<j<k<l<m<n<o<p<q<r<'~'<s<t<u<v<w<x<y<z<'['<'^'<']'<'{'<A<B<C<D<E<F<G<H<I<'}'<J<K<L<M<N<O<P<Q<R<'\'<S<T<U<V<W<X<Y<Z<0<1<2<3<4<5<6<7<8<9$$);
> 
> This can be useful for people who migrate from mainframes to Postgres
> and need their migration tests to produce the same sorted results as the
> original system.
> Since rules can be defined at the database level with the icu_rules option,
> they don't even need to tweak their queries to add COLLATE clauses,
> which surely is appreciable in that kind of project.
> 
> US-ASCII when sorted in EBCDIC order comes out like this:
> 
> .<(+|&!$*);-/,%_>?`:#@'="abcdefghijklmnopqr~stuvwxyz[^]{ABCDEFGHI}JKLMNOPQR\ST
> UVWXYZ0123456789
> 
> Maybe this example could be added to the documentation except for
> the problem that the rule is very long and dollar-quoting cannot be split
> into several lines. Literals enclosed by single quotes can be split that
> way, but would require escaping the single quotes in the rule, which
> would lead to scary-looking over-quoted contents.

You can use whitespace in the rules.  For example,

CREATE COLLATION ebcdic (provider='icu', locale='und',
rules=$$
& ' ' < '.' < '<' < '(' < '+' < \|
< '&' < '!' < '$' < '*' < ')' < ';'
< '-' < '/' < ',' < '%' < '_' < '>' < '?'
< '`' < ':' < '#' < '@' < \' < '=' < '"'
< a < b < c < d < e < f < g < h < i
< j < k < l < m < n < o < p < q < r
< '~' < s < t < u < v < w < x < y < z
< '[' < '^' < ']'
< '{' < A < B < C < D < E < F < G < H < I
< '}' < J < K < L < M < N < O < P < Q < R
< '\'  < S < T < U < V < W < X < Y < Z
< 0 < 1 < 2 < 3 < 4 < 5 < 6 < 7 < 8 < 9
$$);

(This particular layout is meant to match the rows in
https://en.wikipedia.org/wiki/EBCDIC#Code_page_layout.)

Re: EBCDIC sorting as a use case for ICU rules

From

Peter Eisentraut

Date:

06 July 2023, 09:35:41

On 30.06.23 13:08, Daniel Verite wrote:
> About making a doc patch from this, I've came up with the attached,
> which generates a CREATE COLLATION statement with rules from an
> arbitrary strings that just lists characters in whichever order is desired.

I like adding more documentation and links around this.  But I'm not 
sure how this code you are including is supposed to help users 
understand the rules language.  Effectively, this would be adding 
another rules mechanism on top of the existing one, but doesn't explain 
either one.

Re: EBCDIC sorting as a use case for ICU rules

From

Jeff Davis

Date:

06 July 2023, 18:14:49

On Thu, 2023-07-06 at 11:32 +0200, Peter Eisentraut wrote:
> CREATE COLLATION ebcdic (provider='icu', locale='und',
> rules=$$
> & ' ' < '.' < '<' < '(' < '+' < \|
> < '&' < '!' < '$' < '*' < ')' < ';'
> < '-' < '/' < ',' < '%' < '_' < '>' < '?'
> < '`' < ':' < '#' < '@' < \' < '=' < '"'
> < a < b < c < d < e < f < g < h < i
> < j < k < l < m < n < o < p < q < r
> < '~' < s < t < u < v < w < x < y < z
> < '[' < '^' < ']'
> < '{' < A < B < C < D < E < F < G < H < I
> < '}' < J < K < L < M < N < O < P < Q < R
> < '\'  < S < T < U < V < W < X < Y < Z
> < 0 < 1 < 2 < 3 < 4 < 5 < 6 < 7 < 8 < 9
> $$);

That looks much nicer and would go nicely in the documentation along
with some explanation.

Regards,
    Jeff Davis

Re: EBCDIC sorting as a use case for ICU rules

From

"Daniel Verite"

Date:

17 July 2023, 08:10:19

    Peter Eisentraut wrote:

> You can use whitespace in the rules.  For example,
>
> CREATE COLLATION ebcdic (provider='icu', locale='und',
> rules=$$

Nice, it's clearly better that the piece of code I had in the
previous patch.
It can also be made more compact by grouping consecutive
code points, for instance <*a-r for 'a' to 'r'
I changed it that way, and also moved '^' before '[' and ']',
since according to [1], '^' is at location 0xB0 and '[' and ']'
at 0xBA and 0xBB.

Updated patch attached.

[1] https://en.wikipedia.org/wiki/EBCDIC#Code_page_layout

Best regards,
--
Daniel Vérité
https://postgresql.verite.pro/
Twitter: @DanielVerite

Attachment

collation-icu-rules-v2.patch

Re: EBCDIC sorting as a use case for ICU rules

From

Peter Eisentraut

Date:

23 August 2023, 09:30:52

On 17.07.23 10:10, Daniel Verite wrote:
>     Peter Eisentraut wrote:
> 
>> You can use whitespace in the rules.  For example,
>>
>> CREATE COLLATION ebcdic (provider='icu', locale='und',
>> rules=$$
> 
> Nice, it's clearly better that the piece of code I had in the
> previous patch.
> It can also be made more compact by grouping consecutive
> code points, for instance <*a-r for 'a' to 'r'
> I changed it that way, and also moved '^' before '[' and ']',
> since according to [1], '^' is at location 0xB0 and '[' and ']'
> at 0xBA and 0xBB.
> 
> Updated patch attached.

Committed with some editing.  I moved the existing rules example from 
the CREATE COLLATION page into the new section you created, so we have a 
simple example followed by the complex example.

Re: EBCDIC sorting as a use case for ICU rules

From

Jehan-Guillaume de Rorthais

Date:

24 August 2023, 14:26:53

Hi,

Sorry to chime in so lately, I was waiting for some customer feedback.

On Wed, 21 Jun 2023 15:28:38 +0200
"Daniel Verite" <daniel@manitou-mail.org> wrote:

> At a conference this week I was asked if ICU could be able to
> sort like EBCDIC [2].
> It turns out it has been already    asked on
> -general a few years ago [3] with no satisfactory answer at the time ,
> and that it can be implemented with rules in v16.

We worked with a customer few months ago about this question and end up with a
procedure to build new locale/collation for glibc and load them in PostgreSQL
[1].

Our customer built the fr_ebcdic locale file themselves, based on the EBCDIC
IBM500 codepage (including about the same characters than iso 8859-1) and share
it under the BY-CC licence. See in attachment.

The procedure is quite simple:

1. copy this file under "/usr/share/i18n/locales/fr_ebcdic"
2. build it using "localedef -c -i fr_ebcdic -f UTF-8 fr_ebcdic.UTF-8"
3. restart your PostgreSQL instance (because of localeset weird behavior)
4. "pg_import_system_collations('schema')" or create the collation, eg.:
   CREATE COLLATION fr_ebcdic (
     PROVIDER = libc,
     LC_COLLATE = fr_ebcdic.utf8,
     LC_CTYPE = fr_ebcdic.utf8
   );

Now, same question than for the ICU: do we want to provide documentation about
this? Online documentation about such feature are quite arid. In fact, this
could be useful in various other way than just EBCDIC.

Regards,

[1] https://www.postgresql.org/message-id/20230209144947.1dfad6c0%40karst

Attachment

fr_ebcdic

Re: EBCDIC sorting as a use case for ICU rules

From

"Daniel Verite"

Date:

30 August 2023, 16:40:45

    Peter Eisentraut wrote:

> Committed with some editing.  I moved the existing rules example from
> the CREATE COLLATION page into the new section you created, so we have a
> simple example followed by the complex example.

OK, thanks for pushing this!


Best regards,
--
Daniel Vérité
https://postgresql.verite.pro/
Twitter: @DanielVerite