Thread: EBCDIC sorting as a use case for ICU rules
Hi, In the "Order changes in PG16 since ICU introduction" discussion, one sub-thread [1] was about having a credible use case for tailoring collations with custom rules, a new feature in v16. At a conference this week I was asked if ICU could be able to sort like EBCDIC [2]. It turns out it has been already asked on -general a few years ago [3] with no satisfactory answer at the time , and that it can be implemented with rules in v16. A collation like the following this seems to work (the rule simply enumerates US-ASCII letters in the EBCDIC alphabet order, with adequate quoting) CREATE COLLATION ebcdic (provider='icu', locale='und', rules=$$&' '<'.'<'<'<'('<'+'<\|<'&'<'!'<'$'<'*'<')'<';'<'-'<'/'<','<'%'<'_'<'>'<'?'<'`'<':'<'#'<'@'<\'<'='<'"'<a<b<c<d<e<f<g<h<i<j<k<l<m<n<o<p<q<r<'~'<s<t<u<v<w<x<y<z<'['<'^'<']'<'{'<A<B<C<D<E<F<G<H<I<'}'<J<K<L<M<N<O<P<Q<R<'\'<S<T<U<V<W<X<Y<Z<0<1<2<3<4<5<6<7<8<9$$); This can be useful for people who migrate from mainframes to Postgres and need their migration tests to produce the same sorted results as the original system. Since rules can be defined at the database level with the icu_rules option, they don't even need to tweak their queries to add COLLATE clauses, which surely is appreciable in that kind of project. US-ASCII when sorted in EBCDIC order comes out like this: .<(+|&!$*);-/,%_>?`:#@'="abcdefghijklmnopqr~stuvwxyz[^]{ABCDEFGHI}JKLMNOPQR\ST UVWXYZ0123456789 Maybe this example could be added to the documentation except for the problem that the rule is very long and dollar-quoting cannot be split into several lines. Literals enclosed by single quotes can be split that way, but would require escaping the single quotes in the rule, which would lead to scary-looking over-quoted contents. I'm open to suggestions on whether this EBCDIC example is worth being in the doc in some form or putting this in the wiki would be good enough. [1] https://www.postgresql.org/message-id/flat/a28aba5fa6bf1abfff96e40b6d6acff8412edb15.camel%40j-davis.com [2] https://en.wikipedia.org/wiki/EBCDIC [3] https://www.postgresql.org/message-id/flat/0A3221C70F24FB45833433255569204D1F84A7AD%40G01JPEXMBYT05 Best regards, -- Daniel Vérité https://postgresql.verite.pro/ Twitter: @DanielVerite
On 6/21/23 09:28, Daniel Verite wrote: > In the "Order changes in PG16 since ICU introduction" discussion, one > sub-thread [1] was about having a credible use case for tailoring collations > with custom rules, a new feature in v16. > > At a conference this week I was asked if ICU could be able to > sort like EBCDIC [2]. It turns out it has been already asked on > -general a few years ago [3] with no satisfactory answer at the time , > and that it can be implemented with rules in v16. Oh, very cool! I have seen the requirement for EBCDIC come up multiple times over the years. <snip> > Maybe this example could be added to the documentation except for > the problem that the rule is very long and dollar-quoting cannot be split > into several lines. Literals enclosed by single quotes can be split that > way, but would require escaping the single quotes in the rule, which > would lead to scary-looking over-quoted contents. > > I'm open to suggestions on whether this EBCDIC example is worth being in the > doc in some form or putting this in the wiki would be good enough. I would definitely favor adding to the docs, but no idea how to deal with the length issue. -- Joe Conway PostgreSQL Contributors Team RDS Open Source Databases Amazon Web Services: https://aws.amazon.com
On Wed, 2023-06-21 at 15:28 +0200, Daniel Verite wrote: > At a conference this week I was asked if ICU could be able to > sort like EBCDIC [2]. It turns out it has been already asked on > -general a few years ago [3] with no satisfactory answer at the time > , > and that it can be implemented with rules in v16. Interesting, thank you! > This can be useful for people who migrate from mainframes to Postgres > and need their migration tests to produce the same sorted results as > the > original system. > Since rules can be defined at the database level with the icu_rules > option, > they don't even need to tweak their queries to add COLLATE clauses, > which surely is appreciable in that kind of project. I still had some technical concerns about the ICU rules feature, unfortunately, and one option is to only allow it for the collation objects and not the database level collation. How much would that hurt this use case? > I'm open to suggestions on whether this EBCDIC example is worth being > in the > doc in some form or putting this in the wiki would be good enough. I like the idea of having a real example. Ideally, we could add some explanation along the way about how the rule is constructed to match EBCDIC, which would reduce the shock of a long rule like that. I wonder why the rule syntax is such that it cannot be broken up? Would it be incorrect for us to allow some whitespace in there? Regards, Jeff Davis
On 6/21/23 12:14 PM, Jeff Davis wrote: > On Wed, 2023-06-21 at 15:28 +0200, Daniel Verite wrote: >> At a conference this week I was asked if ICU could be able to >> sort like EBCDIC [2]. It turns out it has been already asked on >> -general a few years ago [3] with no satisfactory answer at the time >> , >> and that it can be implemented with rules in v16. > > Interesting, thank you! +1 -- this is very helpful framing the problem, thank you! >> This can be useful for people who migrate from mainframes to Postgres >> and need their migration tests to produce the same sorted results as >> the >> original system. >> Since rules can be defined at the database level with the icu_rules >> option, >> they don't even need to tweak their queries to add COLLATE clauses, >> which surely is appreciable in that kind of project. > > I still had some technical concerns about the ICU rules feature, > unfortunately, and one option is to only allow it for the collation > objects and not the database level collation. How much would that hurt > this use case? > > >> I'm open to suggestions on whether this EBCDIC example is worth being >> in the >> doc in some form or putting this in the wiki would be good enough. > > I like the idea of having a real example. Ideally, we could add some > explanation along the way about how the rule is constructed to match > EBCDIC, which would reduce the shock of a long rule like that. > > I wonder why the rule syntax is such that it cannot be broken up? Would > it be incorrect for us to allow some whitespace in there? I'll give the unhelpful comment of "yes, I agree we should have a real world example", especially one that seems relevant to helping more people adopt PostgreSQL.
Attachment
Jeff Davis wrote: > I still had some technical concerns about the ICU rules feature, > unfortunately, and one option is to only allow it for the collation > objects and not the database level collation. How much would that hurt > this use case? For a regression test suite that should produce results with the custom order, not being able to configure the sort rules at the db level means that you'd have to change all the queries to add explicit COLLATE clauses. I guess that could be quite annoying if the test suite is large. About making a doc patch from this, I've came up with the attached, which generates a CREATE COLLATION statement with rules from an arbitrary strings that just lists characters in whichever order is desired. In the case of EBCDIC and code page 37, it turns out that there are several versions of "code page 37", with more or less additions of characters outside the US-ASCII range. This is why I decided to show code that generates the rules rather than an already generated rule. Users may simply change the codepage_37 string in the code to add or rearrange any characters. Also the patch makes the relevant sections of "CREATE COLLATION" and "CREATE DATABASE" point to "Collation Support" with the idea to centralize the information on tailoring rules. I'll add this to the next CF. Best regards, -- Daniel Vérité https://postgresql.verite.pro/ Twitter: @DanielVerite
Attachment
On 21.06.23 15:28, Daniel Verite wrote: > A collation like the following this seems to work (the rule simply enumerates > US-ASCII letters in the EBCDIC alphabet order, with adequate quoting) > > CREATE COLLATION ebcdic (provider='icu', locale='und', > rules=$$&' > '<'.'<'<'<'('<'+'<\|<'&'<'!'<'$'<'*'<')'<';'<'-'<'/'<','<'%'<'_'<'>'<'?'<'`'<':'<'#'<'@'<\'<'='<'"'<a<b<c<d<e<f<g<h<i<j<k<l<m<n<o<p<q<r<'~'<s<t<u<v<w<x<y<z<'['<'^'<']'<'{'<A<B<C<D<E<F<G<H<I<'}'<J<K<L<M<N<O<P<Q<R<'\'<S<T<U<V<W<X<Y<Z<0<1<2<3<4<5<6<7<8<9$$); > > This can be useful for people who migrate from mainframes to Postgres > and need their migration tests to produce the same sorted results as the > original system. > Since rules can be defined at the database level with the icu_rules option, > they don't even need to tweak their queries to add COLLATE clauses, > which surely is appreciable in that kind of project. > > US-ASCII when sorted in EBCDIC order comes out like this: > > .<(+|&!$*);-/,%_>?`:#@'="abcdefghijklmnopqr~stuvwxyz[^]{ABCDEFGHI}JKLMNOPQR\ST > UVWXYZ0123456789 > > Maybe this example could be added to the documentation except for > the problem that the rule is very long and dollar-quoting cannot be split > into several lines. Literals enclosed by single quotes can be split that > way, but would require escaping the single quotes in the rule, which > would lead to scary-looking over-quoted contents. You can use whitespace in the rules. For example, CREATE COLLATION ebcdic (provider='icu', locale='und', rules=$$ & ' ' < '.' < '<' < '(' < '+' < \| < '&' < '!' < '$' < '*' < ')' < ';' < '-' < '/' < ',' < '%' < '_' < '>' < '?' < '`' < ':' < '#' < '@' < \' < '=' < '"' < a < b < c < d < e < f < g < h < i < j < k < l < m < n < o < p < q < r < '~' < s < t < u < v < w < x < y < z < '[' < '^' < ']' < '{' < A < B < C < D < E < F < G < H < I < '}' < J < K < L < M < N < O < P < Q < R < '\' < S < T < U < V < W < X < Y < Z < 0 < 1 < 2 < 3 < 4 < 5 < 6 < 7 < 8 < 9 $$); (This particular layout is meant to match the rows in https://en.wikipedia.org/wiki/EBCDIC#Code_page_layout.)
On 30.06.23 13:08, Daniel Verite wrote: > About making a doc patch from this, I've came up with the attached, > which generates a CREATE COLLATION statement with rules from an > arbitrary strings that just lists characters in whichever order is desired. I like adding more documentation and links around this. But I'm not sure how this code you are including is supposed to help users understand the rules language. Effectively, this would be adding another rules mechanism on top of the existing one, but doesn't explain either one.
On Thu, 2023-07-06 at 11:32 +0200, Peter Eisentraut wrote: > CREATE COLLATION ebcdic (provider='icu', locale='und', > rules=$$ > & ' ' < '.' < '<' < '(' < '+' < \| > < '&' < '!' < '$' < '*' < ')' < ';' > < '-' < '/' < ',' < '%' < '_' < '>' < '?' > < '`' < ':' < '#' < '@' < \' < '=' < '"' > < a < b < c < d < e < f < g < h < i > < j < k < l < m < n < o < p < q < r > < '~' < s < t < u < v < w < x < y < z > < '[' < '^' < ']' > < '{' < A < B < C < D < E < F < G < H < I > < '}' < J < K < L < M < N < O < P < Q < R > < '\' < S < T < U < V < W < X < Y < Z > < 0 < 1 < 2 < 3 < 4 < 5 < 6 < 7 < 8 < 9 > $$); That looks much nicer and would go nicely in the documentation along with some explanation. Regards, Jeff Davis
Peter Eisentraut wrote: > You can use whitespace in the rules. For example, > > CREATE COLLATION ebcdic (provider='icu', locale='und', > rules=$$ Nice, it's clearly better that the piece of code I had in the previous patch. It can also be made more compact by grouping consecutive code points, for instance <*a-r for 'a' to 'r' I changed it that way, and also moved '^' before '[' and ']', since according to [1], '^' is at location 0xB0 and '[' and ']' at 0xBA and 0xBB. Updated patch attached. [1] https://en.wikipedia.org/wiki/EBCDIC#Code_page_layout Best regards, -- Daniel Vérité https://postgresql.verite.pro/ Twitter: @DanielVerite
Attachment
On 17.07.23 10:10, Daniel Verite wrote: > Peter Eisentraut wrote: > >> You can use whitespace in the rules. For example, >> >> CREATE COLLATION ebcdic (provider='icu', locale='und', >> rules=$$ > > Nice, it's clearly better that the piece of code I had in the > previous patch. > It can also be made more compact by grouping consecutive > code points, for instance <*a-r for 'a' to 'r' > I changed it that way, and also moved '^' before '[' and ']', > since according to [1], '^' is at location 0xB0 and '[' and ']' > at 0xBA and 0xBB. > > Updated patch attached. Committed with some editing. I moved the existing rules example from the CREATE COLLATION page into the new section you created, so we have a simple example followed by the complex example.
Hi, Sorry to chime in so lately, I was waiting for some customer feedback. On Wed, 21 Jun 2023 15:28:38 +0200 "Daniel Verite" <daniel@manitou-mail.org> wrote: > At a conference this week I was asked if ICU could be able to > sort like EBCDIC [2]. > It turns out it has been already asked on > -general a few years ago [3] with no satisfactory answer at the time , > and that it can be implemented with rules in v16. We worked with a customer few months ago about this question and end up with a procedure to build new locale/collation for glibc and load them in PostgreSQL [1]. Our customer built the fr_ebcdic locale file themselves, based on the EBCDIC IBM500 codepage (including about the same characters than iso 8859-1) and share it under the BY-CC licence. See in attachment. The procedure is quite simple: 1. copy this file under "/usr/share/i18n/locales/fr_ebcdic" 2. build it using "localedef -c -i fr_ebcdic -f UTF-8 fr_ebcdic.UTF-8" 3. restart your PostgreSQL instance (because of localeset weird behavior) 4. "pg_import_system_collations('schema')" or create the collation, eg.: CREATE COLLATION fr_ebcdic ( PROVIDER = libc, LC_COLLATE = fr_ebcdic.utf8, LC_CTYPE = fr_ebcdic.utf8 ); Now, same question than for the ICU: do we want to provide documentation about this? Online documentation about such feature are quite arid. In fact, this could be useful in various other way than just EBCDIC. Regards, [1] https://www.postgresql.org/message-id/20230209144947.1dfad6c0%40karst
Attachment
Peter Eisentraut wrote: > Committed with some editing. I moved the existing rules example from > the CREATE COLLATION page into the new section you created, so we have a > simple example followed by the complex example. OK, thanks for pushing this! Best regards, -- Daniel Vérité https://postgresql.verite.pro/ Twitter: @DanielVerite