Thread: Is it possible to sort strings in EBCDIC order in PostgreSQL server?
Is it possible to sort strings in EBCDIC order in PostgreSQL server?
From
"Tsunakawa, Takayuki"
Date:
Hello, It is embarrassing to ask such a thing, but is there any way to sort character column values in EBCDIC order inside the PostgreSQLserver? i.e. is it possible to use EBCDIC collation in PostgreSQL, say, by using ICU or something? We need torun on a certain mainframe. I see almost no hope from the following, but let me make sure. https://www.postgresql.org/message-id/flat/28548.1424884373%40sss.pgh.pa.us#28548.1424884373@sss.pgh.pa.us "Hmm ... EBCDIC ... is that mandatory? Because there are a pretty large number of ASCII dependencies in PG, mostly arising from our support of multibyte character sets, which are all expected to be ASCII supersets. If there's not a way to run the server process in ASCII-land, you're likely dead in the water. I would not want to try to ferret out all the dependencies." Regards Takayuki Tsunakawa
Tsunakawa, Takayuki wrote: > It is embarrassing to ask such a thing, but is there any way to sort character column values in EBCDIC order inside thePostgreSQL server? > i.e. is it possible to use EBCDIC collation in PostgreSQL, say, by using ICU or something? We need to run on a certainmainframe. > > I see almost no hope from the following, but let me make sure. That should not be too difficult. PostgreSQL is extensible! Just define operators that implement <, <=, =, >= and > for EBCDIC, create an operator class for these (for indexing) and use ORDER BY ... USING. Yours, Laurenz Albe
"Tsunakawa, Takayuki" <tsunakawa.takay@jp.fujitsu.com> writes: > It is embarrassing to ask such a thing, but is there any way to sort character column values in EBCDIC order inside thePostgreSQL server? i.e. is it possible to use EBCDIC collation in PostgreSQL, say, by using ICU or something? We needto run on a certain mainframe. > I see almost no hope from the following, but let me make sure. > https://www.postgresql.org/message-id/flat/28548.1424884373%40sss.pgh.pa.us#28548.1424884373@sss.pgh.pa.us > "Hmm ... EBCDIC ... is that mandatory? Because there are a pretty large > number of ASCII dependencies in PG, mostly arising from our support of > multibyte character sets, which are all expected to be ASCII supersets. > If there's not a way to run the server process in ASCII-land, you're > likely dead in the water. I would not want to try to ferret out all > the dependencies." As I recall, that question was about converting the server to use EBCDIC strings everywhere. That seems impractical for the reasons I mentioned. But you could probably sort according to EBCDIC order if you were willing to accept some inefficiency: internal to the comparison function, convert each string to EBCDIC on each call. Another way you could imagine is to store EBCDIC strings embedded in bytea values. regards, tom lane
On Sun, Dec 10, 2017 at 11:09 PM, Tsunakawa, Takayuki <tsunakawa.takay@jp.fujitsu.com> wrote: > It is embarrassing to ask such a thing, but is there any way to sort character column values in EBCDIC order inside thePostgreSQL server? i.e. is it possible to use EBCDIC collation in PostgreSQL, say, by using ICU or something? We needto run on a certain mainframe. There is no need for embarrassment. I think that you mean that you'd like to get behavior equivalent to the "C" locale with an EBCDIC encoding. Right? I think that you could use ICU to do this with some work. ICU collations are not tied to a particular encoding, unlike the libc collations (they support most but not all collations). That said, the idea of an "EBCDIC collation" seems limiting. Why should a system like DB2 for the mainframe (that happens to use EBCDIC as its encoding) not have a more natural, human-orientated collation even while using EBCDIC? ISTM that the point of using the "C" locale (with EBDIC or with UTF-8 or with any other encoding) is to get a performance benefit where the actual collation's behavior doesn't matter much to users. Are you sure it's really important to be *exactly* compatible with EBCDIC order? As long as you're paying for a custom collation, why not just use a collation that is helpful to humans? -- Peter Geoghegan
RE: Is it possible to sort strings in EBCDIC order in PostgreSQLserver?
From
"Tsunakawa, Takayuki"
Date:
Hi Laurenz, Tom, Peter, Thanks for your suggestions. The practical solution seems to be to override comparison operators of char, varchar and textdata types with UDFs that behave as Tom mentioned. From: Peter Geoghegan [mailto:pg@bowt.ie] > That said, the idea of an "EBCDIC collation" seems limiting. Why > should a system like DB2 for the mainframe (that happens to use EBCDIC > as its encoding) not have a more natural, human-orientated collation > even while using EBCDIC? ISTM that the point of using the "C" locale > (with EBDIC or with UTF-8 or with any other encoding) is to get a > performance benefit where the actual collation's behavior doesn't > matter much to users. Are you sure it's really important to be > *exactly* compatible with EBCDIC order? As long as you're paying for a > custom collation, why not just use a collation that is helpful to > humans? You are right. I'd like to ask the customer whether and why they need EBCDIC ordering. Regards Takayuki Tsunakawa
Hi Laurenz, Tom, Peter,
Thanks for your suggestions. The practical solution seems to be to override comparison operators of char, varchar and text data types with UDFs that behave as Tom mentioned.
From: Peter Geoghegan [mailto:pg@bowt.ie]
> That said, the idea of an "EBCDIC collation" seems limiting. Why
> should a system like DB2 for the mainframe (that happens to use EBCDIC
> as its encoding) not have a more natural, human-orientated collation
> even while using EBCDIC? ISTM that the point of using the "C" locale
> (with EBDIC or with UTF-8 or with any other encoding) is to get a
> performance benefit where the actual collation's behavior doesn't
> matter much to users. Are you sure it's really important to be
> *exactly* compatible with EBCDIC order? As long as you're paying for a
> custom collation, why not just use a collation that is helpful to
> humans?
You are right. I'd like to ask the customer whether and why they need EBCDIC ordering.
This is a guess on my part, based on many years on an EBCDIC system. But I'll bet that they are doing a conversion off of the EBCDIC system (maybe Db2 on z/OS) to an ASCII system (Linux or Windows) running PostgreSQL. They want to be able to compare the output from the existing system to the output on the new system. EBCDIC orders "lower case", "upper case", then "digits". The default C locale on Linux (I don't know Windows) will sort "digits", then alphabetic with the lower then upper case of each letter in order like: "aAbB...zZ". Comparing identical data which is not presented in exactly the same order would be very difficult.
Regards
Takayuki Tsunakawa
I have a theory that it's impossible to prove anything, but I can't prove it.
Maranatha! <><
John McKown
John McKown
The default C locale on Linux (I don't know Windows) will sort "digits", then alphabetic with the lower then upper case of each letter in order like: "aAbB...zZ"
That's no true at all! The C locales are 0-9A-Za-z
#include <locale.h>
#include <stddef.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
static int myCompare (const void * a, const void * b)
{
return strcmp (*(const char **) a, *(const char **) b);
}
void with_other_locale (char *new_locale,
const char **arr,
const int n
)
{
char *old_locale, *saved_locale;
/* Get the name of the current locale. */
old_locale = setlocale (LC_ALL, NULL);
/* Copy the name so it won’t be clobbered by setlocale. */
saved_locale = strdup (old_locale);
if (strlen(new_locale) == 0)
{
new_locale = saved_locale;
}
/* Now change the locale and do some stuff with it. */
setlocale (LC_ALL, new_locale);
qsort (arr, n, sizeof (const char *), myCompare);
printf("\nSorted array in locale %s is\n", new_locale);
for (int i = 0; i < n; i++)
printf("%d: %s \n", i, arr[i]);
/* Restore the original locale. */
setlocale (LC_ALL, saved_locale);
free (saved_locale);
}
int main ()
{
const char *arr[] = {"Jim", "job", "Anne", "aardvark", "Isaac", "island", "12 Days of Christmas", "12 drummers"};
int n = sizeof(arr)/sizeof(arr[0]);
int i;
printf("Given array is\n");
for (i = 0; i < n; i++)
printf("%d: %s \n", i, arr[i]);
with_other_locale("", arr, n);
with_other_locale("C", arr, n);
with_other_locale("en_US.UTF-8", arr, n);
with_other_locale("UTF-8", arr, n);
return 0;
}
Gives
Given array is
0: Jim
1: job
2: Anne
3: aardvark
4: Isaac
5: island
6: 12 Days of Christmas
7: 12 drummers
Sorted array in locale C is
0: 12 Days of Christmas
1: 12 drummers
2: Anne
3: Isaac
4: Jim
5: aardvark
6: island
7: job
Sorted array in locale C is
0: 12 Days of Christmas
1: 12 drummers
2: Anne
3: Isaac
4: Jim
5: aardvark
6: island
7: job
Sorted array in locale en_US.UTF-8 is
0: 12 Days of Christmas
1: 12 drummers
2: Anne
3: Isaac
4: Jim
5: aardvark
6: island
7: job
Sorted array in locale UTF-8 is
0: 12 Days of Christmas
1: 12 drummers
2: Anne
3: Isaac
4: Jim
5: aardvark
6: island
7: job
0: Jim
1: job
2: Anne
3: aardvark
4: Isaac
5: island
6: 12 Days of Christmas
7: 12 drummers
Sorted array in locale C is
0: 12 Days of Christmas
1: 12 drummers
2: Anne
3: Isaac
4: Jim
5: aardvark
6: island
7: job
Sorted array in locale C is
0: 12 Days of Christmas
1: 12 drummers
2: Anne
3: Isaac
4: Jim
5: aardvark
6: island
7: job
Sorted array in locale en_US.UTF-8 is
0: 12 Days of Christmas
1: 12 drummers
2: Anne
3: Isaac
4: Jim
5: aardvark
6: island
7: job
Sorted array in locale UTF-8 is
0: 12 Days of Christmas
1: 12 drummers
2: Anne
3: Isaac
4: Jim
5: aardvark
6: island
7: job
I actually don't think there is a case-insensitive locale (bydefault?) on a unix machine.
Jim
On Tue, Dec 12, 2017 at 8:18 AM, John McKown <john.archie.mckown@gmail.com> wrote:
Hi Laurenz, Tom, Peter,
Thanks for your suggestions. The practical solution seems to be to override comparison operators of char, varchar and text data types with UDFs that behave as Tom mentioned.
From: Peter Geoghegan [mailto:pg@bowt.ie]
> That said, the idea of an "EBCDIC collation" seems limiting. Why
> should a system like DB2 for the mainframe (that happens to use EBCDIC
> as its encoding) not have a more natural, human-orientated collation
> even while using EBCDIC? ISTM that the point of using the "C" locale
> (with EBDIC or with UTF-8 or with any other encoding) is to get a
> performance benefit where the actual collation's behavior doesn't
> matter much to users. Are you sure it's really important to be
> *exactly* compatible with EBCDIC order? As long as you're paying for a
> custom collation, why not just use a collation that is helpful to
> humans?
You are right. I'd like to ask the customer whether and why they need EBCDIC ordering.This is a guess on my part, based on many years on an EBCDIC system. But I'll bet that they are doing a conversion off of the EBCDIC system (maybe Db2 on z/OS) to an ASCII system (Linux or Windows) running PostgreSQL. They want to be able to compare the output from the existing system to the output on the new system. EBCDIC orders "lower case", "upper case", then "digits". The default C locale on Linux (I don't know Windows) will sort "digits", then alphabetic with the lower then upper case of each letter in order like: "aAbB...zZ". Comparing identical data which is not presented in exactly the same order would be very difficult.
Regards
Takayuki Tsunakawa
--I have a theory that it's impossible to prove anything, but I can't prove it.Maranatha! <><
John McKown
The default C locale on Linux (I don't know Windows) will sort "digits", then alphabetic with the lower then upper case of each letter in order like: "aAbB...zZ"That's no true at all! The C locales are 0-9A-Za-z
Thanks for the correction. Turns out that I forgot that my default locale on Linux was en_US.utf8.
I have a theory that it's impossible to prove anything, but I can't prove it.
Maranatha! <><
John McKown
John McKown
en_US.utf8. is still 0-9A-Za-z and in my example set (as it's my default too :))
You'd need a case insensitive collation to do what you described, and I'm not sure those exist in postgres. (I guess you could always build your own if you _really_ wanted to.
Jim
On Tue, Dec 12, 2017 at 10:24 AM, John McKown <john.archie.mckown@gmail.com> wrote:
The default C locale on Linux (I don't know Windows) will sort "digits", then alphabetic with the lower then upper case of each letter in order like: "aAbB...zZ"That's no true at all! The C locales are 0-9A-Za-zThanks for the correction. Turns out that I forgot that my default locale on Linux was en_US.utf8.--I have a theory that it's impossible to prove anything, but I can't prove it.Maranatha! <><
John McKown
Sorry for spamming the list. It appears that I'm an idiot. Sorry :(
jim=# select * from test order by a collate "C";
a
----------------------
12 Days of Christmas
12 drummers
Anne
Isaac
Jim
a
aardvark
b
island
job
(10 rows)
a
----------------------
12 Days of Christmas
12 drummers
Anne
Isaac
Jim
a
aardvark
b
island
job
(10 rows)
jim=# select * from test order by a collate "en_US.utf8";
a
----------------------
12 Days of Christmas
12 drummers
a
aardvark
Anne
b
Isaac
island
Jim
job
(10 rows)
a
----------------------
12 Days of Christmas
12 drummers
a
aardvark
Anne
b
Isaac
island
Jim
job
(10 rows)
On Tue, Dec 12, 2017 at 10:36 AM, James Keener <jim@jimkeener.com> wrote:
en_US.utf8. is still 0-9A-Za-z and in my example set (as it's my default too :))You'd need a case insensitive collation to do what you described, and I'm not sure those exist in postgres. (I guess you could always build your own if you _really_ wanted to.JimOn Tue, Dec 12, 2017 at 10:24 AM, John McKown <john.archie.mckown@gmail.com> wrote:The default C locale on Linux (I don't know Windows) will sort "digits", then alphabetic with the lower then upper case of each letter in order like: "aAbB...zZ"That's no true at all! The C locales are 0-9A-Za-zThanks for the correction. Turns out that I forgot that my default locale on Linux was en_US.utf8.--I have a theory that it's impossible to prove anything, but I can't prove it.Maranatha! <><
John McKown
Sorry for spamming the list. It appears that I'm an idiot. Sorry :(
I guess we're even now. We both made a similar mistake.
But, despite my error, I still think the OP's need for an EBCDIC order is to compare output from parallel runs of an application from both an EBCDIC host and and ASCII host, to make sure they are "the same".
I have a theory that it's impossible to prove anything, but I can't prove it.
Maranatha! <><
John McKown
John McKown
If that's the case, I wonder if OP could write a function that would convert from the ASCII code-point ot the EBCDIC codepoint. For instance, (using the function at https://wiki.postgresql.org/ wiki/Binary_Replace) convert A to char 193, `select binary_replace('Anne'::bytea, 'A'::bytea, '\xc1'::bytea);` and then sort by the result of this function.
Jim
On Tue, Dec 12, 2017 at 10:47 AM, John McKown <john.archie.mckown@gmail.com> wrote:
--Sorry for spamming the list. It appears that I'm an idiot. Sorry :(I guess we're even now. We both made a similar mistake.But, despite my error, I still think the OP's need for an EBCDIC order is to compare output from parallel runs of an application from both an EBCDIC host and and ASCII host, to make sure they are "the same".I have a theory that it's impossible to prove anything, but I can't prove it.Maranatha! <><
John McKown
On Tue, 12 Dec 2017 07:18:10 -0600, John McKown <john.archie.mckown@gmail.com> wrote: >?This is a guess on my part, based on many years on an EBCDIC system. But >I'll bet that they are doing a conversion off of the EBCDIC system (maybe >Db2 on z/OS) to an ASCII system (Linux or Windows) running PostgreSQL. They >want to be able to compare the output from the existing system to the >output on the new system. [he asks naively:] If a port off the EBCDIC system really is the case, wouldn't it be easier just to write an external compare tool? Or maybe a PG function to convert UTF strings as EBCDIC so as to compare string column dumps? Either of the above seems more rational than putting a whole lot of effort into creating an EBCDIC collation. [Not that it is terribly hard to create a new collation. It's more a matter of "why bother?" when you are trying to move away from that system in the 1st place.] YMMV, George
On Tue, Dec 12, 2017 at 5:18 AM, John McKown <john.archie.mckown@gmail.com> wrote: > On Tue, Dec 12, 2017 at 2:17 AM, Tsunakawa, Takayuki > <tsunakawa.takay@jp.fujitsu.com> wrote: >> >> Hi Laurenz, Tom, Peter, >> >> Thanks for your suggestions. The practical solution seems to be to >> override comparison operators of char, varchar and text data types with UDFs >> that behave as Tom mentioned. >> >> From: Peter Geoghegan [mailto:pg@bowt.ie] >> > That said, the idea of an "EBCDIC collation" seems limiting. Why >> > should a system like DB2 for the mainframe (that happens to use EBCDIC >> > as its encoding) not have a more natural, human-orientated collation >> > even while using EBCDIC? ISTM that the point of using the "C" locale >> > (with EBDIC or with UTF-8 or with any other encoding) is to get a >> > performance benefit where the actual collation's behavior doesn't >> > matter much to users. Are you sure it's really important to be >> > *exactly* compatible with EBCDIC order? As long as you're paying for a >> > custom collation, why not just use a collation that is helpful to >> > humans? >> >> You are right. I'd like to ask the customer whether and why they need >> EBCDIC ordering. > > > This is a guess on my part, based on many years on an EBCDIC system. But > I'll bet that they are doing a conversion off of the EBCDIC system (maybe > Db2 on z/OS) to an ASCII system (Linux or Windows) running PostgreSQL. They > want to be able to compare the output from the existing system to the output > on the new system. EBCDIC orders "lower case", "upper case", then "digits". ICU supports creating custom collations that reorder upper and lower case, or digits with scripts (e.g. Latin alphabet characters). See the documentation -- "23.2.2.3.2. ICU collations". Advanced customization is possible. -- Peter Geoghegan
On 12/12/17 10:21, Peter Geoghegan wrote: > ICU supports creating custom collations that reorder upper and lower > case, or digits with scripts (e.g. Latin alphabet characters). See the > documentation -- "23.2.2.3.2. ICU collations". Advanced customization > is possible. I just gave this a try and I'm not sure I completely understand what it's doing but it definitely doesn't look like it's sorting according to EBCDIC byte orders. (This was on centos 7 with libicu-50.1.2) I might have created the collation incorrectly. I'd love to hear if anyone else finds a way to get this working outside of custom comparison operators. pops-10.1 root@db1=# create collation "ebcdic" (provider=icu, locale='cp037'); CREATE COLLATION pops-10.1 root@db1=# create table test (data text); CREATE TABLE pops-10.1 root@db1=# insert into test values('a'),('b'),('A'),('B'),('1'),('2'),('!'),('^'); INSERT 0 8 pops-10.1 root@db1=# select * from test order by data collate "ebcdic"; data ------ ! ^ 1 2 a A b B (8 rows) -- http://about.me/jeremy_schneider