Thread: An idea on faster CHAR field indexing
Hi folks,

This is my first post to your list. I've been reading it for about a week, and I like the quality of the developers here; I think this portends well for the future of Postgres.

Anyway, an idea. I'm not sure if RDBMSs already implement this technique internally, but in case they don't, and in case you've never thought of it, here's something I just thought of:

CHAR fields have different sorting (aka collation) rules for each code page, e.g. the very fact that A comes before B is something that the collation info for a given code page has to specify. Just because a character has a lower value than another character in its encoding in a given code page doesn't mean it gets sorted first.

So let me cut to the chase: I'm thinking that rather than store the actual character sequence of each field (or some subset of a field) in an index, why not translate the characters into their collation sequence values and store _those_ in the index? The idea is to reduce the number of times a string has to be converted to its mathematical sorting-order representation. Don't do it every time two strings get compared; do it when a record is inserted or the field is updated.

Is this already done? Or is it not such a good idea for some reason?

I'd consider this idea of greater value in something like Unicode. For 16-bit Unicode the lookup table to find each character's ordinal value (or sorting value, whatever it's called) is 128K, right? Doing a bunch of lookups into that has to be bad for a processor's L1 and L2 caches.
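For concreteness, a minimal sketch in C of the scheme the post describes, assuming a single-byte code page and a collation simple enough for one pass. The weight[] table and to_sort_key() are hypothetical names; a real table would be filled from the locale's collation data rather than the identity mapping used here so the example runs:

    #include <stddef.h>
    #include <stdio.h>
    #include <string.h>

    /* Hypothetical 256-entry table mapping each character of a
       single-byte code page to its collation weight. */
    static unsigned char weight[256];

    /* Rewrite s in place as collation weights; for a one-pass locale
       the result compares correctly with plain memcmp(). */
    static void to_sort_key(unsigned char *s, size_t n)
    {
        size_t i;

        for (i = 0; i < n; i++)
            s[i] = weight[s[i]];
    }

    int main(void)
    {
        unsigned char field[] = "Abc";
        int i;

        for (i = 0; i < 256; i++)
            weight[i] = (unsigned char) i;   /* identity, for illustration */

        to_sort_key(field, strlen((const char *) field));
        /* An index entry would now store `field`; lookups and range
           scans can compare entries with memcmp() instead of calling
           locale routines on every comparison. */
        printf("key bytes: %02x %02x %02x\n", field[0], field[1], field[2]);
        return 0;
    }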
> So let me cut to the chase: I'm thinking that rather than store the
> actual character sequence of each field (or some subset of a field)
> in an index why not translate the characters into their collation
> sequence values and store _those_ in the index?

This is not an obvious win, since:

1. some collation rules require multiple passes over the data

2. POSIX strxfrm() will convert strings of characters to a form that can be compared by strcmp() [i.e. single pass] but tends to greatly increase memory requirements

I've only got data for one implementation of strxfrm(), but the memory usage startled me. In my application it was faster to use strcoll() directly for collation than to pre-expand the data with strxfrm().

Regards,

Giles
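A sketch of the trade-off Giles is describing, assuming a POSIX system. make_sort_key() is a hypothetical helper that pre-expands a string with strxfrm() so later comparisons can use plain strcmp(); the memory cost he mentions is the size of those expanded keys:

    #include <stdlib.h>
    #include <string.h>
    #include <locale.h>
    #include <stdio.h>

    /* Transform s into a locale-specific sort key; caller frees it. */
    static char *make_sort_key(const char *s)
    {
        size_t n = strxfrm(NULL, s, 0) + 1;   /* transformed length + NUL */
        char *key = malloc(n);

        if (key != NULL)
            strxfrm(key, s, n);
        return key;
    }

    int main(void)
    {
        const char *a = "Stra\xdf" "e";   /* "Straße" in Latin-1 */
        const char *b = "Strasse";

        setlocale(LC_COLLATE, "");        /* collate per the environment */

        /* Option 1: pay strcoll() on every comparison. */
        int direct = strcoll(a, b);

        /* Option 2: pre-expand once, then compare cheaply with strcmp().
           The keys can be several times larger than the inputs, which is
           the memory cost observed in this thread. */
        char *ka = make_sort_key(a);
        char *kb = make_sort_key(b);
        int viakey = (ka && kb) ? strcmp(ka, kb) : 0;

        printf("strcoll: %d  via strxfrm keys: %d\n", direct, viakey);
        free(ka);
        free(kb);
        return 0;
    }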
Giles,

I'm curious as to why the need for multiple passes. Is that true even in Latin-1 code pages? If not, this optimization could at least be used for code pages that don't require multiple passes.

As for memory usage: I don't see the issue here. The translation to some collation sequence has to be done anyhow. Writing one's own routine to do lookups into a collation sequence table is a fairly trivial exercise. One would have the option with SBCS code pages to either translate to 8-bit collation values or to translate them into master Unicode collation values. I'm not sure what the advantage of doing the latter would be; I only see it as useful if you have different rows storing text in different code pages, and then only if the RDBMS can know, for a given field on a per-row basis, what its code page is.

On Thu, 22 Jun 2000 06:59:06 +1000, Giles Lean wrote:

>> So let me cut to the chase: I'm thinking that rather than store the
>> actual character sequence of each field (or some subset of a field)
>> in an index why not translate the characters into their collation
>> sequence values and store _those_ in the index?
>
> This is not an obvious win, since:
>
> 1. some collation rules require multiple passes over the data
>
> 2. POSIX strxfrm() will convert strings of characters to a form that
>    can be compared by strcmp() [i.e. single pass] but tends to greatly
>    increase memory requirements
>
> I've only got data for one implementation of strxfrm(), but the memory
> usage startled me. In my application it was faster to use strcoll()
> directly for collation than to pre-expand the data with strxfrm().
>
> Regards,
>
> Giles
> I'm curious as to why the need for multiple passes. Is that true
> even in Latin-1 code pages?

Yes. Some locales want strings to be ordered first by ignoring any accents on characters, then using a tie-break on equal strings by doing a comparison that includes the accents.

To take another of your points out of order: this is an obstacle that Unicode doesn't resolve. Unicode gives you a character set capable of representing characters from many different locales, but collation order will remain locale specific.

> If not, this optimization could at least
> be used for code pages that don't require multiple passes.

... but due to the increased memory/disk space, this is likely not an optimisation. Measurements needed, I'd suggest.

My only experience of this was tuning a sort utility, where the extra time to convert the strings with strxfrm() and the large additional memory requirement killed any advantage strcmp() had over strcoll(). Whether this would be the case for database indexes in general, or indeed ever, I don't know.

> As for memory usage: I don't see the issue here. The translation to
> some collation sequence has to be done anyhow.

No; you can do the comparisons in multiple passes instead, without extra storage allocation. Using multiple passes will be efficient if the comparisons mostly don't need the second pass, which I suspect is typical.

> Writing one's own routine to do look-ups into a collation sequence
> table is a fairly trivial exercise.

True. But if you can't do character-by-character comparisons then such a simplistic implementation will fail. I hadn't mentioned this time around (but see the archives for the recent discussion of LIKE) that there are locales with 2:1 and 1:2 mappings of characters too.

Regards,

Giles
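The multi-pass comparison Giles describes might look something like this C sketch. base_letter() is purely illustrative (a real locale derives its primary weights from collation tables and handles far more than a few Latin-1 accents), but it shows how the accent tie-break avoids any extra storage:

    #include <stdio.h>
    #include <string.h>

    /* Hypothetical primary-weight function: map a few accented
       Latin-1 letters to their base letter. */
    static unsigned char base_letter(unsigned char c)
    {
        if (c >= 0xE0 && c <= 0xE5) return 'a';   /* à á â ã ä å */
        if (c >= 0xF2 && c <= 0xF6) return 'o';   /* ò ó ô õ ö */
        return c;
    }

    /* Accents are ignored unless the strings are otherwise equal.
       No pre-expanded keys are needed, unlike the strxfrm() route. */
    static int collate_two_pass(const char *s1, const char *s2)
    {
        const unsigned char *p = (const unsigned char *) s1;
        const unsigned char *q = (const unsigned char *) s2;

        /* Pass 1: compare base letters only. */
        for (; *p && *q; p++, q++) {
            int d = (int) base_letter(*p) - (int) base_letter(*q);

            if (d != 0)
                return d;
        }
        if (*p || *q)
            return (int) *p - (int) *q;   /* a prefix sorts first */

        /* Pass 2: tie-break including accents (raw bytes stand in
           for real accent weights here). */
        return strcmp(s1, s2);
    }

    int main(void)
    {
        const char a[] = { 'r', (char) 0xF4, 'l', 'e', 0 };   /* "rôle" */
        const char b[] = "role";

        /* Ties on pass 1; the accent decides on pass 2. */
        printf("%d\n", collate_two_pass(a, b));
        return 0;
    }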
Giles,

On Thu, 22 Jun 2000 11:12:54 +1000, Giles Lean wrote:

> Yes. Some locales want strings to be ordered first by ignoring any
> accents on characters, then using a tie-break on equal strings by doing
> a comparison that includes the accents.

I guess I don't see how this is really any different. Why order first by the character and second by the accent? For instance, if you know the relative order of the various forms of "o" then just give them all successive numbers and do a single-pass sort. You just have to make sure that all the numbers in that set of numbers are greater than the number you assign to "m" and less than the number you assign to "p".

> To take another of your points out of order: this is an obstacle that
> Unicode doesn't resolve. Unicode gives you a character set capable of
> representing characters from many different locales, but collation
> order will remain locale specific.

With Unicode you have to have a collation order that cuts across what used to be separate character sets in separate code pages.

> ... but due to the increased memory/disk space, this is likely not an
> optimisation. Measurements needed, I'd suggest.

But why is there increased memory and disk space? Do the fields that go into an index not already get stored twice? Or does the index just contain a series of references to records, and that is it?
"Randall Parker" <randall@nls.net> writes: > On Thu, 22 Jun 2000 11:12:54 +1000, Giles Lean wrote: >> Yes. Some locales want strings to be ordered first by ignoring any >> accents on chracters, then using a tie-break on equal strings by doing >> a comparison that includes the accents. > I guess I don't see how this is really any different. Why order first > by the character and second by the accent? For instance, if you know > the relative order of the various forms of "o" then just give them all > successive numbers and do a single pass sort. You just have to make > sure that all the numbers in that set of numbers are greater than the > number you assign to "m" and less than the number you assign to "p". Nope. Would it were that easy. I don't have a keyboard that will let me type a proper example, but consider 1. a o-with-type-1-accent c 2. a o-with-type-2-accent b If type-1 accent sorts before type-2 then your proposal will consider string 1 less than string 2. But the correct answer (in these locales) is the other way round, because you mustn't look at the accents at all unless you discover that the strings are otherwise equal. The determining comparison here is that b < c, therefore string 2 < string 1. regards, tom lane
Giles Lean <giles@nemeton.com.au> writes:
> My only experience of this was tuning a sort utility, where the extra
> time to convert the strings with strxfrm() and the large additional
> memory requirement killed any advantage strcmp() had over strcoll().
> Whether this would be the case for database indexes in general, or
> indeed ever, I don't know.

Interesting. That certainly suggests strxfrm could be a loser for a database index too, but I agree it'd be nice to see some actual measurements rather than speculation.

What locale(s) were you using when testing your sort code? I suspect the answers might depend on locale quite a bit...

regards, tom lane
> Interesting. That certainly suggests strxfrm could be a loser for
> a database index too, but I agree it'd be nice to see some actual
> measurements rather than speculation.
>
> What locale(s) were you using when testing your sort code? I suspect
> the answers might depend on locale quite a bit...

I did a little more measurement today. It's still only anecdotal evidence -- I wasn't terribly rigorous -- but here are my results.

My data file consisted of ~660,000 lines and a total size of ~200MB. Each line had part descriptions in German and some uninteresting fields. I stripped out the uninteresting fields and read the file, calling strxfrm() for each line. I recorded the total input bytes and the total bytes returned by strxfrm().

HP-UX 11.00, de_DE.roman8 locale:

    input bytes:  179647811
    result bytes: 1447833496  (increase factor 8.05)

Solaris 2.6, de_CH locale:

    input bytes:  179647811
    result bytes: 1085875122  (increase factor 6.04)

I didn't time the test program on Solaris, but on HP-UX this program took longer to run than a simplistic qsort() using strcoll() does, and my comparison sort program has to write the data out as well, which the strxfrm()-calling program didn't do.

Regards,

Giles
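A rough reconstruction of the kind of measurement program Giles describes; this is a sketch, not his actual test code. The field stripping is omitted and the buffer sizes are illustrative:

    #include <stdio.h>
    #include <string.h>
    #include <locale.h>

    int main(void)
    {
        char line[4096];
        char key[65536];
        unsigned long in_bytes = 0, out_bytes = 0;

        setlocale(LC_COLLATE, "");   /* e.g. de_DE.roman8 or de_CH */

        while (fgets(line, sizeof line, stdin) != NULL) {
            size_t len = strcspn(line, "\n");

            line[len] = '\0';
            in_bytes += len;
            /* strxfrm() returns the transformed length even when the
               buffer is too small, so the totals stay correct. */
            out_bytes += strxfrm(key, line, sizeof key);
        }
        printf("input bytes:  %lu\nresult bytes: %lu\n",
               in_bytes, out_bytes);
        return 0;
    }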
[ I hope this is useful for the archives, even though it's pretty much been covered already. --giles ]

> With Unicode you have to have a collation order that cuts across
> what used to be separate character sets in separate code pages.

From the glossary at http://www.unicode.org/glossary/index.html:

    Collation

    The process of ordering units of textual information. Collation is
    usually specific to a particular language. Also known as
    alphabetizing or alphabetic sorting. Unicode Technical Report #10,
    "Unicode Collation Algorithm," defines a complete, unambiguous,
    specified ordering for all characters in the Unicode Standard.

"Unicode Technical Report #10" is accessible at:

    http://www.unicode.org/unicode/reports/tr10/

This technical report provides a way to specify "the tailoring for any particular country" which, together with the standard ordering, will allow different Unicode implementations to sort any particular input identically.

The summary is that Unicode still has locales, and we still have to know not only the character set/code page but also the locale ("country specific tailoring") an index was created with in order to use it.

Regards,

Giles