Thread: An idea on faster CHAR field indexing

An idea on faster CHAR field indexing

From
"Randall Parker"
Date:
Hi folks,

This is my first post to your list. I've been reading it for about a week. I like the quality of the developers here
and think this portends well for the future of Postgres.

Anyway, an idea. Not sure if RDBMSs internally already implement this technique. But in case they don't, and in case
you've never thought of it, here's something I just thought of:

CHAR fields have different sorting (aka collation) rules for each code page. E.g., the very fact that A comes before B
is something that the collation info for a given code page has to specify. Well, just because a character has a lower
value than another character in its encoding in a given code page doesn't mean it gets sorted first.

So let me cut to the chase: I'm thinking that rather than store the actual character sequence of each field (or some
subset of a field) in an index why not translate the characters into their collation sequence values and store
_those_ in the index?

The idea is to reduce the number of times that a string has to be converted to its mathematical sorting order
representation. Don't do it every time two strings get compared. Do it when a record is inserted or that field is
updated.
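
In C terms, a minimal sketch of what I mean (to_sort_key() is a hypothetical
transform, e.g. one table lookup per character for an 8-bit code page; error
handling and the actual index plumbing are omitted):

    #include <stdlib.h>
    #include <string.h>

    extern char *to_sort_key(const char *s);  /* hypothetical transform */

    struct index_entry {
        char *key;   /* precomputed collation values, not the raw text */
        /* ... reference to the heap row ... */
    };

    /* Done once, when a row is inserted or the field is updated: */
    void fill_entry(struct index_entry *e, const char *field_value)
    {
        e->key = to_sort_key(field_value);
    }

    /* Done on every comparison during index build/search: a plain
       byte compare, no collation table in sight. */
    int entry_cmp(const struct index_entry *a, const struct index_entry *b)
    {
        return strcmp(a->key, b->key);
    }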

Is this already done? Or is it not such a good idea for some reason? 

I'd consider this idea of greater value in something like Unicode. For 16-bit Unicode the lookup table to find each
character's ordinal value (or sorting value, whatever it's called) is 128K, right? Doing a bunch of look-ups into
that has to be bad for the L1 and L2 caches in a processor.


Re: An idea on faster CHAR field indexing

From
Giles Lean
Date:
> So let me cut to the chase: I'm thinking that rather than store the
> actual character sequence of each field (or some subset of a field)
> in an index why not translate the characters into their collation
> sequence values and store _those_ in the index?

This is not an obvious win, since:

1. some collation rules require multiple passes over the data

2. POSIX strxfrm() will convert strings of characters to a form that
   can be compared by strcmp() [i.e. single pass] but tends to greatly
   increase memory requirements

   I've only data for one implementation of strxfrm(), but the memory
   usage startled me.  In my application it was faster to use
   strcoll() directly for collation than to pre-expand the data with
   strxfrm().
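For contrast, the strcoll()-direct approach is just a comparator with
no pre-expanded keys (a sketch; assumes setlocale(LC_COLLATE, "") has
already been called):

    #include <stdlib.h>
    #include <string.h>

    /* qsort() comparator over an array of char *: locale-aware
       comparison at compare time, no extra storage per string. */
    static int coll_cmp(const void *a, const void *b)
    {
        const char *s1 = *(const char *const *)a;
        const char *s2 = *(const char *const *)b;
        return strcoll(s1, s2);
    }

    /* usage: qsort(lines, nlines, sizeof(char *), coll_cmp); */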

Regards,

Giles



Re: An idea on faster CHAR field indexing

From
"Randall Parker"
Date:
Giles,

I'm curious as to why the need for multiple passes. Is that true even in Latin 1 code pages? If not, this optimization
could at least be used for code pages that don't require multiple passes.

As for memory usage: I don't see the issue here. The translation to some collation sequence has to be done anyhow.
Writing one's own routine to do look-ups into a collation sequence table is a fairly trivial exercise.
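
Something along these lines for a single-byte code page, say
(latin1_weight[] is hypothetical and only partially filled in here;
later messages in this thread show why one weight per character isn't
always enough):

    /* One collation weight per byte of the code page, filled in from
       the code page's collation data; accented forms get values next
       to their base letter.  Partial, for illustration only. */
    static const unsigned char latin1_weight[256] = {
        ['a'] = 10, [0xE0] = 11 /* a-grave */, ['b'] = 12, /* ... */
    };

    /* Single-pass compare on the weights. */
    int weight_cmp(const unsigned char *s, const unsigned char *t)
    {
        for (; *s || *t; s++, t++) {
            unsigned a = latin1_weight[*s], b = latin1_weight[*t];
            if (a != b)
                return a < b ? -1 : 1;
        }
        return 0;
    }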

One would have the option with SBCS code pages either to translate to 8-bit collation values or to translate them into
master Unicode collation values. I'm not sure what the advantage of doing the latter would be. I only see it as useful
if you have different rows storing text in different code pages, and then only if the RDBMS can know, for a given
field on a per-row basis, what its code page is.

On Thu, 22 Jun 2000 06:59:06 +1000, Giles Lean wrote:

>
>> So let me cut to the chase: I'm thinking that rather than store the
>> actual character sequence of each field (or some subset of a field)
>> in an index why not translate the characters into their collation
>> sequence values and store _those_ in the index?
>
>This is not an obvious win, since:
>
>1. some collation rules require multiple passes over the data
>
>2. POSIX strxfrm() will convert strings of characters to a form that
>   can be compared by strcmp() [i.e. single pass] but tends to greatly
>   increase memory requirements
>
>   I've only data for one implementation of strxfrm(), but the memory
>   usage startled me.  In my application it was faster to use
>   strcoll() directly for collation than to pre-expand the data with
>   strxfrm().
>
>Regards,
>
>Giles
>


Re: An idea on faster CHAR field indexing

From
Giles Lean
Date:
> I'm curious as to why the need for multiple passes. Is that true
> even in Latin 1 code pages?

Yes.  Some locales want strings to be ordered first by ignoring any
accents on characters, then using a tie-break on equal strings by doing
a comparison that includes the accents.

To take another of your points out of order: this is an obstacle that
Unicode doesn't resolve.  Unicode gives you a character set capable of
representing characters from many different locales, but collation
order will remain locale specific.

> If not, this optimization could at least
> be used for code pages that don't require multiple passes.

... but due to the increased memory/disk space, this is likely not an
optimisation.  Measurements needed, I'd suggest.

My only experience of this was tuning a sort utility, where the extra
time to convert the strings with strxfrm() and the large additional
memory requirement killed any advantage strcmp() had over strcoll().
Whether this would be the case for database indexes in general or
indeed ever, I don't know.

> As for memory usage: I don't see the issue here. The translation to
> some collation sequence has to be done anyhow. 

No; you can do the comparisons in multiple passes instead without
extra storage allocation.  Using multiple passes will be efficient if
the comparisons mostly don't need the second pass, which I suspect is
typical.
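
A minimal sketch of such a multi-pass comparison, assuming hypothetical
primary() and secondary() weight lookups that both map '\0' to 0 (the
lowest weight, so shorter strings sort first) and a 1:1
character-to-weight mapping, which as noted below not every locale has:

    #include <stddef.h>

    extern unsigned primary(unsigned char c);    /* base-letter weight */
    extern unsigned secondary(unsigned char c);  /* accent weight      */

    int two_pass_cmp(const unsigned char *a, const unsigned char *b)
    {
        size_t i;

        /* Pass 1: accents ignored; usually decides the comparison. */
        for (i = 0; a[i] != '\0' || b[i] != '\0'; i++)
            if (primary(a[i]) != primary(b[i]))
                return primary(a[i]) < primary(b[i]) ? -1 : 1;

        /* Pass 2: reached only when the base letters are all equal;
           tie-break including the accents. */
        for (i = 0; a[i] != '\0' || b[i] != '\0'; i++)
            if (secondary(a[i]) != secondary(b[i]))
                return secondary(a[i]) < secondary(b[i]) ? -1 : 1;

        return 0;
    }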

> Writing one's own routine to do look-ups into a collation sequence
> table is a fairly trivial exercise.

True.  But if you can't do character-by-character comparisons then
such a simplistic implementation will fail.

I hadn't mentioned this time around (but see the archives for the
recent discussion of LIKE) that there are locales with 2:1 and 1:2
mappings of characters too (for example, a 2:1 mapping like the
traditional Spanish "ch", or a 1:2 mapping like German "ß" sorting
as "ss").

Regards,

Giles



Re: An idea on faster CHAR field indexing

From
"Randall Parker"
Date:
Giles,

On Thu, 22 Jun 2000 11:12:54 +1000, Giles Lean wrote:

>Yes.  Some locales want strings to be ordered first by ignoring any
>accents on characters, then using a tie-break on equal strings by doing
>a comparison that includes the accents.

I guess I don't see how this is really any different. Why order first by the character and second by the accent? For
instance, if you know the relative order of the various forms of "o" then just give them all successive numbers and do
a single pass sort. You just have to make sure that all the numbers in that set of numbers are greater than the number
you assign to "m" and less than the number you assign to "p".

>To take another of your points out of order: this is an obstacle that
>Unicode doesn't resolve.  Unicode gives you a character set capable of
>representing characters from many different locales, but collation
>order will remain locale specific.

With Unicode you have to have a collation order that cuts across what used to be separate character sets in separate
code pages.

>... but due to the increased memory/disk space, this is likely not an
>optimisation.  Measurements needed, I'd suggest.

But why is there increased memory and disk space? Don't the fields that go into an index already get stored twice?
Or does the index just contain a series of references to records and nothing more?


Re: An idea on faster CHAR field indexing

From
Tom Lane
Date:
"Randall Parker" <randall@nls.net> writes:
> On Thu, 22 Jun 2000 11:12:54 +1000, Giles Lean wrote:
>> Yes.  Some locales want strings to be ordered first by ignoring any
>> accents on characters, then using a tie-break on equal strings by doing
>> a comparison that includes the accents.

> I guess I don't see how this is really any different. Why order first
> by the character and second by the accent? For instance, if you know
> the relative order of the various forms of "o" then just give them all
> successive numbers and do a single pass sort. You just have to make
> sure that all the numbers in that set of numbers are greater than the
> number you assign to "m" and less than the number you assign to "p".

Nope.  Would it were that easy.  I don't have a keyboard that will
let me type a proper example, but consider

1.    a  o-with-type-1-accent  c

2.    a  o-with-type-2-accent  b

If type-1 accent sorts before type-2 then your proposal will consider
string 1 less than string 2.  But the correct answer (in these locales)
is the other way round, because you mustn't look at the accents at all
unless you discover that the strings are otherwise equal.  The
determining comparison here is that b < c, therefore string 2 < string 1.
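
To make that concrete with made-up numbers (every weight below is
hypothetical, chosen only to illustrate the failure):

    #include <stdio.h>

    enum {
        A = 1, B = 2, C = 3,   /* plain letters             */
        O1 = 10,               /* o with the type-1 accent  */
        O2 = 11                /* o with the type-2 accent  */
    };

    /* Primary (base-letter) weight with the accent stripped: both
       o forms collapse to the same value. */
    static unsigned base(unsigned w) { return (w == O1 || w == O2) ? 4 : w; }

    int main(void)
    {
        unsigned s1[] = { A, O1, C };  /* "a o-with-type-1-accent c" */
        unsigned s2[] = { A, O2, B };  /* "a o-with-type-2-accent b" */
        int i;

        /* Single-pass compare on raw weights decides at the o:
           O1 < O2, so it wrongly reports string 1 < string 2. */
        for (i = 0; i < 3; i++)
            if (s1[i] != s2[i]) {
                printf("single-pass:  string %d first\n",
                       s1[i] < s2[i] ? 1 : 2);
                break;
            }

        /* Accents-last collation compares base letters first: the
           bases are "a o c" vs "a o b", and b < c, so string 2
           correctly sorts first. */
        for (i = 0; i < 3; i++)
            if (base(s1[i]) != base(s2[i])) {
                printf("accents-last: string %d first\n",
                       base(s1[i]) < base(s2[i]) ? 1 : 2);
                break;
            }
        return 0;
    }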
        regards, tom lane


Re: An idea on faster CHAR field indexing

From
Tom Lane
Date:
Giles Lean <giles@nemeton.com.au> writes:
> My only experience of this was tuning a sort utility, where the extra
> time to convert the strings with strxfrm() and the large additional
> memory requirement killed any advantage strcmp() had over strcoll().
> Whether this would be the case for database indexes in general or
> indeed ever, I don't know.

Interesting.  That certainly suggests strxfrm could be a loser for
a database index too, but I agree it'd be nice to see some actual
measurements rather than speculation.

What locale(s) were you using when testing your sort code?  I suspect
the answers might depend on locale quite a bit...
        regards, tom lane


Re: An idea on faster CHAR field indexing

From
Giles Lean
Date:
> Interesting.  That certainly suggests strxfrm could be a loser for
> a database index too, but I agree it'd be nice to see some actual
> measurements rather than speculation.
> 
> What locale(s) were you using when testing your sort code?  I suspect
> the answers might depend on locale quite a bit...

I did a little more measurement today.  It's still only anecdotal
evidence -- I wasn't terribly rigorous -- but here are my results.

My data file consisted of ~660,000 lines and a total size of ~200MB.
Each line had part descriptions in German and some uninteresting
fields.  I stripped out the uninteresting fields and read the file
calling strxfrm() for each line.  I recorded the total input
bytes and the total bytes returned by strxfrm().

HP-UX 11.00 de_DE.roman8 locale:
input bytes:   179647811
result bytes: 1447833496 (increase factor 8.05)

Solaris 2.6 de_CH locale:
input bytes:   179647811 
result bytes: 1085875122 (increase factor 6.04)

I didn't time the test program on Solaris, but on HP-UX this program
took longer to run than a simplistic qsort() using strcoll() does, and
my comparison sort program has to write the data out as well, which
the strxfrm() calling program didn't do.
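
For reference, the byte counting was along these lines (a
reconstruction, not the exact program; it assumes one record per line
on stdin and takes the locale from the environment):

    #include <locale.h>
    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        char line[8192];
        unsigned long in_bytes = 0, key_bytes = 0;

        setlocale(LC_COLLATE, "");  /* e.g. LC_COLLATE=de_DE */
        while (fgets(line, sizeof line, stdin) != NULL) {
            line[strcspn(line, "\n")] = '\0';
            in_bytes  += strlen(line);
            key_bytes += strxfrm(NULL, line, 0);  /* size needed */
        }
        printf("input bytes:  %lu\nresult bytes: %lu (factor %.2f)\n",
               in_bytes, key_bytes,
               in_bytes ? (double)key_bytes / (double)in_bytes : 0.0);
        return 0;
    }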

Regards,

Giles


Re: An idea on faster CHAR field indexing

From
Giles Lean
Date:
[ I hope this is useful for the archives, even though it's pretty much been covered already. --giles ]

> With Unicode you have to have a collation order that cuts across
> what used to be separate character sets in separate code pages.

From the glossary at http://www.unicode.org/glossary/index.html:

Collation
   The process of ordering units of textual information. Collation is
   usually specific to a particular language. Also known as
   alphabetizing or alphabetic sorting. Unicode Technical Report #10,
   "Unicode Collation Algorithm," defines a complete, unambiguous,
   specified ordering for all characters in the Unicode Standard.

The "Unicode Technical Report #10" is accessible at:

http://www.unicode.org/unicode/reports/tr10/

This technical report provides a way to specify "the tailoring for any
particular country" which together with the standard ordering will
allow different Unicode implementations to sort any particular input
identically.

The summary is that Unicode still has locales, and we still have to
know not only the character set/code page but also the locale
("country specific tailoring") an index was created with in order to
use it.

Regards,

Giles