Re: How to add locale support for each column? - Mailing list pgsql-hackers

From Greg Stark
Subject Re: How to add locale support for each column?
Msg-id 873c15agzi.fsf@stark.xeocode.com
In response to Re: How to add locale support for each column?  (Stephan Szabo <sszabo@megazone.bigpanda.com>)
Responses Re: How to add locale support for each column?
List pgsql-hackers
Stephan Szabo <sszabo@megazone.bigpanda.com> writes:

> But shouldn't the comparison be against sorting on col not lower(col)?
> strxfrm(col) sorts seem comparable to col, strxfrm(lower(col)) sorts seem
> comparable to lower(col). Some collations do treat 'A' and 'a' as being
> adjacent in sort order, but that's not a guarantee, so it's not valid to
> say, "everywhere you'd use lower(col) you can use strxfrm instead."

Well, in my implementation strxfrm is a postgresql function. So I wanted to
compare it with an expression that had at least as much overhead as a
postgresql expression with a single function call.

> And in past numbers you sent, it looked like the amounts were: 1s for sort
> on col, 1.5s for sort on lower(col), 2.5s for sort on strxfrm(col).  That
> doesn't seem negligible to me

Right, I amended my "negligible" claim. It's a significant but reasonable
cost. A 1.5s delay on sorting 100k rows is certainly not the kind of
delay that would make the idea of switching locales intolerable.

> unless that doesn't grow linearly with the number of rows.

Well I was comparing sorting 206,000 rows. Even if it scales linearly, a 10s
delay on sorting 2M records isn't really fatal. I certainly wouldn't want to
remove the ability to sort using strcmp if the data is ascii or binary. But if
you're going to use locale collation order it's going to be slower. strxfrm
has to do quite a bit of work. Even a postgres-internal mechanism is going to
have to do that same work.
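
To make the cost being discussed concrete, here is a minimal sketch of a strxfrm-based sort in plain C. It is not the Postgres function from my implementation; the names keyed_row, make_sort_key, cmp_keyed and sort_by_xfrm are made up for illustration. It transforms each string once under the current LC_COLLATE locale and then compares the transformed keys with strcmp.

```c
#include <assert.h>
#include <locale.h>   /* callers pick the collation with setlocale(LC_COLLATE, ...) */
#include <stdlib.h>
#include <string.h>

/* Hypothetical row type: the original string plus its precomputed sort key. */
typedef struct {
    const char *text;
    char *key;
} keyed_row;

/* Transform s under the current LC_COLLATE locale; the caller frees the result. */
static char *make_sort_key(const char *s)
{
    size_t need = strxfrm(NULL, s, 0) + 1;  /* first call only measures the key */
    char *key = malloc(need);
    assert(key != NULL);
    strxfrm(key, s, need);
    return key;
}

/* Transformed keys order correctly under plain strcmp. */
static int cmp_keyed(const void *a, const void *b)
{
    return strcmp(((const keyed_row *)a)->key, ((const keyed_row *)b)->key);
}

/* Sort an array of strings in place in the current locale's collation order. */
void sort_by_xfrm(const char **strings, size_t n)
{
    if (n == 0)
        return;
    keyed_row *rows = malloc(n * sizeof *rows);
    assert(rows != NULL);
    for (size_t i = 0; i < n; i++) {
        rows[i].text = strings[i];
        rows[i].key = make_sort_key(strings[i]);  /* one strxfrm pass per row */
    }
    qsort(rows, n, sizeof *rows, cmp_keyed);
    for (size_t i = 0; i < n; i++) {
        strings[i] = rows[i].text;
        free(rows[i].key);
    }
    free(rows);
}
```

The transformation work happens once per row rather than once per comparison, which is the work any locale-aware sort has to pay for somewhere.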

The only time you could save is the time it takes to look up "en_US" in a list
(or hash) of cached locales and switch a pointer. I suspect that's going to be
only a small (but not negligible) portion of the overhead. I guess this is
subject to analysis; I'll try to do a gprof run at some point to answer that.


> > I see no reason to think Postgres's implementation of looking up xfrm rules
> > for the specified locale will be any faster than the OS's. We know some OS's
> > suck but some certainly don't.
> 
> But do you have to change locales per row or per sort? Presumably, a built
> in implementation may be able to do the latter rather than the former.

We certainly need the ability to change the locales per-row, in fact possibly
multiple times per row.

Consider

select en,fr from translations order by en,fr

Which is actually something reasonable I could have to do in my current
project.

However, changing locales should be nigh-instantaneous; it really ought to be
just changing a pointer. And in the API Tom foresees it shouldn't even happen.
The only cost of sorting on many locales (aside from the initial load) would
be in the reduced cache hit rate from using more locale tables.
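
As a rough illustration of per-column locales in a single sort, the sketch below uses the POSIX.1-2008 locale-handle calls (newlocale, strcoll_l, freelocale). This is only an assumption about what such an API could look like, not the API Tom has in mind; the translation struct, cmp_translation and sort_translations are hypothetical names. Each locale is loaded once up front, and "switching" inside the comparator is just passing a different handle.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <locale.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical row for "select en,fr from translations order by en,fr". */
typedef struct {
    const char *en;
    const char *fr;
} translation;

/* Handles loaded once per sort. File-scope because qsort takes no context arg. */
static locale_t en_loc;
static locale_t fr_loc;

/* Compare on en under one locale, then on fr under another. */
static int cmp_translation(const void *a, const void *b)
{
    const translation *x = a;
    const translation *y = b;
    int c = strcoll_l(x->en, y->en, en_loc);
    if (c != 0)
        return c;
    return strcoll_l(x->fr, y->fr, fr_loc);
}

/* Returns 0 on success, -1 if either locale name cannot be loaded. */
int sort_translations(translation *rows, size_t n,
                      const char *en_name, const char *fr_name)
{
    assert(rows != NULL || n == 0);
    en_loc = newlocale(LC_COLLATE_MASK, en_name, (locale_t)0);
    if (en_loc == (locale_t)0)
        return -1;
    fr_loc = newlocale(LC_COLLATE_MASK, fr_name, (locale_t)0);
    if (fr_loc == (locale_t)0) {
        freelocale(en_loc);
        return -1;
    }
    qsort(rows, n, sizeof *rows, cmp_translation);
    freelocale(en_loc);
    freelocale(fr_loc);
    return 0;
}
```

A call might look like sort_translations(rows, n, "en_US", "fr_FR"), assuming those locale names are installed on the system; which names exist is platform-dependent.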

-- 
greg


