Re: How to add locale support for each column? - Mailing list pgsql-hackers

From Stephan Szabo
Subject Re: How to add locale support for each column?
Date
Msg-id 20040925194941.M12568@megazone.bigpanda.com
In response to Re: How to add locale support for each column?  (Greg Stark <gsstark@mit.edu>)
Responses Re: How to add locale support for each column?  (Greg Stark <gsstark@mit.edu>)
List pgsql-hackers
On Sat, 25 Sep 2004, Greg Stark wrote:

> Stephan Szabo <sszabo@megazone.bigpanda.com> writes:
>
> > But shouldn't the comparison be against sorting on col not lower(col)?
> > strxfrm(col) sorts seem comparable to col, strxfrm(lower(col)) sorts seem
> > comparable to lower(col). Some collations do treat 'A' and 'a' as being
> > adjacent in sort order, but that's not a guarantee, so it's not valid to
> > say, "everywhere you'd use lower(col) you can use strxfrm instead."
>
> Well, in my implementation strxfrm is a postgresql function. So I wanted to
> compare it with an expression that had at least as much overhead as a
> postgresql expression with a single function call.

I'd thought there was still a question of where such a thing would live?
If it's an external project or a contrib thing, the above might be true,
but if it's meant to be a truly supported internal builtin then the
function call cost is part of the implementation and is significant data
that cannot be thrown out.

> > And in past numbers you sent, it looked like the amounts were: 1s for sort
> > on col, 1.5s for sort on lower(col), 2.5s for sort on strxfrm(col).  That
> > doesn't seem negligible to me
>
> Right, I amended my "negligible" claim. It's a significant but reasonable
> speed. A 1.5s delay on sorting 100k rows is certainly not the kind of
> intolerable delay that would make the idea of switching locales intolerable.

Apparently the message I responded to hung around for a while before
getting to me, because the messages arrived out of order.

> > unless that doesn't grow linearly with the number of rows.
>
> Well I was comparing sorting 206,000 rows. Even if it scales linearly, a 10s
> delay on sorting 2M records isn't really fatal. I certainly wouldn't want to

I agree in general, but if part of this involves forcing "C" locale (see
my question at the end), so that any locale-aware sort is forced to go
through strxfrm, then a query in en_US that currently takes 7 seconds but
will now take 17 is, I think, a significant regression.

> remove the ability to sort using strcmp if the data is ascii or binary. But if
> you're going to use locale collation order it's going to be slower. strxfrm
> has to do quite a bit of work. Even a postgres-internal mechanism is going to
> have to do that same work.

Was your strxfrm comparison against a column comparison in "C" locale then
rather than one using en_US or some other such locale?
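The distinction matters when reading the timings quoted above. As a rough
way to reproduce that kind of comparison, here is a minimal sketch in Python,
whose locale module wraps the C library's strxfrm. The helper names
time_sorts and make_rows are hypothetical, and the absolute numbers will
differ from a PostgreSQL sort. Note that the plain "col" case here compares
by code point, so it corresponds to a "C" locale baseline and not to an
en_US one.

```python
import locale
import random
import string
import time


def make_rows(n, seed=0):
    """Generate n pseudo-random 12-letter strings (illustrative data only)."""
    rng = random.Random(seed)
    return ["".join(rng.choices(string.ascii_letters, k=12)) for _ in range(n)]


def time_sorts(rows, locale_name="C"):
    """Time three sorts of the same rows and return seconds per sort key."""
    previous = locale.setlocale(locale.LC_COLLATE)  # query current setting
    locale.setlocale(locale.LC_COLLATE, locale_name)
    try:
        timings = {}
        for label, key in (
            ("col", None),                      # plain code-point comparison
            ("lower(col)", str.lower),          # one function call per row
            ("strxfrm(col)", locale.strxfrm),   # locale transform per row
        ):
            start = time.perf_counter()
            sorted(rows, key=key)
            timings[label] = time.perf_counter() - start
        return timings
    finally:
        # Restore whatever collation locale was active before.
        locale.setlocale(locale.LC_COLLATE, previous)
```

Calling time_sorts(make_rows(206000), "en_US.UTF-8") would exercise a
non-C locale, assuming that locale is installed on the machine;
locale.setlocale raises locale.Error if it is not.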

> > > I see no reason to think Postgres's implementation of looking up xfrm rules
> > > for the specified locale will be any faster than the OS's. We know some OS's
> > > suck but some certainly don't.
> >
> > But do you have to change locales per row or per sort? Presumably, a built
> > in implementation may be able to do the latter rather than the former.
>
> We certainly need the ability to change the locales per-row, in fact possibly
> multiple times per row.

But presumably we don't have to look up the locale each time, as you note.
I don't see why it matters whether our implementation is any faster than
the OS's if we simply do the lookup less often.  Now, as you also pointed
out, it may turn out that the time for the OS lookups after the first *is*
reasonably insignificant.
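To make the per-row versus per-sort distinction concrete, here is a small
sketch that counts locale switches under each strategy. It is only an
illustration: CountingLocale and both sort functions are hypothetical names,
and it says nothing about how expensive a repeated switch is on a given OS.

```python
import locale


class CountingLocale:
    """Hypothetical helper that counts how often the locale is switched."""

    def __init__(self):
        self.switches = 0

    def set(self, name):
        self.switches += 1
        return locale.setlocale(locale.LC_COLLATE, name)


def sort_switching_per_row(rows, locale_name, counter):
    """Switch the collation locale before transforming every row."""
    previous = locale.setlocale(locale.LC_COLLATE)

    def key(value):
        counter.set(locale_name)
        return locale.strxfrm(value)

    try:
        return sorted(rows, key=key)
    finally:
        locale.setlocale(locale.LC_COLLATE, previous)


def sort_switching_per_sort(rows, locale_name, counter):
    """Switch the collation locale once, then transform every row."""
    previous = locale.setlocale(locale.LC_COLLATE)
    try:
        counter.set(locale_name)
        return sorted(rows, key=locale.strxfrm)
    finally:
        locale.setlocale(locale.LC_COLLATE, previous)
```

Both functions return the same ordering; the only difference is that the
first performs one switch per row and the second performs one per sort.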

--
More importantly, do we know whether or not this function really works
properly in non-C locales?  Is the strxfrm result guaranteed to sort
correctly (using strcoll) in others?
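
ISO C specifies that comparing two strxfrm results with strcmp gives the
same sign as calling strcoll on the original strings, so the orders should
agree in principle. Whether a particular platform honours that is an
empirical question, and one way to spot-check it is sketched below. The
function name orders_agree is hypothetical, and a passing check on a few
words is evidence, not proof.

```python
import functools
import locale


def orders_agree(words, locale_name):
    """Return True if sorting by strxfrm keys matches sorting with strcoll."""
    previous = locale.setlocale(locale.LC_COLLATE)  # query current setting
    locale.setlocale(locale.LC_COLLATE, locale_name)
    try:
        by_xfrm = sorted(words, key=locale.strxfrm)
        by_coll = sorted(words, key=functools.cmp_to_key(locale.strcoll))
        return by_xfrm == by_coll
    finally:
        # Restore whatever collation locale was active before.
        locale.setlocale(locale.LC_COLLATE, previous)
```

To test a non-C locale, pass a name such as "en_US.UTF-8" or "sv_SE.UTF-8";
locale.setlocale raises locale.Error when the locale is not installed, so a
caller probing several locales would need to catch that.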

