Re: How to add locale support for each column? - Mailing list pgsql-hackers
From | Greg Stark |
---|---|
Subject | Re: How to add locale support for each column? |
Date | |
Msg-id | 87wtyh8quu.fsf@stark.xeocode.com |
In response to | Re: How to add locale support for each column? (Stephan Szabo <sszabo@megazone.bigpanda.com>) |
List | pgsql-hackers |
Stephan Szabo <sszabo@megazone.bigpanda.com> writes:

> I'd thought there was still a question of where such a thing would live?
> If it's an external project or a contrib thing, the above might be true,
> but if it's meant to be a truly supported internal builtin then the
> function call cost is part of the implementation and is significant data
> that cannot be thrown out.

Well it seems to be consensus that it would be good to have a complete
locale handling as envisioned by the spec. But I don't see that as relevant
to this discussion.

I'm comparing a function handling strxfrm with a function handling lower()
and with sorting on a column directly. The point was to demonstrate that it
was practical (if not ideal) to switch locales repeatedly, especially when
you take into account that *any* function will have some overhead anyways.

If it were built into postgres the overhead might be lower, but I doubt by
much, and in any case it's just not an option for me now.

> Aparently the message I responded to hung around for a while before
> getting to me because they came out of order.

That seems to be happening a lot lately.

> I agree in general, but if part of this involves forcing "C" locale (see
> my question at the end) and so any locale sorting is forced to do this,
> then if a query in en_US currently takes 7 seconds, but now will take 17,
> I think that's significant.

I compared against sorting in C locale. It would be interesting to know how
much of the penalty came from simply having to do the work strxfrm vs the
overhead of switching locales.

The former is inevitable. *Any* implementation of locale collation orders is
going to have to do it. The latter is maybe something we can work on
reducing, though not without considerable cost in terms of code complexity.
It will mean either lobbying for API changes in libc or growing the codebase
of postgres by the size of an entire i18n package. I strongly suspect
maintaining i18n packages turns out to be a *lot* of work.

> Was your strxfrm comparison against a column comparison in "C" locale then
> rather than one using en_US or some other such locale?

C. I could compare it against sorting in a database created in a given
locale, but I suspect I'll find gprof output more directly helpful.

> But we don't presumably have to look up the locale each time as you note.

The question is whether looking up the locale is significant compared to
executing strxfrm. I suspect it'll be significant, but not the majority of
the time. The real question is whether speeding up sorting by removing that
overhead is worth the complexity of abandoning libc.

I would strongly urge people to consider writing postgres support to assume
standard libc functionality. If we can convince glibc and BSD libc people to
add a more reasonable interface we can optionally use it, just as we do
other more modern interfaces to old features. If some platforms are just
terminally braindead we should look for ways to support people installing
gnu libintl (or whatever the glibc i18n chunk is called) separately and
using it like we do libreadline, libkrb, or libz.

> More importantly, do we have know whether or not this function really works
> properly in non-C locales? Is the strxfrm result guaranteed to sort
> correctly (using strcoll) in others?

Well you wouldn't want to use strcoll at all actually, just strcmp.

Actually Conway's reimplementation returns a bytea which is probably more
correct than my original plan to return text. Though I should check whether
postgres has to do extra work to sort bytea data instead of varchar data,
especially since strxfrm should never return strings containing nuls.

--
greg
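For readers following the strxfrm-then-strcmp point, here is a minimal sketch in plain C of the pattern under discussion: switch LC_COLLATE, transform the string, switch back. The name xfrm_key is hypothetical and this is not the function from the thread or Conway's reimplementation; it only assumes the standard libc calls setlocale() and strxfrm().

```c
#include <assert.h>
#include <locale.h>
#include <stdlib.h>
#include <string.h>

/*
 * Hypothetical helper: return a malloc'd strxfrm() key for src under the
 * named LC_COLLATE locale, or NULL if the locale cannot be selected or
 * memory runs out. The previous LC_COLLATE setting is restored before
 * returning. The caller frees the result.
 */
char *xfrm_key(const char *src, const char *locale_name)
{
    assert(src != NULL && locale_name != NULL);

    /* setlocale(cat, NULL) only queries; copy the answer because a later
     * setlocale() call may overwrite the returned buffer. */
    const char *cur = setlocale(LC_COLLATE, NULL);
    if (cur == NULL)
        return NULL;
    char *saved = malloc(strlen(cur) + 1);
    if (saved == NULL)
        return NULL;
    strcpy(saved, cur);

    char *key = NULL;
    if (setlocale(LC_COLLATE, locale_name) != NULL) {
        /* With n == 0, strxfrm() returns the key length without the
         * terminating NUL, and the destination may be a null pointer. */
        size_t need = strxfrm(NULL, src, 0);
        key = malloc(need + 1);
        if (key != NULL)
            strxfrm(key, src, need + 1);
    }

    /* This save/switch/restore is the per-call overhead in question. */
    setlocale(LC_COLLATE, saved);
    free(saved);
    return key;
}
```

Two keys produced under the same locale compare with plain strcmp(), and the sign of that result matches what strcoll() would give for the original strings in that locale. So strcmp(xfrm_key(a, loc), xfrm_key(b, loc)) orders a and b by loc's collation rules without calling strcoll at sort time.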