Thread: Curious sorting puzzle

Curious sorting puzzle

From
Ivan Voras
Date:
The situation is this: we're using a varchar column to store
alphanumeric codes which are by themselves 7-bit clean. But we are
operating under a locale which has its own special collation rules, and
is also utf-8 encoded. Recently we've discovered a serious "d'oh!"-type
bug which we tracked down to the fact that when we sort by this column
the collation respects locale sorting rules, which is messing up other
parts of the application.

The question is: what is the most efficient way to solve this problem
(the required operation is to sort data using binary "collation" - i.e.
compare byte by byte)? Since this field gets queried a lot it must have
an index. Some of the possible solutions we thought of are: replacing
the varchar type with numeric and do magical transcoding (bad, needs
changes thoughout the application) and inserting spaces after every
character (not as bad, but still requires modifying both the application
and the data). An ideal solution would be to have a
"not-locale-affected-varchar" field type :)


Re: Curious sorting puzzle

From
Tom Lane
Date:
Ivan Voras <ivoras@fer.hr> writes:
> The situation is this: we're using a varchar column to store
> alphanumeric codes which are by themselves 7-bit clean. But we are
> operating under a locale which has its own special collation rules, and
> is also utf-8 encoded. Recently we've discovered a serious "d'oh!"-type
> bug which we tracked down to the fact that when we sort by this column
> the collation respects locale sorting rules, which is messing up other
> parts of the application.

> The question is: what is the most efficient way to solve this problem
> (the required operation is to sort data using binary "collation" - i.e.
> compare byte by byte)? Since this field gets queried a lot it must have
> an index. Some of the possible solutions we thought of are: replacing
> the varchar type with numeric and do magical transcoding (bad, needs
> changes thoughout the application) and inserting spaces after every
> character (not as bad, but still requires modifying both the application
> and the data). An ideal solution would be to have a
> "not-locale-affected-varchar" field type :)

If you're just storing ASCII then I think bytea might work for this.
Do you need any actual text operations (like concatenation), or this
just a store-and-retrieve field?

If you need text ops too then probably the best answer is to make your
own datatype.  It's not that hard --- look at the citext datatype (on
pgfoundry IIRC, or else gborg) for a closely related example.

            regards, tom lane

Re: Curious sorting puzzle

From
Ivan Voras
Date:
Tom Lane wrote:

>>An ideal solution would be to have a
>>"not-locale-affected-varchar" field type :)
>
>
> If you're just storing ASCII then I think bytea might work for this.
> Do you need any actual text operations (like concatenation), or this
> just a store-and-retrieve field?

I've just tested bytea and it looks like a perfect solution - it supports:

- character-like syntax
- indexes
- uses indexes with LIKE 'x%' queries
- SUBSTRING()

That's good enough for us - it seems it's just what we need - a
string-like type with byte collation.

Thanks!