Re: 7.3 no longer using indexes for LIKE queries - Mailing list pgsql-general
From | Matthew Gabeler-Lee |
---|---|
Subject | Re: 7.3 no longer using indexes for LIKE queries |
Date | |
Msg-id | ABABFB80F35AD311848B0090279918EF010B9B64@ZYCOSNT2.hq.zycos.com |
In response to | 7.3 no longer using indexes for LIKE queries (Matthew Gabeler-Lee <mgabelerlee@zycos.com>) |
List | pgsql-general |
Looking at the list archives, it looks like there are some general classes of problems:

1) Multibyte charsets; not knowing much about these, I'd guess that initially none of them are 'sane'.

2) A longer string sorting less than a shorter string ('ABC\NNN' < 'ABC'); this can be tested by iterating through bytes (or potentially pairs of bytes, to catch characters that need two special bytes as a digraph or something) appended to a base string and seeing if any of them trigger the longer-less condition.

3) Characters that are ignored in a sort (e.g. 'ABC\NNNDEF' sorts the same as 'ABCDEF'); again testable by iterating through bytes or pairs of bytes looking for the condition.

4) Accent folding; I'm not entirely sure LIKE is supposed to do this. For the rest of this message I'm going to pretend that the LIKE operator shouldn't fold accented characters.

Employing extreme paranoia, and assuming great ignorance of any specialties of the locale being tested, I think this should cover all the bases:

```
// pglub(x) = postgres' upper bound string for the like optimization on 'x%'
// <, >, = mean sorts less than, etc.
report insane if locale is multibyte
for every 2-byte string $a
    let $b = pglub($a)
    for every 2-byte string $c
        report insane if "$a$c" < $a or "$a$c" > $b
for some 4-byte string(s) of plain characters $d
    for every 2-byte string $e
        report insane if $d = "$d[0,1]$e$d[2,3]"
else report sane
```

(A C sketch of this probe appears below, after the quoted message.)

On a fast machine, this would probably take a couple of minutes [1] to run per locale, so it would definitely be a one-time thing, not a compile-time thing, but it would be practical to run on a one-time or per-release basis to generate a source file listing the sane locales.

[1] The two nested loops over 2-byte strings amount to iterating through all 2^32 combinations. It takes my PC a few seconds just to do 'for(unsigned n=0;n<0xffffffff;++n);', so I extrapolated an estimate for the work beyond the bare counting. The second loop runs so many fewer times that its runtime is in the noise.

With some slight modifications, this sanity-finding routine could be used to generate tables for use by a function that finds the proper bounds for the LIKE optimization on locales that suffer only from problem #2 (a sketch of such a bound finder also appears below). Yes, this embeds knowledge about locales into postgres, but it does so through exhaustive examination, not through personal expertise.

An addendum on accent folding: to my ignorant mind, characters with different accents are different characters. My French teacher certainly marked me off if I got the accents wrong. It seems to me that the most common place one wants to think about this is in full-text searching and the like. In that case, maybe I'm daft, but perhaps the thing to do is to create a functional index, where the function being indexed strips the accents off characters. Does the SQL spec have anything to say about accent folding in comparisons?

-Matt

-----Original Message-----
From: Bruce Momjian [mailto:pgman@candle.pha.pa.us]

I think we do know the C locale is OK for this. It is just the other ones we don't know about, and we don't even know. Is there some test we can write for this?
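To make the probe concrete, here is a minimal C sketch of the same idea, assuming a single-byte locale and using setlocale() and strcoll() for the comparisons. The helper names (crude_upper_bound, locale_is_sane) are hypothetical; pglub() is approximated by simply bumping the last byte of the prefix (skipping 0xFF, which a real bound finder must handle by truncating and carrying), and the loops are cut down from 2-byte strings to single bytes so the sketch runs in well under a second:

```c
#include <locale.h>
#include <stdio.h>
#include <string.h>

/*
 * Sketch of the per-locale sanity probe described in the post,
 * assuming a single-byte locale.  Helper names are hypothetical.
 */

/* Crude stand-in for pglub(): copy the prefix, bump its last byte. */
static void crude_upper_bound(const char *prefix, char *out)
{
    size_t len = strlen(prefix);

    memcpy(out, prefix, len + 1);
    out[len - 1]++;
}

static int locale_is_sane(const char *locname)
{
    char a[4], b[4], s[8];

    if (setlocale(LC_COLLATE, locname) == NULL)
        return 0;                   /* unknown locale: call it insane */

    /* Problem #2: does appending any byte escape the [a, b) range? */
    for (int c1 = 1; c1 < 0xFF; c1++)   /* 0xFF skipped, see above */
    {
        snprintf(a, sizeof(a), "A%c", c1);
        crude_upper_bound(a, b);

        for (int c2 = 1; c2 <= 0xFF; c2++)
        {
            snprintf(s, sizeof(s), "A%c%c", c1, c2);
            if (strcoll(s, a) < 0 || strcoll(s, b) >= 0)
                return 0;
        }
    }

    /* Problem #3: is any single byte collated as if it weren't there? */
    for (int c = 1; c <= 0xFF; c++)
    {
        snprintf(s, sizeof(s), "AB%cCD", c);
        if (strcoll(s, "ABCD") == 0)
            return 0;
    }

    return 1;
}

int main(int argc, char **argv)
{
    const char *loc = (argc > 1) ? argv[1] : "C";

    printf("%s: %s\n", loc, locale_is_sane(loc) ? "sane" : "insane");
    return 0;
}
```

The exhaustive version described in the post would replace the single-byte loops with loops over all 2-byte strings, which is where the couple-of-minutes-per-locale estimate comes from.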
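And for the table-generating variant mentioned above, a similarly hedged sketch of a strcoll()-validated bound finder: bump the last byte of the prefix until the candidate collates above the prefix, truncating and carrying on overflow. The name like_upper_bound and its interface are hypothetical (it mirrors the general shape of PostgreSQL's make_greater_string), and note the caveat that collating above the prefix alone does not bound every extension on an 'insane' locale — detecting exactly that is what the probe above is for:

```c
#include <locale.h>
#include <stdio.h>
#include <string.h>

/*
 * Hypothetical sketch: find a string that collates above the prefix,
 * for use as the upper bound of the LIKE 'prefix%' range.  Bump the
 * last byte; on overflow (0xFF), truncate it and carry into the
 * previous byte.  Returns 1 on success, 0 if no bound was found.
 */
static int like_upper_bound(const char *prefix, char *out, size_t outsz)
{
    size_t len = strlen(prefix);

    if (len + 1 > outsz)
        return 0;
    memcpy(out, prefix, len + 1);

    while (len > 0)
    {
        unsigned char *last = (unsigned char *) &out[len - 1];

        while (*last < 0xFF)
        {
            (*last)++;
            /* Caveat: sorting above the prefix does not by itself
             * bound every extension on an 'insane' locale. */
            if (strcoll(out, prefix) > 0)
                return 1;
        }
        out[--len] = '\0';      /* overflowed: truncate and carry */
    }
    return 0;
}

int main(void)
{
    char bound[16];

    setlocale(LC_COLLATE, "C"); /* assumption: a single-byte locale */
    if (like_upper_bound("ABC", bound, sizeof(bound)))
        printf("upper bound for LIKE 'ABC%%' is '%s'\n", bound);
    return 0;
}
```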