On Tue, 04 Jul 2006, Tom Lane wrote:
> I think the real problem here is that regex matching is the wrong
> tool for the job. Have you looked into a full-text index
> (tsearch2)?
So much to do with so little time...
I've looked into it briefly, but:
- it's complicated;
- it isn't needed - a basic sequential scan is fast enough for the amount
  of data we have (as long as the database chooses a sane query plan);
- we have data in many languages (including ones written in Cyrillic-based
  alphabets) - languages which use different forms of the same word
  depending on context, for example:
      Warszawa
      Warszawy
      Warszawie
      Warszawę
      Warszawą
      Warszawo
  All of the above translate to "Warsaw". So we need to support matching
  parts of words ("warszaw"), which I haven't seen in tsearch2 (maybe
  I've overlooked it). We also have words whose different forms look like
  this: "stół", "stole", "stołu" (Polish for "table") - to find such a
  word we'd have to list every possible form (about 10) or use a regex
  like: 'st[oó][lł]'.
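To illustrate the point outside the database: the Warsaw forms share a common stem, so a plain substring match covers them all, while the "table" forms need a small character-class regex because the letters alternate. A minimal Python sketch (the word lists and stem here are just the examples from this mail, not anything from tsearch2):

```python
import re

# Inflected forms of "Warszawa" (Warsaw) as they appear in Polish text.
warsaw_forms = ["Warszawa", "Warszawy", "Warszawie",
                "Warszawę", "Warszawą", "Warszawo"]

# A case-insensitive substring match on the stem covers every form.
matches_stem = [("warszaw" in w.lower()) for w in warsaw_forms]
print(all(matches_stem))  # every form contains the stem

# Inflected forms of "stół" (table): the o/ó and l/ł letters alternate,
# so a character class per position is needed instead of a fixed stem.
table_forms = ["stół", "stole", "stołu"]
table_re = re.compile(r"st[oó][lł]", re.IGNORECASE)
print(all(table_re.search(w) for w in table_forms))
```

This is exactly the kind of mid-word matching that a word-based full-text index does not naturally support.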
> With something like that, the index operator has at least got the
> correct conceptual model, ie, looking for indexed words. I'm not sure
> if they have any decent statistical support for it :-( but in theory
> that seems doable, whereas regex estimation will always be a crapshoot.
So why estimate regex expressions at all if no real estimation is
possible? Let's make the estimate pessimistic (assume it matches
everything, or everything not null) and the planner will choose better
plans. At least until somebody figures out a better approach.
Regards,
Tometzky
--
Best of prhn - the funniest texts of the Polish Usenet
http://prhn.dnsalias.org/
Chaos always defeats order, because it is better organized.
[ Terry Pratchett ]