Re: [GENERAL] indexed regex select optimisation missing? - Mailing list pgsql-general
From | Stuart Woolford |
---|---|
Subject | Re: [GENERAL] indexed regex select optimisation missing? |
Date | |
Msg-id | 99110510305400.00767@test.macmillan.co.nz Whole thread Raw |
In response to | Re: [GENERAL] indexed regex select optimisation missing? ("Gene Selkov, Jr." <selkovjr@mcs.anl.gov>) |
Responses |
Re: [GENERAL] indexed regex select optimisation missing?
|
List | pgsql-general |
On Fri, 05 Nov 1999, Gene Selkov, Jr. wrote: > OK, let's turn from speculations to facts (have just gotten off my > rear end and verified each).: > > 1. '^whatever.*' and '^whatever' are equivalent regular expressions. yes, sorry, I was aware of this, although I was using .* for clarity and my mind got stuck in 'proper' regex mode where those are needed.., it unfortunately has no effect on the outcome here. > 2. The version of regexp used in postgres is aware of this equivalence. sure seems that way. > 3. Btree index is used in the queries involving anchored expressions: > > emp=> explain select * from ps where ps ~ '^EDTA'; > NOTICE: QUERY PLAN: > > Index Scan using psix on ps (cost=2373.21 rows=1 width=62) > > emp=> explain select * from ps where ps ~ '^EDTA.*'; > NOTICE: QUERY PLAN: > > Index Scan using psix on ps (cost=2373.21 rows=1 width=62) > > (ps is a 250k-row table; the result is returned immediately when > indexed and in about 3 seconds when not) My point is that, while the index (in 6.5.1 and 6.5.2, anyway) is used to locate the start of the scan, the system is then index-scanning the *whole* rest of the table (which takes minutes for my 1.6 million entry table if it is from near the start), as opposed to using a better 'stop term' to stop scanning once the regex will no longer be able to match (ie: the static front of the regex is no longer matching), so the ordered scan is only being half utilised, this makes a MASSIVE difference in performance. For example, say one of the words in the table is 'alongword', and there is also 'alongwords', but no other words with the root of 'alongword' If I do a "select key from inv_word_i where word='alongword'" it will use the btree index on inv_word_i, and locate the one match almost instantly. If I do a "select key from inv_word_i where word~'alongword' it will need to scan all the records (this takes some time, minutes, infact) - as it should!, and would match atleast the two entries detailed above. If I do a 'select key from inv_word_i where word~'^alongword' it uses the index to find 'alongword', then does an index scan of the *whole* rest of the table check all the rest of the entries for regex matching, so it takes a long time, and returns the two entries detailed above, it will take almost as long as the previous query. What it should do is stop as soon as the leftmost part of the regex match no longer matches 'alongword' because, as it is scanning in indexed order, a match is no longer possible. The query will then run at nearly the speed of the first example, while finding the required two entries. This method is extensible to any regex where there is a '^' followed by a length of static match, as soon as the static part does not match in index scan order, the regex can never be matched. This makes a massive difference for searching large indexes of words when you want to match a root words and all extensions of that word (for exmple, window, windows, windowing, windowed, windowless, etc....) - this optimisation (if it is missing or broken) would make postgresql a much more powerful tool for this job for what would seem to be a quite simple addition. > > However, > > 4. Hash index is never used makes a lot off sense, hash indexes do not supply ordering information, and are therefore only usefull for equivanence location, not ordered scanning, which is required for the regex situation. > =========================== > > Observations made with 6.5 on RedHat 5.1. -- ------------------------------------------------------------ Stuart Woolford, stuartw@newmail.net Unix Consultant. Software Developer. Supra Club of New Zealand. ------------------------------------------------------------
pgsql-general by date: