Thread: Math and logic mistakes in tsquery_opr_selec

From: Tom Lane
While reflecting on
http://archives.postgresql.org/pgsql-performance/2012-09/msg00030.php
I discovered that tsquery selectivity is capable of concluding that
"word:*" matches less stuff than "word":

pub=# explain analyze select * from publications_test where to_tsvector('simple', title) @@ to_tsquery('simple', 'database');
                                                           QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on publications_test  (cost=53.27..5717.00 rows=3776 width=177) (actual time=22.359..57.078 rows=3885 loops=1)
   Recheck Cond: (to_tsvector('simple'::regconfig, title) @@ '''database'''::tsquery)
   ->  Bitmap Index Scan on ii  (cost=0.00..52.32 rows=3776 width=0) (actual time=17.908..17.908 rows=3885 loops=1)
         Index Cond: (to_tsvector('simple'::regconfig, title) @@ '''database'''::tsquery)
 Total runtime: 73.254 ms
(5 rows)

pub=# explain analyze select * from publications_test where to_tsvector('simple', title) @@ to_tsquery('simple', 'database:*');
                                                           QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on publications_test  (cost=41.19..3021.55 rows=1185 width=177) (actual time=49.031..101.935 rows=6448 loops=1)
   Recheck Cond: (to_tsvector('simple'::regconfig, title) @@ '''database'':*'::tsquery)
   ->  Bitmap Index Scan on ii  (cost=0.00..40.89 rows=1185 width=0) (actual time=43.193..43.193 rows=6448 loops=1)
         Index Cond: (to_tsvector('simple'::regconfig, title) @@ '''database'':*'::tsquery)
 Total runtime: 127.576 ms
(5 rows)

Note the smaller estimated rowcount for the second example.  This is
patently ridiculous of course, since "database" must be included in
what matches "database:*".

On investigation it appears that I made multiple logical errors in
commit 97532f7c29468010b87e40a04f8daa3eb097f654, which I have to admit
I didn't think about very hard because it seemed simple.  What the code
currently does for a prefix pattern is to add up the frequencies of MCVs
that match the prefix pattern ("database" is an MCV in this example),
as well as the total frequency of all MCVs, and then compute
selec = matched / allmcvs;

That looks correct if you don't stop to think, but it isn't.
It is equivalent to
selec = matched + (1 - allmcvs) * (matched / allmcvs);

that is, we're taking the "matched" frequency as-is, and then assuming
that matched / allmcvs is an appropriate selectivity estimate for the
non-MCV population.  But doing that overweights the more common MCVs.
What we should be doing, and what the comparable and better-tested code
in histogram_selectivity() actually does do, is weight each MCV equally;
there's no reason to assume that more-common MCVs are more
representative of the rest of the lexeme population than less-common
MCVs.  So the calculation should be more like
selec = matched + (1 - allmcvs) * (n_matched / n_mcvs);

The other problem with this math is that simply adding up the
frequencies is the wrong thing, because these aren't most common
*values*, they are most common *elements*.  Any given table row can
contain several different MCEs, so we shouldn't just add the
probabilities as if they were mutually-exclusive events.  We need
to treat them as independent events, so that the summing looks more
like
matched += t->frequency - matched * t->frequency;

(The reason my example comes out so obviously wrong is that allmcvs
actually sums to more than 1 without this correction, so that the
estimate becomes something less than the value of "matched".)

Lastly, the code ends by clamping its estimate to be at least
DEFAULT_TS_MATCH_SEL:
           /*
            * In any case, never believe that a prefix match has selectivity
            * less than DEFAULT_TS_MATCH_SEL.
            */
           selec = Max(DEFAULT_TS_MATCH_SEL, selec);

On reflection, that seems like a horrid idea.  We should probably not
believe the selectivity is exactly zero if the pattern chances to match
none of the MCEs, but setting it equal to the default-for-no-statistics
is way too high.  I'm tempted to use DEFAULT_TS_MATCH_SEL/100 here, but
I wonder if anyone has an idea for a more principled minimum estimate.
In particular, does it make sense to consider the number of MCEs we
have, and/or the length of the pattern?  Having more MCEs seems to make
it less likely that we are underestimating the match frequency, and
longer patterns should match less stuff, too.  But I'm not sure what
to do with those intuitions.

Comments?
        regards, tom lane