Math and logic mistakes in tsquery_opr_selec - Mailing list pgsql-hackers
From: Tom Lane
Subject: Math and logic mistakes in tsquery_opr_selec
Msg-id: 24846.1347382157@sss.pgh.pa.us
List: pgsql-hackers
While reflecting on http://archives.postgresql.org/pgsql-performance/2012-09/msg00030.php I discovered that tsquery selectivity is capable of concluding that "word:*" matches less stuff than "word":

pub=# explain analyze select * from publications_test where to_tsvector('simple', title) @@ to_tsquery('simple', 'database');
                                   QUERY PLAN
---------------------------------------------------------------------------------
 Bitmap Heap Scan on publications_test  (cost=53.27..5717.00 rows=3776 width=177) (actual time=22.359..57.078 rows=3885 loops=1)
   Recheck Cond: (to_tsvector('simple'::regconfig, title) @@ '''database'''::tsquery)
   ->  Bitmap Index Scan on ii  (cost=0.00..52.32 rows=3776 width=0) (actual time=17.908..17.908 rows=3885 loops=1)
         Index Cond: (to_tsvector('simple'::regconfig, title) @@ '''database'''::tsquery)
 Total runtime: 73.254 ms
(5 rows)

pub=# explain analyze select * from publications_test where to_tsvector('simple', title) @@ to_tsquery('simple', 'database:*');
                                   QUERY PLAN
---------------------------------------------------------------------------------
 Bitmap Heap Scan on publications_test  (cost=41.19..3021.55 rows=1185 width=177) (actual time=49.031..101.935 rows=6448 loops=1)
   Recheck Cond: (to_tsvector('simple'::regconfig, title) @@ '''database'':*'::tsquery)
   ->  Bitmap Index Scan on ii  (cost=0.00..40.89 rows=1185 width=0) (actual time=43.193..43.193 rows=6448 loops=1)
         Index Cond: (to_tsvector('simple'::regconfig, title) @@ '''database'':*'::tsquery)
 Total runtime: 127.576 ms
(5 rows)

Note the smaller estimated rowcount for the second example.  This is patently ridiculous of course, since "database" must be included in what matches "database:*".  On investigation it appears that I made multiple logical errors in commit 97532f7c29468010b87e40a04f8daa3eb097f654, which I have to admit I didn't think about very hard because it seemed simple.
What the code currently does for a prefix pattern is to add up the frequencies of MCVs that match the prefix pattern ("database" is an MCV in this example), as well as the total frequency of all MCVs, and then compute

    selec = matched / allmcvs;

That looks correct if you don't stop to think, but it isn't.  It is equivalent to

    selec = matched + (1 - allmcvs) * (matched / allmcvs);

that is, we're taking the "matched" frequency as-is, and then assuming that matched / allmcvs is an appropriate selectivity estimate for the non-MCV population.  But doing that overweights the more common MCVs.  What we should be doing, and what the comparable and better-tested code in histogram_selectivity() actually does do, is weight each MCV equally; there's no reason to assume that more-common MCVs are more representative of the rest of the lexeme population than less-common MCVs.  So the calculation should be more like

    selec = matched + (1 - allmcvs) * (n_matched / n_mcvs);

The other problem with this math is that simply adding up the frequencies is the wrong thing, because these aren't most common *values*, they are most common *elements*.  Any given table row can contain several different MCEs, so we shouldn't just add the probabilities as if they were mutually-exclusive events.  We need to treat them as independent events, so that the summing looks more like

    matched += t->frequency - matched * t->frequency;

(The reason my example comes out so obviously wrong is that allmcvs actually sums to more than 1 without this correction, so that the estimate becomes something less than the value of "matched".)

Lastly, the code ends by clamping its estimate to be at least DEFAULT_TS_MATCH_SEL:

    /*
     * In any case, never believe that a prefix match has selectivity
     * less than DEFAULT_TS_MATCH_SEL.
     */
    selec = Max(DEFAULT_TS_MATCH_SEL, selec);

On reflection, that seems like a horrid idea.
We should probably not believe the selectivity is exactly zero if the pattern chances to match none of the MCEs, but setting it equal to the default-for-no-statistics is way too high.  I'm tempted to use DEFAULT_TS_MATCH_SEL/100 here, but I wonder if anyone has an idea for a more principled minimum estimate.  In particular, does it make sense to consider the number of MCEs we have, and/or the length of the pattern?  Having more MCEs seems to make it less likely that we are underestimating the match frequency, and longer patterns should match less stuff, too.  But I'm not sure what to do with those intuitions.  Comments?

			regards, tom lane