Thread: bitmap-index-scan faster than seq-scan on full-table-scan (gin index)

bitmap-index-scan faster than seq-scan on full-table-scan (gin index)

From

Jesper Krogh

Date:

31 May 2010, 16:53:10

Hi.

The test data a set of generated terms using this perl-script
http://shrek.krogh.cc/~jesper/build-test.pl
and http://shrek.krogh.cc/~jesper/words.txt

I have generated a test dataset with an average tsvector length of
around 250 and 200.000 tuples in the dataset.

Conceptually searching for the "full dataset" would always be fastest
solved by a seq-scan. The query planner enforces this so much, so not
even "enable_seqscan=off" can convince it to to something else. So in
the next two explain analyze I compare a query searching 99% of the
table up with a seqscan. The 98% case is enforced to be a 
"bitmap-index-scan"
I would expect the runtime of the seqscan to be shortest and the 
bitmap-index-scan
to be quite a lot larger, due to "random access" and the fact that the 
index-data
also needs to be read in from disk.

Bot runs are run with a freshly started postgresql backend and
"echo 3 > /proc/sys/vm/drop_caches" so the os caching should not come 
into play.

ftstest=# EXPLAIN ANALYZE select id from ftstest where body_fts @@ 
to_tsquery('commonterm98');                                                                 QUERY 
PLAN

--------------------------------------------------------------------------------------------------------------------------------------------
BitmapHeap Scan on ftstest  (cost=6579.81..992733.57 rows=195976 
 
width=4) (actual time=4813.258..7081.277 rows=195976 loops=1)   Recheck Cond: (body_fts @@
to_tsquery('commonterm98'::text))  ->  Bitmap Index Scan on ftstest_gist_idx  (cost=0.00..6530.82 
 
rows=195976 width=0) (actual time=4787.513..4787.513 rows=195976 loops=1)         Index Cond: (body_fts @@
to_tsquery('commonterm98'::text))Total runtime: 7389.346 ms
 
(5 rows)

ftstest=# set enable_bitmapscan = off;
WARNING:  terminating connection because of crash of another server process
DETAIL:  The postmaster has commanded this server process to roll back 
the current transaction and exit, because another server process exited 
abnormally and possibly corrupted shared memory.
HINT:  In a moment you should be able to reconnect to the database and 
repeat your command.
server closed the connection unexpectedly    This probably means the server terminated abnormally    before or while
processingthe request.
 
The connection to the server was lost. Attempting reset: Succeeded.
ftstest=# set enable_bitmapscan = off;
SET
ftstest=# EXPLAIN ANALYZE select id from ftstest where body_fts @@ 
to_tsquery('commonterm98');                                                      QUERY PLAN
----------------------------------------------------------------------------------------------------------------------
SeqScan on ftstest  (cost=0.00..1006314.00 rows=195976 width=4) 
 
(actual time=96.077..60092.080 rows=195976 loops=1)   Filter: (body_fts @@ to_tsquery('commonterm98'::text)) Total
runtime:60436.556 ms
 
(3 rows)

So searching the full table via a bitmap-index-scan is actually 9 times
cheaper than a seq-scan.  (same on 9.0b1 and 8.4).

Digging more into it reveals that the body_fts tsvector is indeed needed for
the "filter" in the SeqScan. The tsvector data is stored in a TOAST 
table and
the in the bitmap-index-scan case it only needs to read in the main table
for checking visibillity. In the end it translates to reading in 1.4GB of
TOAST-data vs. reading in 34MB of table data.

Thinking a bit, then I dont think this is a particular rare case, 
allthough the
ratio between the tables may be a real cornercase. The ratio is not 1:33 
in the
dataset that looks like the production dataset, but more 1:10, but in 
all cases
in "production" there would be a much higher cache-hit ratio on the 
gin-index
and the main table pages than on the TOAST table, so even with a ratio 
of 1:1
there most likely would be a real-world benefit.

Would it be possible to implement the "Filtering" using the gin-index and
a subsequent visibillity-check as on the index-scan?

The same end up being the case for queries ordered by btree indexes and
"filtered" by gin-indexes.

Jesper
-- 
Jesper

Re: bitmap-index-scan faster than seq-scan on full-table-scan (gin index)

From

Tom Lane

Date:

31 May 2010, 17:09:39

Jesper Krogh <jesper@krogh.cc> writes:
> Conceptually searching for the "full dataset" would always be fastest
> solved by a seq-scan. The query planner enforces this so much, so not
> even "enable_seqscan=off" can convince it to to something else.
> ...
> Would it be possible to implement the "Filtering" using the gin-index and
> a subsequent visibillity-check as on the index-scan?

You're failing to make any sense whatsoever.  If you're reading the full
dataset, there is no filter condition.  If there is a potentially
indexable filter condition, the planner will certainly consider that.

Personally I think the issue here has got more to do with the
non-immutability of the single-argument form of to_tsquery, which means
it gets re-evaluated at every row during a seqscan.  Do your results
change if you work with to_tsquery('english', ...)  (or whatever your
default TS config is)?
        regards, tom lane

Re: bitmap-index-scan faster than seq-scan on full-table-scan (gin index)

From

Jesper Krogh

Date:

31 May 2010, 18:09:46

On 2010-05-31 22:09, Tom Lane wrote:
> Jesper Krogh<jesper@krogh.cc>  writes:
>    
>> Conceptually searching for the "full dataset" would always be fastest
>> solved by a seq-scan. The query planner enforces this so much, so not
>> even "enable_seqscan=off" can convince it to to something else.
>> ...
>> Would it be possible to implement the "Filtering" using the gin-index and
>> a subsequent visibillity-check as on the index-scan?
>>      
> You're failing to make any sense whatsoever.  If you're reading the full
> dataset, there is no filter condition.  If there is a potentially
> indexable filter condition, the planner will certainly consider that.
>    
Yes, you're totally right on that (about making sense).

But that is because of the "simplified" use-case described. A more elaborate
description ..
I have a table with has a set of colums attached to it typically
used for "sorting" these columns may also be inferred on the table
by a typical join condition and a document that is fts-indexed.
So the actual use-case is that people query for:

"give me the 1000 most recent documents matching <term>"

Term may in some cases be hugely trivial, only filtering away 0.001% of the
dataset resulting in a "index-scan on a btree date index" filtering on the
tsvector column for <term>".

Term may also be something really specific only returning a single
or a few documents and just pushing a post-sorting to get the ordering.

But in the case where the query-planner falls over to a "index-scan"
on one of the btree-indices it ends up reading over from the TOAST data.

Will the planner consider doing the index-scan(forward or backwards)
on a btree-index and filter using a gin-index instead of filtering directly
on the tuple-data?
(I haven't been able to enforce an query-plan that looks like that).

> Personally I think the issue here has got more to do with the
> non-immutability of the single-argument form of to_tsquery, which means
> it gets re-evaluated at every row during a seqscan.  Do your results
> change if you work with to_tsquery('english', ...)  (or whatever your
> default TS config is)?
>    

It is english..  and yes it did indeed change the results. So the 
expensive case
dropped from ~60s to ~28s and the cheap case from ~7.3s to ~4.3s, that
is quite surprising that such "small" change can have that huge impact. The
single-argument version should be forbidden.

But the performance ratio between the two cases is still the same.

The test was actually run with the preliminary gincostestimate-patch from
Oleg Bartunov so the actual cost estimates match way better now, but that
should not impact the actual runtime.

Thanks

-- 
Jesper