Thread: bitmap scans, btree scans, and tid order

bitmap scans, btree scans, and tid order

From
"Jeffrey W. Baker"
Date:
About this time last year I was holding forth on the value of visiting
the heap in TID order, even when index scan tuples are randomly ordered.
Today I decided to start working on the problem stated in this TODO
item:
       Fetch heap pages matching index entries in sequential order               Rather than randomly accessing heap
pagesbased on index       entries, mark heap pages needing access in a bitmap and do the       lookups in sequential
order.Another method would be to sort       heap ctids matching the index before accessing the heap rows.       
 
I see that Tom has already done the infrastructure work by adding
getmulti, but getmulti isn't used by nodeIndexscan.c, only
nodeBitmapIndexscan.c.  Will btree index scans be executed by creating
in-memory bitmaps in 8.1, or will some scans still be executed the usual
way?  If the former, I'd be wasting time, but in the latter case it
would be worth it.

-jwb


Re: bitmap scans, btree scans, and tid order

From
Tom Lane
Date:
"Jeffrey W. Baker" <jwbaker@acm.org> writes:
> I see that Tom has already done the infrastructure work by adding
> getmulti, but getmulti isn't used by nodeIndexscan.c, only
> nodeBitmapIndexscan.c.  Will btree index scans be executed by creating
> in-memory bitmaps in 8.1, or will some scans still be executed the usual
> way?

We aren't going to remove the existing indexscan behavior, because
bitmap scans lose the ordering of the underlying index.  There are many
situations where that ordering is important.  (See for instance the
recent changes to make MAX/MIN use that behavior.)
        regards, tom lane


Re: bitmap scans, btree scans, and tid order

From
Jeffrey Baker
Date:
Tom Lane wrote:
> "Jeffrey W. Baker" <jwbaker@acm.org> writes:
> 
>>I see that Tom has already done the infrastructure work by adding
>>getmulti, but getmulti isn't used by nodeIndexscan.c, only
>>nodeBitmapIndexscan.c.  Will btree index scans be executed by creating
>>in-memory bitmaps in 8.1, or will some scans still be executed the usual
>>way?
> 
> 
> We aren't going to remove the existing indexscan behavior, because
> bitmap scans lose the ordering of the underlying index.  There are many
> situations where that ordering is important.  (See for instance the
> recent changes to make MAX/MIN use that behavior.)

Would you take a patch that retained the optimized executions of plans 
returning 1 tuple and also fixed the random heap problem?

-jwb


Re: bitmap scans, btree scans, and tid order

From
Neil Conway
Date:
Jeffrey Baker wrote:
> Would you take a patch that retained the optimized executions of plans 
> returning 1 tuple and also fixed the random heap problem?

Can you elaborate on what you're proposing? Obviously sorted b+-tree 
output is important for a lot more than just min()/max(). I don't see an 
obvious way to produce sorted output from a bitmap tree index scan 
without requiring an additional sort step (which would be rather 
pointless -- the whole point of the optimization is to avoid an 
additional sort).

-Neil


Re: bitmap scans, btree scans, and tid order

From
Jeffrey Baker
Date:
Neil Conway wrote:
> Jeffrey Baker wrote:
> 
>> Would you take a patch that retained the optimized executions of plans 
>> returning 1 tuple and also fixed the random heap problem?
> 
> 
> Can you elaborate on what you're proposing? Obviously sorted b+-tree 
> output is important for a lot more than just min()/max(). I don't see an 
> obvious way to produce sorted output from a bitmap tree index scan 
> without requiring an additional sort step (which would be rather 
> pointless -- the whole point of the optimization is to avoid an 
> additional sort).

I understand the importance of returning tuples in index order for many 
plans (although I probably haven't thought of all the cases.  min/max is 
the most obvious; order by + limit is another).  The only problem I'm 
trying to solve is when an indexscan returns a large result, causing the 
heap to be visited in index order, which is to say random order, from 
the disk's perspective.  When I investigated this last year, sorting the 
intermediate result of the index scan in disk order was good for a 
reduction by two-thirds in actual execution time, and sorting the scan 
result in chunks of 1000 tuples was enough to reduce the time by half.

I'm considering one of the following courses of action:

Change nodeIndexscan.c to call index_getmulti, and to handle multiple 
tuples returned.  That code would sort the tuple array and store the 
tuples in the result in disk order.

-or-

Change the planner/executor to use the bitmap scan in all cases where 
index order is unimportant.  From my reading of the current code, the 
bitmap scan is only used in case of a join.

-jwb


Re: bitmap scans, btree scans, and tid order

From
"Jeffrey W. Baker"
Date:
On Mon, 2005-05-16 at 09:53 -0400, Tom Lane wrote:
> Jeffrey Baker <jwbaker@acm.org> writes:
> > Change the planner/executor to use the bitmap scan in all cases where 
> > index order is unimportant.  From my reading of the current code, the 
> > bitmap scan is only used in case of a join.
> 
> This is a fallacy, and I think your concern is largely mistaken.  Have
> you experimented with the cases you are worried about?

Perhaps I have not stated the problem clearly.  Believe me, I have
experimented extensively with this problem.  So here's the statement of
the problem: during an index scan, the heap is visited in index order,
which can be and frequently is random order on the disk.  An index scan
that currently takes 15 minutes on my database takes only 5 when the
tuples are fetched in strict disk order, and takes 8 minutes if the
tuples are fetched in disk order in groups of 1000.  The problem exists
when the scan is expected to return a lot of tuples, but the planner
chooses an index scan anyway.  In many cases, sequential scan would be
faster than index scan + random heap visits.

> It's entirely possible that the current cost model needs adjustment to
> make the planner pick the bitmap scan in more cases.  However, it is
> easy to demonstrate (either by thought-experiment or actual trial) that
> a bitmap scan isn't necessarily a win.

I agree.  There's certainly a threshold below which the bitmap would not
make sense.  It's also possible that changing the btree scan to work in
groups of tuples instead of single tuples would make more sense, which
is why I ventured two different solution to the problem.

>  The overhead of maintaining the
> bitmap is far from zero, and you don't get anything unless the resulting
> pattern of heap page fetches is significantly cleaner than before.

Bitmaps scans naturally come back in TID order.  I realize this isn't
1:1 correspondence with disk order, but it's a good start.  I also like
the way the index scan and heap scan are decoupled in the bitmap code.
It leaves room for more serious optimization of disk access, which is
relatively difficult in the synchronous, one-at-a-time btree code.

> Therefore, a patch that eliminates cost-estimation in favor of trying to
> force one choice or the other is definitely not likely to be received
> favorably ...

That's not the idea.

-jwb




Re: bitmap scans, btree scans, and tid order

From
Hannu Krosing
Date:
On P, 2005-05-15 at 23:58 -0700, Jeffrey Baker wrote:

> I'm considering one of the following courses of action:
> 
> Change nodeIndexscan.c to call index_getmulti, and to handle multiple 
> tuples returned.  That code would sort the tuple array and store the 
> tuples in the result in disk order.
> 
> -or-
> 
> Change the planner/executor to use the bitmap scan in all cases where 
> index order is unimportant.  From my reading of the current code, the 
> bitmap scan is only used in case of a join.

I think the 2nd option is probably the way to go.  Probably not _all_
cases - it's probably cheper to not build a bitmap for cases where the
index returns only a few (or just one) rows.

OTOH, some scheme, where you fill sort_mem-sized memory structure with
tuples, which are fetched in heap order but stored in that structure in
index order could also be an interesting optimisastion for sort-order
preserving btree-index scans. this could also be used for bigger sort as
first step by saving each chunk to disk and then merging them back into
result.

in pseudocode one iteretion of this could go like this

TIDARRAY = ARRAY[1000]
I = 0
FOR TID in FETC_FROM_BTREE(0, 1000):   ARRAY [I] := (TID, I)
SORT_ARRAY_ON_TID()        # this must be much faster than sorting                          # whole tuples for this
methodto be a win                          # over bitmap_scan + sort.                          # using some
self-orderingstructure like                          # btree/rbtree for TIDARRAY is also an option
 
OUTARRAY = ARRAY[1000]
FOR (TID, I) IN TIDARRAY:   OUTARRAY[I] = FETCH_FROM_HEAP(TID)
RETURN OUTARRAY

This can probably not use bitmap scan code, but has the nice property of
doing the disk accesses in file order but still returning the result in
index order.

-- 
Hannu Krosing <hannu@skype.net>



Re: bitmap scans, btree scans, and tid order

From
Tom Lane
Date:
"Jeffrey W. Baker" <jwbaker@acm.org> writes:
> On Mon, 2005-05-16 at 09:53 -0400, Tom Lane wrote:
>> This is a fallacy, and I think your concern is largely mistaken.  Have
>> you experimented with the cases you are worried about?

> Perhaps I have not stated the problem clearly.  Believe me, I have
> experimented extensively with this problem.

Sorry, perhaps I wasn't clear: have you experimented *with CVS tip*?
The current code is certainly capable of choosing either seqscan,
bitmap scan, or traditional index scan depending on the given query.
For example,

regression=# explain analyze select * from tenk1 where unique1 between 100 and 1000;
                  QUERY PLAN
 

--------------------------------------------------------------------------------------------------------------------------Bitmap
HeapScan on tenk1  (cost=9.58..381.53 rows=930 width=244) (actual time=6.185..18.034 rows=901 loops=1)  Recheck Cond:
((unique1>= 100) AND (unique1 <= 1000))  ->  Bitmap Index Scan on tenk1_unique1  (cost=0.00..9.58 rows=930 width=0)
(actualtime=4.522..4.522 rows=901 loops=1)        Index Cond: ((unique1 >= 100) AND (unique1 <= 1000))Total runtime:
23.784ms
 
(5 rows)

regression=# explain analyze select * from tenk1 where unique2 between 100 and 1000;
                   QUERY PLAN
 

-----------------------------------------------------------------------------------------------------------------------------Index
Scanusing tenk1_unique2 on tenk1  (cost=0.00..45.88 rows=805 width=244) (actual time=0.154..14.856 rows=901 loops=1)
IndexCond: ((unique2 >= 100) AND (unique2 <= 1000))Total runtime: 20.331 ms
 
(3 rows)

(The reason these apparently-identical queries get different plans is
that the unique2 column is physically ordered, so the plain indexscan
gets a very high correlation discount.)  There are enable_foo switches
to let you force selection of any one of the three ways for testing
purposes.

> It's also possible that changing the btree scan to work in
> groups of tuples instead of single tuples would make more sense, which
> is why I ventured two different solution to the problem.

My feeling is that that would add a lot of complexity for dubious win.
The reason it's dubious is that it would penalize some cases, for
instance LIMIT-type queries where you aren't going to fetch many tuples
anyway.  I think that at least for the time being (8.1 time frame) we
should leave traditional indexscans alone and concentrate on being sure
we are getting the most we can out of the new bitmap scan code.  Only
after that dust has settled will we have hard facts with which to argue
about whether compromise behaviors might be wins.

I think the work that's most needed at the moment is to test the
bitmap-scan cost model to see if it has much to do with reality...
        regards, tom lane


Re: bitmap scans, btree scans, and tid order

From
"Jeffrey W. Baker"
Date:
On Mon, 2005-05-16 at 14:35 -0400, Tom Lane wrote:
> "Jeffrey W. Baker" <jwbaker@acm.org> writes:
> > It's also possible that changing the btree scan to work in
> > groups of tuples instead of single tuples would make more sense, which
> > is why I ventured two different solution to the problem.
> 
> My feeling is that that would add a lot of complexity for dubious win.
> The reason it's dubious is that it would penalize some cases, for
> instance LIMIT-type queries where you aren't going to fetch many tuples
> anyway.  I think that at least for the time being (8.1 time frame) we
> should leave traditional indexscans alone and concentrate on being sure
> we are getting the most we can out of the new bitmap scan code.  Only
> after that dust has settled will we have hard facts with which to argue
> about whether compromise behaviors might be wins.

I agree.  I'll look at how my workload behaves with CVS code.  I wasn't
proposing this for 8.1 inclusion, and the TODO isn't marked for 8.1.

> I think the work that's most needed at the moment is to test the
> bitmap-scan cost model to see if it has much to do with reality...

Alright.

-jwb


Re: bitmap scans, btree scans, and tid order

From
Greg Stark
Date:
Tom Lane <tgl@sss.pgh.pa.us> writes:

> "Jeffrey W. Baker" <jwbaker@acm.org> writes:
> > I see that Tom has already done the infrastructure work by adding
> > getmulti, but getmulti isn't used by nodeIndexscan.c, only
> > nodeBitmapIndexscan.c.  Will btree index scans be executed by creating
> > in-memory bitmaps in 8.1, or will some scans still be executed the usual
> > way?
> 
> We aren't going to remove the existing indexscan behavior, because
> bitmap scans lose the ordering of the underlying index.  There are many
> situations where that ordering is important.  (See for instance the
> recent changes to make MAX/MIN use that behavior.)

Hm. There are other circumstances where the ordering doesn't matter. When
there's another unrelated ORDER BY clause or merge join wrapped around the
index scan for example.

This suggests one new 8.1 optimization strategy may be to add strategic no-op
OR clauses to cause 8.1 to use a bitmapOr node.

For example something like this where "flag" isn't very selective (say 25%)
might run more slowly than a sequential scan because of the random access
pattern.

SELECT id,name,flag FROM tabWHERE flagORDER BY name

But adding a no-op bitmapOr node like:

SELECT id,name,flag FROM tabWHERE flag   OR indexed_never_true_flagORDER BY name

Might run faster, perhaps even more quickly than the sequential scan because
the bitmap avoids the random access pattern but doesn't have to read the whole
table.

-- 
greg



Re: bitmap scans, btree scans, and tid order

From
Mike Rylander
Date:
On 5/16/05, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> regression=# explain analyze select * from tenk1 where unique1 between 100 and 1000;
>                                                         QUERY PLAN
>
--------------------------------------------------------------------------------------------------------------------------
>  Bitmap Heap Scan on tenk1  (cost=9.58..381.53 rows=930 width=244) (actual time=6.185..18.034 rows=901 loops=1)
>    Recheck Cond: ((unique1 >= 100) AND (unique1 <= 1000))
>    ->  Bitmap Index Scan on tenk1_unique1  (cost=0.00..9.58 rows=930 width=0) (actual time=4.522..4.522 rows=901
loops=1)
>          Index Cond: ((unique1 >= 100) AND (unique1 <= 1000))
>  Total runtime: 23.784 ms
> (5 rows)
>
> regression=# explain analyze select * from tenk1 where unique2 between 100 and 1000;
>                                                          QUERY PLAN
>
-----------------------------------------------------------------------------------------------------------------------------
>  Index Scan using tenk1_unique2 on tenk1  (cost=0.00..45.88 rows=805 width=244) (actual time=0.154..14.856 rows=901
loops=1)
>    Index Cond: ((unique2 >= 100) AND (unique2 <= 1000))
>  Total runtime: 20.331 ms
> (3 rows)
>

Tom (or anyone with some round tuits and CVS-tip savy), if you have a
chance at some point would you post the non-bitmap version of the
query for tenk2 from above?  I'd be very interested to see if the
bitmap forced TID order fetch actually does help, or if it's
outweighed by the bitmap setup overhead.

TIA

--
Mike Rylander
mrylander@gmail.com
GPLS -- PINES Development
Database Developer
http://open-ils.org


Re: bitmap scans, btree scans, and tid order

From
Tom Lane
Date:
Jeffrey Baker <jwbaker@acm.org> writes:
> Change the planner/executor to use the bitmap scan in all cases where 
> index order is unimportant.  From my reading of the current code, the 
> bitmap scan is only used in case of a join.

This is a fallacy, and I think your concern is largely mistaken.  Have
you experimented with the cases you are worried about?

It's entirely possible that the current cost model needs adjustment to
make the planner pick the bitmap scan in more cases.  However, it is
easy to demonstrate (either by thought-experiment or actual trial) that
a bitmap scan isn't necessarily a win.  The overhead of maintaining the
bitmap is far from zero, and you don't get anything unless the resulting
pattern of heap page fetches is significantly cleaner than before.
Therefore, a patch that eliminates cost-estimation in favor of trying to
force one choice or the other is definitely not likely to be received
favorably ...
        regards, tom lane


Bitmap scan cost model (was Re: bitmap scans, btree scans, and tid order)

From
"Jeffrey W. Baker"
Date:
On Mon, 2005-05-16 at 14:35 -0400, Tom Lane wrote:
> "Jeffrey W. Baker" <jwbaker@acm.org> writes:
> > On Mon, 2005-05-16 at 09:53 -0400, Tom Lane wrote:
> >> This is a fallacy, and I think your concern is largely mistaken.  Have
> >> you experimented with the cases you are worried about?
> 
> > Perhaps I have not stated the problem clearly.  Believe me, I have
> > experimented extensively with this problem.
> 
> Sorry, perhaps I wasn't clear: have you experimented *with CVS tip*?
> The current code is certainly capable of choosing either seqscan,
> bitmap scan, or traditional index scan depending on the given query.
> For example,

[...]

> I think the work that's most needed at the moment is to test the
> bitmap-scan cost model to see if it has much to do with reality...

The cost model seems to work pretty darn well, as a matter of fact.  

This table is in-order by id, in-order by date, and randomly ordered by
"random".

scratch=# \d test    Table "public.test"Column |  Type   | Modifiers 
--------+---------+-----------id     | integer | date   | date    | random | integer | 
Indexes:   "test_pk" UNIQUE, btree (id)   "test_date_idx" btree (date)   "test_rand_idx" btree (random)

scratch=# select count(1) from test; count   
----------10000000

The size of the tables and indexes is about 1G, or double physical
memory.  I tested by selecting on the random column.  For 100 random
values, I selected the row that matches exactly, then in ranges of 1000,
10000, 100000, and 1000000.  These touch roughly 5, 50, 500, and 5000
tuples, respectively.  The test is repeated for index scan, bitmap scan,
and sequential scan.

Example query:
select count(1)          from test          where random >= 1429076987           and random < 1429076987 + 10000;
                                                         QUERY PLAN
    
 

------------------------------------------------------------------------------------------------------------------------------Aggregate
(cost=171.31..171.31 rows=1 width=0) (actual time=0.996..1.000 rows=1 loops=1)  ->  Bitmap Heap Scan on test
(cost=2.26..171.20rows=43 width=0) (actual time=0.145..0.829 rows=42 loops=1)        Recheck Cond: ((random >=
1429076987)AND (random < 1429086987))        ->  Bitmap Index Scan on test_rand_idx  (cost=0.00..2.26 rows=43 width=0)
(actualtime=0.063..0.063 rows=42 loops=1)              Index Cond: ((random >= 1429076987) AND (random <
1429086987))Totalruntime: 1.114 ms
 

100 queries returning |  1  |  5  |  50  |  500  |  5000  |
----------------------+-----+-----+------+-------+--------+
bitmap scan           |  1s |  2s |  13s | 1m41s | 11m16s |
index scan            |  1s |  1s |  12s | 2m6s  | 24m19s |
----------------------+-----+-----+------+-------+--------+
sequential scan       |             28m20s                | 

The planner uses index scan for the first two columns, and chooses
bitmap scan for the rest.  This would seem to be a reasonable decision,
given the results.  The only real problem with the cost model in further
testing is that it doesn't switch to seq scan quickly enough.  If bitmap
scan is disabled, the planner will pick index scan even in cases when
sequential scan is 10x faster:

scratch=# set enable_bitmapscan to off;
SET
scratch=# explain analyze select count(1) from test where random >= 1429076987  and  random < 1429076987 + 10000000;
                                                           QUERY PLAN
            
 

--------------------------------------------------------------------------------------------------------------------------------------------Aggregate
(cost=170142.03..170142.03 rows=1 width=0) (actual time=177419.182..177419.185 rows=1 loops=1)  ->  Index Scan using
test_rand_idxon test  (cost=0.00..170034.11 rows=43167 width=0) (actual time=0.035..177255.696 rows=46764 loops=1)
 Index Cond: ((random >= 1429076987) AND (random < 1439076987))Total runtime: 177419.302 ms
 
(4 rows)

scratch=# set enable_indexscan to off;
SET
scratch=# explain analyze select count(1) from test where random >= 1429076987  and  random < 1429076987 + 10000000;
                                                QUERY PLAN                                                      
 

----------------------------------------------------------------------------------------------------------------------Aggregate
(cost=204165.55..204165.55 rows=1 width=0) (actual time=12334.042..12334.045 rows=1 loops=1)  ->  Seq Scan on test
(cost=0.00..204057.62rows=43167 width=0) (actual time=17.436..12174.150 rows=46764 loops=1)        Filter: ((random >=
1429076987)AND (random < 1439076987))Total runtime: 12334.156 ms
 
(4 rows)

Obviously in this case sequential scan was (would have been) a huge win.
Incrementing random_page_cost from 4 (the default) to 5 causes the
planner to make a better decision.

-jwb




On Tue, 17 May 2005 22:12:17 -0700, "Jeffrey W. Baker" <jwbaker@acm.org>
wrote:
>Incrementing random_page_cost from 4 (the default) to 5 causes the
>planner to make a better decision.

We have such a low default random_page_cost primarily to mask other
problems in the optimizer, two of which are

. multi-column index correlation

. interpolation between min_IO_Cost and max_IO_cost which approximates
max_IO_cost too fast.

ServusManfred


"Jeffrey W. Baker" <jwbaker@acm.org> writes:
> ... If bitmap
> scan is disabled, the planner will pick index scan even in cases when
> sequential scan is 10x faster:

> scratch=# set enable_bitmapscan to off;
> SET
> scratch=# explain analyze select count(1) from test where random >= 1429076987  and  random < 1429076987 + 10000000;
>                                                                  QUERY PLAN
                    
 
>
--------------------------------------------------------------------------------------------------------------------------------------------
>  Aggregate  (cost=170142.03..170142.03 rows=1 width=0) (actual time=177419.182..177419.185 rows=1 loops=1)
>    ->  Index Scan using test_rand_idx on test  (cost=0.00..170034.11 rows=43167 width=0) (actual
time=0.035..177255.696rows=46764 loops=1)
 
>          Index Cond: ((random >= 1429076987) AND (random < 1439076987))
>  Total runtime: 177419.302 ms
> (4 rows)

> scratch=# set enable_indexscan to off;
> SET
> scratch=# explain analyze select count(1) from test where random >= 1429076987  and  random < 1429076987 + 10000000;
>                                                       QUERY PLAN

>
----------------------------------------------------------------------------------------------------------------------
>  Aggregate  (cost=204165.55..204165.55 rows=1 width=0) (actual time=12334.042..12334.045 rows=1 loops=1)
>    ->  Seq Scan on test  (cost=0.00..204057.62 rows=43167 width=0) (actual time=17.436..12174.150 rows=46764
loops=1)
>          Filter: ((random >= 1429076987) AND (random < 1439076987))
>  Total runtime: 12334.156 ms
> (4 rows)

> Obviously in this case sequential scan was (would have been) a huge win.
> Incrementing random_page_cost from 4 (the default) to 5 causes the
> planner to make a better decision.

But to get the estimated cost ratio to match up with the actual cost
ratio, we'd have to raise random_page_cost to nearly 70, which is a bit
hard to credit.  What was the platform being tested here?
        regards, tom lane


Re: Bitmap scan cost model (was Re: bitmap scans, btree

From
"Jeffrey W. Baker"
Date:
On Wed, 2005-05-18 at 11:27 -0400, Tom Lane wrote:
> "Jeffrey W. Baker" <jwbaker@acm.org> writes:
> > Obviously in this case sequential scan was (would have been) a huge win.
> > Incrementing random_page_cost from 4 (the default) to 5 causes the
> > planner to make a better decision.
> 
> But to get the estimated cost ratio to match up with the actual cost
> ratio, we'd have to raise random_page_cost to nearly 70, which is a bit
> hard to credit.  What was the platform being tested here?

i686 Linux 2.6.8 with a single 7200RPM SATA disk.

-jwb