Thread: performance of bitmap scans in nested loop joins

performance of bitmap scans in nested loop joins

From
"Sergey E. Koposov"
Date:
Hello All,

I'd like to report about suprising (for me) results of performance testing of
bitmap  indexes in nested loop join plans. 

I have the table q3c

test=# \d q3c    Table "public.q3c"Column |  Type  | Modifiers 
--------+--------+-----------ipix   | bigint | errbox | box    | ra     | real   | dec    | real   | 
Indexes:   "ipix_idx" UNIQUE, btree (ipix)

####################

test=# SELECT count(*) from q3c; count  
---------3000000
(1 row)


First, two join plans with just simple index scan: 

############# 1) 

test=# EXPLAIN ANALYZE SELECT * FROM q3c,q3c AS q3cs WHERE 
q3c.ipix>=q3cs.ipix-3 AND q3c.ipix<=q3cs.ipix+3;                                                           QUERY PLAN
                                                         
 

-----------------------------------------------------------------------------------------------------------------------------------Nested
Loop (cost=0.01..49059519577.21 rows=1000010666695 width=96) (actual time=0.137..39385.725 rows=3000000 loops=1)  ->
SeqScan on q3c q3cs  (cost=0.00..60928.16 rows=3000016 width=48) (actual time=0.007..3659.171 rows=3000000 loops=1)  ->
Index Scan using ipix_idx on q3c  (cost=0.01..9686.37 rows=333335 width=48) (actual time=0.007..0.009 rows=1
loops=3000000)       Index Cond: ((q3c.ipix >= ("outer".ipix - 3)) AND (q3c.ipix <= ("outer".ipix + 3)))Total runtime:
40755.390ms
 
(5 rows)


############# 2)

test=# EXPLAIN ANALYZE SELECT * FROM q3c,q3c as q3cs WHERE 
(q3c.ipix>=q3cs.ipix-1000 AND q3c.ipix<=q3cs.ipix-993);                                                           QUERY
PLAN                                                            
 

-----------------------------------------------------------------------------------------------------------------------------------Nested
Loop (cost=0.01..49059519577.21 rows=1000010666695 width=96) (actual time=28058.983..28058.983 rows=0 loops=1)  ->  Seq
Scanon q3c q3cs  (cost=0.00..60928.16 rows=3000016 width=48) (actual time=0.061..3803.598 rows=3000000 loops=1)  ->
IndexScan using ipix_idx on q3c  (cost=0.01..9686.37 rows=333335 width=48) (actual time=0.006..0.006 rows=0
loops=3000000)       Index Cond: ((q3c.ipix >= ("outer".ipix - 1000)) AND (q3c.ipix <= ("outer".ipix - 993)))Total
runtime:28059.066 ms
 
(5 rows)

#############

And now I combine them in one:


test=# EXPLAIN ANALYZE SELECT * FROM q3c,q3c AS q3cs WHERE 
(q3c.ipix>=q3cs.ipix-3 AND q3c.ipix<=q3cs.ipix+3) OR 
(q3c.ipix>=q3cs.ipix-1000 AND q3c.ipix<=q3cs.ipix-993);
                QUERY PLAN                                                                                    
 

---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------Nested
Loop (cost=5832.03..190281569757.46 rows=1888909037091 width=96) (actual time=0.180..122444.610 rows=3000000 loops=1)
-> Seq Scan on q3c q3cs  (cost=0.00..60928.16 rows=3000016 width=48) (actual time=0.063..3871.731 rows=3000000 loops=1)
->  Bitmap Heap Scan on q3c  (cost=5832.03..43426.73 rows=666670 width=48) (actual time=0.033..0.034 rows=1
loops=3000000)       Recheck Cond: (((q3c.ipix >= ("outer".ipix - 3)) AND (q3c.ipix <= ("outer".ipix + 3))) OR
((q3c.ipix>= ("outer".ipix - 1000)) AND (q3c.ipix <= ("outer".ipix - 993))))        ->  BitmapOr
(cost=5832.03..5832.03rows=666670 width=0) (actual time=0.029..0.029 rows=0 loops=3000000)              ->  Bitmap
IndexScan on ipix_idx  (cost=0.00..2916.02 rows=333335 width=0) (actual time=0.014..0.014 rows=1 loops=3000000)
          Index Cond: ((q3c.ipix >= ("outer".ipix - 3)) AND (q3c.ipix <= ("outer".ipix + 3)))              ->  Bitmap
IndexScan on ipix_idx  (cost=0.00..2916.02 rows=333335 width=0) (actual time=0.011..0.011 rows=0 loops=3000000)
          Index Cond: ((q3c.ipix >= ("outer".ipix - 1000)) AND (q3c.ipix <= ("outer".ipix - 993)))Total runtime:
124366.545ms
 
(10 rows)


So, we see that total time of the plan with bitmap scan is roughly two times
greater than the sum of times of individual index scans.  


I see that in my case even simple union is significantly faster than bitmap
indexes. 

test=# EXPLAIN ANALYZE SELECT * FROM q3c,q3c AS q3cs WHERE 
(q3c.ipix>=q3cs.ipix-3 AND q3c.ipix<=q3cs.ipix+3) 
UNION ALL
SELECT * FROM q3c,q3c AS q3cs WHERE 
(q3c.ipix>=q3cs.ipix-1000 AND q3c.ipix<=q3cs.ipix-993);
QUERYPLAN                                                                   
 

-----------------------------------------------------------------------------------------------------------------------------------------------Append
(cost=0.01..118119252488.32 rows=2000021333390 width=96) (actual time=0.139..75048.897 rows=3000000 loops=1)  ->
SubqueryScan "*SELECT* 1"  (cost=0.01..59059626244.16 rows=1000010666695 width=96) (actual time=0.139..44221.409
rows=3000000loops=1)        ->  Nested Loop  (cost=0.01..49059519577.21 rows=1000010666695 width=96) (actual
time=0.137..40735.906rows=3000000 loops=1)              ->  Seq Scan on q3c q3cs  (cost=0.00..60928.16 rows=3000016
width=48)(actual time=0.063..3719.637 rows=3000000 loops=1)              ->  Index Scan using ipix_idx on q3c
(cost=0.01..9686.37rows=333335 width=48) (actual time=0.007..0.009 rows=1 loops=3000000)                    Index Cond:
((q3c.ipix>= ("outer".ipix - 3)) AND (q3c.ipix <= ("outer".ipix + 3)))  ->  Subquery Scan "*SELECT* 2"
(cost=0.01..59059626244.16rows=1000010666695 width=96) (actual time=28120.449..28120.449 rows=0 loops=1)        ->
NestedLoop  (cost=0.01..49059519577.21 rows=1000010666695 width=96) (actual time=28120.447..28120.447 rows=0 loops=1)
          ->  Seq Scan on q3c q3cs  (cost=0.00..60928.16 rows=3000016 width=48) (actual time=0.023..3649.739
rows=3000000loops=1)              ->  Index Scan using ipix_idx on q3c  (cost=0.01..9686.37 rows=333335 width=48)
(actualtime=0.006..0.006 rows=0 loops=3000000)                    Index Cond: ((q3c.ipix >= ("outer".ipix - 1000)) AND
(q3c.ipix<= ("outer".ipix - 993)))Total runtime: 76413.737 ms
 
(12 rows)


Last note: all those queries were run in fully cached regime on P4 2.8Ghz. 
I used the yesterday's CVS snapshot.

Are those performance results expected for the bitmap index scans ?  

Thanks in advance for any comments.


With Best Regards,    Sergey Koposov

------------------------------------------------------------
Sergey E. Koposov
Sternberg Astronomical Institute, Moscow University (Russia)
Max-Planck Institute for Astronomy (Germany) 
Internet: math@sai.msu.ru, http://lnfm1.sai.msu.su/~math/




Re: performance of bitmap scans in nested loop joins

From
Tom Lane
Date:
"Sergey E. Koposov" <math@sai.msu.ru> writes:
> I'd like to report about suprising (for me) results of performance testing of
> bitmap  indexes in nested loop join plans. 

I'm surprised too.  There's something weird in the indexscans
themselves:

>    ->  Index Scan using ipix_idx on q3c  (cost=0.01..9686.37 rows=333335 width=48) (actual time=0.006..0.006 rows=0
loops=3000000)
>          Index Cond: ((q3c.ipix >= ("outer".ipix - 1000)) AND (q3c.ipix <= ("outer".ipix - 993)))

>                ->  Bitmap Index Scan on ipix_idx  (cost=0.00..2916.02 rows=333335 width=0) (actual time=0.011..0.011
rows=0loops=3000000)
 
>                      Index Cond: ((q3c.ipix >= ("outer".ipix - 1000)) AND (q3c.ipix <= ("outer".ipix - 993)))

The latter is (or should be) doing slightly *less* work, so why is it
taking almost twice as much time?  Can you get gprof profiles of the
two cases?
        regards, tom lane


Re: performance of bitmap scans in nested loop joins

From
"Sergey E. Koposov"
Date:
On Fri, 29 Apr 2005, Tom Lane wrote:
> >    ->  Index Scan using ipix_idx on q3c  (cost=0.01..9686.37 rows=333335 width=48) (actual time=0.006..0.006 rows=0
loops=3000000)
> >          Index Cond: ((q3c.ipix >= ("outer".ipix - 1000)) AND (q3c.ipix <= ("outer".ipix - 993)))
> 
> >                ->  Bitmap Index Scan on ipix_idx  (cost=0.00..2916.02 rows=333335 width=0) (actual
time=0.011..0.011rows=0 loops=3000000)
 
> >                      Index Cond: ((q3c.ipix >= ("outer".ipix - 1000)) AND (q3c.ipix <= ("outer".ipix - 993)))
> 
> The latter is (or should be) doing slightly *less* work, so why is it
> taking almost twice as much time?  Can you get gprof profiles of the
> two cases?


I've got them. Here there are two gprof profiles:

http://lnfm1.sai.msu.ru/~math/public_misc/idxscan.gprof
http://lnfm1.sai.msu.ru/~math/public_misc/bitmap.gprof (now as links, because the previous letter with those files as
attachements haven't passed on -hackers (due to size, I think)) 
 


bitmap.gprof is  the profiling of the:

test=# explain analyze select * from q3c,q3c as q3cs where 
(q3c.ipix>=q3cs.ipix-3 AND q3c.ipix<=q3cs.ipix+3) OR 
(q3c.ipix>=q3cs.ipix-1000 AND q3c.ipix<=q3cs.ipix-993);
             
 
QUERY PLAN                                                                      
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
-----------------Nested Loop  (cost=5832.01..190280130928.00 rows=1888888888889 width=96)
(actual time=0.435..374743.591 rows=3000000 loops=1)  ->  Seq Scan on q3c q3cs  (cost=0.00..60928.00 rows=3000000
width=48)
(actual time=0.079..10632.570 rows=3000000 loops=1)  ->  Bitmap Heap Scan on q3c  (cost=5832.01..43426.68 rows=666667
width=48)
(actual time=0.102..0.104 rows=1 loops=3000000)        Recheck Cond: (((q3c.ipix >= ("outer".ipix - 3)) AND (q3c.ipix
<=
("outer".ipix + 3))) OR ((q3c.ipix >= ("outer".ipix - 1000)) AND (q3c.ipix
<=
("outer".ipix - 993))))        ->  BitmapOr  (cost=5832.01..5832.01 rows=666667 width=0) (actual
time=0.094..0.094 rows=0 loops=3000000)              ->  Bitmap Index Scan on ipix_idx  (cost=0.00..2916.01
rows=333333 width=0) (actual time=0.045..0.045 rows=1 loops=3000000)                    Index Cond: ((q3c.ipix >=
("outer".ipix- 3)) AND
 
(q3c.ipix <= ("outer".ipix + 3)))              ->  Bitmap Index Scan on ipix_idx  (cost=0.00..2916.01
rows=333333 width=0) (actual time=0.041..0.041 rows=0 loops=3000000)                    Index Cond: ((q3c.ipix >=
("outer".ipix- 1000)) AND
 
(q3c.ipix <= ("outer".ipix - 993)))Total runtime: 377551.805 ms
(10 rows)


And idxscan.gprof is the profiling of the:

test=# EXPLAIN ANALYZE SELECT * FROM q3c,q3c as q3cs WHERE 
(q3c.ipix>=q3cs.ipix-1000 AND q3c.ipix<=q3cs.ipix-993);                                                           QUERY
PLAN         
 
--------------------------------------------------------------------------------
---------------------------------------------------Nested Loop  (cost=0.01..49059045928.00 rows=1000000000000
width=96)
(actual
time=104991.950..104991.950 rows=0 loops=1)  ->  Seq Scan on q3c q3cs  (cost=0.00..60928.00 rows=3000000 width=48)
(actual time=0.069..10465.514 rows=3000000 loops=1)  ->  Index Scan using ipix_idx on q3c  (cost=0.01..9686.33
rows=333333
width=48) (actual time=0.025..0.025 rows=0 loops=3000000)        Index Cond: ((q3c.ipix >= ("outer".ipix - 1000)) AND
(q3c.ipix<=
 
("outer".ipix - 993)))Total runtime: 104992.202 ms
(5 rows)


With Best regards,        Sergey Koposov







Re: performance of bitmap scans in nested loop joins

From
Tom Lane
Date:
"Sergey E. Koposov" <math@sai.msu.ru> writes:
> On Fri, 29 Apr 2005, Tom Lane wrote:
>> The latter is (or should be) doing slightly *less* work, so why is it
>> taking almost twice as much time?  Can you get gprof profiles of the
>> two cases?

> I've got them. I attached two files with a little bit truncated output of 
> gprof. 

I figured out part of the problem: I had made nodeBitmapIndexscan
re-open the index on each call, thinking that that would save amrescan
calls.  But an amrescan is a whole lot cheaper than index open/close,
so that was a bad tradeoff.

This seems to account for about half of the slowdown you exhibited.
I'm not sure where the other half went ... gprof isn't turning up
any obvious candidates.
        regards, tom lane


Re: performance of bitmap scans in nested loop joins

From
"Sergey E. Koposov"
Date:
On Wed, 4 May 2005, Tom Lane wrote:

> 
> I figured out part of the problem: I had made nodeBitmapIndexscan
> re-open the index on each call, thinking that that would save amrescan
> calls.  But an amrescan is a whole lot cheaper than index open/close,
> so that was a bad tradeoff.
> 
> This seems to account for about half of the slowdown you exhibited.
> I'm not sure where the other half went ... gprof isn't turning up
> any obvious candidates.
> 

Great!!!
Thanks. 

Now I have done again the profiling. Really, now the bitmap index scan is 
faster and take 90 seconds instead of 120 seconds. But it still more than the
sum of two simple index scans (26+38 secs). 

I also tried to help you, Tom, to find the exact location of the remaining 
problem. For that purpose I did the profiling of those queries:

test=# explain analyze select * from q3c,q3c as q3cs where
q3c.ipix>=q3cs.ipix-3 AND q3c.ipix<=q3cs.ipix+3;

test=# EXPLAIN ANALYZE SELECT * FROM q3c,q3c as q3cs WHERE
(q3c.ipix>=q3cs.ipix-1000 AND q3c.ipix<=q3cs.ipix-993);

test=# explain analyze select * from q3c,q3c as q3cs where
(q3c.ipix>=q3cs.ipix-3 AND q3c.ipix<=q3cs.ipix+3) OR
(q3c.ipix>=q3cs.ipix-1000 AND  q3c.ipix<=q3cs.ipix-993);

And I coadded the "flat profiles" of first two (index scan) queries and 
compared it with the flat profile of bitmap scan:         self seconds     | self seconds 
Function              of TWO index scans | of bitmap scan
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
LWLockAcquire                       51.29      46.29
LWLockRelease                       40.29      38.65
_bt_compare                         38.35      36.20
hash_seq_search                     0.00      30.08
hash_search                         26.90      29.32
AllocSetAlloc                       10.73      27.23
btint8cmp                           30.95      24.05
PinBuffer                           23.25      22.56
hash_any                            18.58      20.18
_bt_binsrch                         19.73      19.09
UnpinBuffer                         18.27      17.44
_bt_first                           13.92      16.20
FunctionCall2                       18.23      14.62
_bt_preprocess_keys                 12.01      13.67
_bt_moveright                       13.42      11.87
InstrStopNode                       7.62      11.24
seg_alloc                           0.00      11.13
ReadBuffer                          11.42      10.68
InstrEndLoop                        4.11      10.38
ExecProcNode                        14.60      10.17
fmgr_info_cxt_security              13.87      9.96
_bt_search                          10.41      9.80
AllocSetReset                       9.70      9.47
AllocSetFree                        4.00      7.75
_bt_getroot                         7.54      7.35
LockBuffer                          9.32      7.07
MemoryContextAlloc                  3.22      6.93
ExecMakeFunctionResultNoSets        5.95      6.85
element_alloc                       0.12      6.73
_bt_checkkeys                       8.68      6.07
slot_deform_tuple                   7.71      5.77
ExecReScan                          2.89      5.67
MemoryContextAllocZero              0.00      5.43
ReleaseAndReadBuffer                6.20      5.00
ExecEvalVar                         4.02      4.80
ReleaseBuffer                       5.39      4.28
ExecScan                            4.39      4.11
ExecutorRun                         1.64      4.08
slot_getattr                        4.80      4.00
_bt_relandgetbuf                    4.31      3.70
MemoryContextCreate                 0.00      3.54


The strongest difference is in hash_seq_search (30 seconds) and 
AllocSetAlloc -- 17 seconds. (which contribute a significant part into 
slowness of bitmap scan) (I remind that  in my case bitmap scan is slower 
than 2 index scans by ~ 20-30seconds (34 seconds when postgres was compiled
with profiling support, 25 seconds when postgres was compiled without
profiling support)). 

Since I don't know exactly the normal sequence of function calls in postgres,
I cannot judge whether those functions are true reason of the problem or not,
but probably that information can help.

In any case, again I have put all the full profiling results on the web

http://lnfm1.sai.msu.ru/~math/public_misc/idxscan1.gprof
profiling of
test=# EXPLAIN ANALYZE SELECT * FROM q3c,q3c as q3cs WHERE
(q3c.ipix>=q3cs.ipix-1000 AND q3c.ipix<=q3cs.ipix-993);


http://lnfm1.sai.msu.ru/~math/public_misc/_idxscan1.gprof
profiling of
test=# explain analyze select * from q3c,q3c as q3cs where
q3c.ipix>=q3cs.ipix-3 AND q3c.ipix<=q3cs.ipix+3;


http://lnfm1.sai.msu.ru/~math/public_misc/bitmap1.gprof
profiling of
test=# explain analyze select * from q3c,q3c as q3cs where
(q3c.ipix>=q3cs.ipix-3 AND q3c.ipix<=q3cs.ipix+3) OR
(q3c.ipix>=q3cs.ipix-1000 AND  q3c.ipix<=q3cs.ipix-993);


Here, this is just the new output of explain analyze (for the postgres
compiled without profiling support)

test=# explain analyze select * from q3c,q3c as q3cs where (q3c.ipix>=q3cs.ipix-3 AND q3c.ipix<=q3cs.ipix+3) OR
(q3c.ipix>=q3cs.ipix-1000AND q3c.ipix<=q3cs.ipix-993);
               QUERY PLAN                                                                                    
 

---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------Nested
Loop (cost=5832.03..190281569757.46 rows=1888909037091 width=96) (actual time=0.173..88500.469 rows=3000000 loops=1)
-> Seq Scan on q3c q3cs  (cost=0.00..60928.16 rows=3000016 width=48) (actual time=0.072..3750.724 rows=3000000 loops=1)
->  Bitmap Heap Scan on q3c  (cost=5832.03..43426.73 rows=666670 width=48) (actual time=0.021..0.022 rows=1
loops=3000000)       Recheck Cond: (((q3c.ipix >= ("outer".ipix - 3)) AND (q3c.ipix <= ("outer".ipix + 3))) OR
((q3c.ipix>= ("outer".ipix - 1000)) AND (q3c.ipix <= ("outer".ipix - 993))))        ->  BitmapOr
(cost=5832.03..5832.03rows=666670 width=0) (actual time=0.017..0.017 rows=0 loops=3000000)              ->  Bitmap
IndexScan on ipix_idx  (cost=0.00..2916.02 rows=333335 width=0) (actual time=0.008..0.008 rows=1 loops=3000000)
          Index Cond: ((q3c.ipix >= ("outer".ipix - 3)) AND (q3c.ipix <= ("outer".ipix + 3)))              ->  Bitmap
IndexScan on ipix_idx  (cost=0.00..2916.02 rows=333335 width=0) (actual time=0.006..0.006 rows=0 loops=3000000)
          Index Cond: ((q3c.ipix >= ("outer".ipix - 1000)) AND (q3c.ipix <= ("outer".ipix - 993)))Total runtime:
90241.180ms
 



test=# EXPLAIN ANALYZE SELECT * FROM q3c,q3c as q3cs WHERE 
test-# (q3c.ipix>=q3cs.ipix-1000 AND q3c.ipix<=q3cs.ipix-993);
QUERY PLAN                                                             
 

-----------------------------------------------------------------------------------------------------------------------------------Nested
Loop (cost=0.01..49059519577.21 rows=1000010666695 width=96) (actual time=26391.542..26391.542 rows=0 loops=1)  ->  Seq
Scanon q3c q3cs  (cost=0.00..60928.16 rows=3000016 width=48) (actual time=0.059..3393.616 rows=3000000 loops=1)  ->
IndexScan using ipix_idx on q3c  (cost=0.01..9686.37 rows=333335 width=48) (actual time=0.006..0.006 rows=0
loops=3000000)       Index Cond: ((q3c.ipix >= ("outer".ipix - 1000)) AND (q3c.ipix <= ("outer".ipix - 993)))Total
runtime:26391.623 ms
 
(5 rows)



test=# explain analyze select * from q3c,q3c as q3cs where q3c.ipix>=q3cs.ipix-3 AND q3c.ipix<=q3cs.ipix+3;
                                             QUERY PLAN                                                             
 

-----------------------------------------------------------------------------------------------------------------------------------Nested
Loop (cost=0.01..49059519577.21 rows=1000010666695 width=96) (actual time=0.108..37355.388 rows=3000000 loops=1)  ->
SeqScan on q3c q3cs  (cost=0.00..60928.16 rows=3000016 width=48) (actual time=0.039..3403.276 rows=3000000 loops=1)  ->
Index Scan using ipix_idx on q3c  (cost=0.01..9686.37 rows=333335 width=48) (actual time=0.007..0.008 rows=1
loops=3000000)       Index Cond: ((q3c.ipix >= ("outer".ipix - 3)) AND (q3c.ipix <= ("outer".ipix + 3)))Total runtime:
38588.672ms
 
(5 rows)


With best regards, 
Sergey Koposov




Re: performance of bitmap scans in nested loop joins

From
Tom Lane
Date:
"Sergey E. Koposov" <math@sai.msu.ru> writes:
> And I coadded the "flat profiles" of first two (index scan) queries and 
> compared it with the flat profile of bitmap scan: 

Thanks, I had been thinking of doing that same calculation but hadn't
got round to it yet.  It looks like the bitmap case is actually a little
ahead on buffer access (as you'd expect) and btree work (which is
surprising because it ought to be dead even; are these numbers very
repeatable?).  Where we are losing is mostly on the actual manipulation
of the bitmaps (particularly hash_seq_search which is done in
tbm_begin_iterate; and it looks like memory allocation for the bitmap
hashtables is nontrivial too).  I had already had a TODO item to look
into speeding up hash_seq_search ... will see what I can find.
        regards, tom lane


Re: performance of bitmap scans in nested loop joins

From
"Sergey E. Koposov"
Date:
On Thu, 5 May 2005, Tom Lane wrote:

> "Sergey E. Koposov" <math@sai.msu.ru> writes:
> > And I coadded the "flat profiles" of first two (index scan) queries and 
> > compared it with the flat profile of bitmap scan: 
> 
> Thanks, I had been thinking of doing that same calculation but hadn't
> got round to it yet.  It looks like the bitmap case is actually a little
> ahead on buffer access (as you'd expect) and btree work (which is
> surprising because it ought to be dead even; are these numbers very
> repeatable?).  Where we are losing is mostly on the actual manipulation

Yes, all those timings are rather stable and have been obtained on the very
unloaded system (and in fully cached regime). 

With Best regards,    Sergey Koposov



Re: performance of bitmap scans in nested loop joins

From
Tom Lane
Date:
I wrote:
> ... Where we are losing is mostly on the actual manipulation
> of the bitmaps (particularly hash_seq_search which is done in
> tbm_begin_iterate; and it looks like memory allocation for the bitmap
> hashtables is nontrivial too).  I had already had a TODO item to look
> into speeding up hash_seq_search ... will see what I can find.

I got around to looking at this some more.  It seems that the basic
problem is that dynahash.c isn't optimized for very small hashtables.
The test case I'm looking at (which may be an oversimplification of
Sergey's problem) never has more than one entry in the hashtables it
creates for in-memory bitmaps, and so the hashtable overhead is pretty
significant.  Particularly the overhead of creating and deleting 'em.

The other uses of dynahash.c that are at all performance-critical seem
to deal with hashtables that are (a) fairly large and (b) long-lived.
So I'm thinking that hacking dynahash.c itself to improve this situation
would probably be counterproductive overall.

An idea that comes to mind is to allow tidbitmap.c to handle the cases
of zero or one page entry directly, storing the page entry in a simple
field of the TIDBitmap struct.  Only when the bitmap needs entries for
more than one heap page will we create a subsidiary hash table.  (Note
that we could store bits for more than one tuple, if they happen to be
on the same heap page.)

This would uglify the logic in tidbitmap.c a bit, but it doesn't seem
completely preposterous.  The main objection I can see is that it would
only optimize the zero-or-one-page cases, which might be insufficient.

Anyone have a better idea for speeding up operations on small bitmaps?
        regards, tom lane