Thread: two index bitmap scan of a big table & hash_seq_search

two index bitmap scan of a big table & hash_seq_search

From: "Sergey E. Koposov"
Hello Hackers,

I've recently noticed a particularly strange behaviour of one of my simple 
queries:
SELECT ra,dec FROM sdssdr7.photoobjall WHERE ra BETWEEN 175 and 190 AND   dec BETWEEN 17 And 23 ;
The table is quite large (1.1 TB, ~10^9 rows) and has B-tree indexes on the
"ra" and "dec" columns.
The plan is a combination of two bitmap scans:
wsdb=# explain SELECT ra,dec FROM sdssdr7.photoobjall WHERE ra BETWEEN 175 and 190 AND dec BETWEEN 17 And 23 ;
                                                             QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on photoobjall  (cost=1854918.62..13401835.50 rows=3266290 width=16)
   Recheck Cond: ((ra >= 175::double precision) AND (ra <= 190::double precision) AND ("dec" >= 17::double precision) AND ("dec" <= 23::double precision))
   ->  BitmapAnd  (cost=1854918.62..1854918.62 rows=3266290 width=0)
         ->  Bitmap Index Scan on sdssdr7_photoobjall_ra_idx  (cost=0.00..607217.21 rows=30530306 width=0)
               Index Cond: ((ra >= 175::double precision) AND (ra <= 190::double precision))
         ->  Bitmap Index Scan on sdssdr7_photoobjall_dec_idx  (cost=0.00..1246068.01 rows=62654186 width=0)
               Index Cond: (("dec" >= 17::double precision) AND ("dec" <= 23::double precision))

But the funny thing I noticed is that after running for a certain amount of
time doing I/O, the query starts to use 100% CPU and spends 99% of its time
in hash_seq_search. Here is the oprofile of PG during that period:
--------
CPU: Intel Core/i7, speed 2.268e+06 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit mask of 0x00 (No unit mask) count 100000
samples  %        symbol name
303404   99.3562  hash_seq_search
1163      0.3808  tbm_lossify
639       0.2093  hash_search_with_hash_value
....
--------

During that time it very rarely does any actual I/O. Every second or so it
issues a read of the main relation (not the index). Here is what strace
looks like during that period:
read(455, "\0\0\0\0\0\0\0\0\1\0\0\0t\6\200\6\360\37\4 \0\0\0\0\200\206 \0\340\237 \0\320"..., 8192) = 8192
... repeated 100 or more times (issued every second or so) ...
lseek(455, 603635712, SEEK_SET)         = 603635712
read(455, "\0\0\0\0\0\0\0\0\1\0\0\0t\6\200\6\360\37\4 \0\0\0\0\200\206 \0\340\237 \0\320"..., 8192) = 8192
... repeated a lot of times...
...etc...
So it seems that it actually reads the same block over and over again.

Another funny thing is that I have exactly the same table, but with a
significantly smaller number of columns (so it occupies 180 GB instead) and
with the same indexes. All the rows are in exactly the same order (both
tables were clustered the same way). For that table the query finishes
without problems in 20 seconds or so:
wsdb=> explain analyze SELECT ra,dec FROM sdssdr7.phototag WHERE ra BETWEEN 175 and 190 AND dec BETWEEN 17 And 23 ;
                                                                       QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Bitmap Heap Scan on phototag  (cost=1824731.39..10750473.21 rows=3204629 width=16) (actual time=21146.520..23136.512 rows=2973205 loops=1)
   Recheck Cond: ((ra >= 175::double precision) AND (ra <= 190::double precision) AND ("dec" >= 17::double precision) AND ("dec" <= 23::double precision))
   ->  BitmapAnd  (cost=1824731.39..1824731.39 rows=3204629 width=0) (actual time=21093.906..21093.906 rows=0 loops=1)
         ->  Bitmap Index Scan on sdssdr7_phototag2_ra_idx  (cost=0.00..613910.25 rows=30866810 width=0) (actual time=6502.589..6502.589 rows=30418322 loops=1)
               Index Cond: ((ra >= 175::double precision) AND (ra <= 190::double precision))
         ->  Bitmap Index Scan on sdssdr7_phototag2_dec_idx  (cost=0.00..1209218.57 rows=60801242 width=0) (actual time=14201.455..14201.455 rows=63031352 loops=1)
               Index Cond: (("dec" >= 17::double precision) AND ("dec" <= 23::double precision))
 Total runtime: 23295.384 ms
(8 rows)

While for "wider" table it is still running the query for more than an 
hour  with 100%CPU an almost no I/O

Additional info:
PG version: 8.4.8
OS: Debian 5.0
Hardware: 2x Xeon E5520, 12GB RAM, RAID50 with BBU
The tables are completely static (i.e. no write activity on them) and have
been vacuum analyzed.
The system is not busy at all (few if any concurrent queries).
The tables have a large number of columns of real/int/bigint/double
precision type.

Some config parameters:
effective_cache_size      | 6GB
work_mem                  | 1GB
effective_io_concurrency  | 0
shared_buffers            | 4GB

Any ideas what could be wrong? Any info I can provide?

Thanks in advance,    Sergey

*******************************************************************
Sergey E. Koposov, PhD
Institute for Astronomy, Cambridge UK /Sternberg Astronomical Institute
Web: http://lnfm1.sai.msu.ru/~math
E-mail: math@sai.msu.ru


Re: two index bitmap scan of a big table & hash_seq_search

From: Tom Lane
"Sergey E. Koposov" <math@sai.msu.ru> writes:
> But the funny thing I noticed is that the query after running a certain 
> amount of time doing I/O, starts to use 100%CPU and spend 99% the time in 
> hash_seq_search. Here is the oprofile of PG during that period:
> --------
> CPU: Intel Core/i7, speed 2.268e+06 MHz (estimated)
> Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit mask of 0x00 (No unit mask) count 100000
> samples  %        symbol name
> 303404   99.3562  hash_seq_search
> 1163      0.3808  tbm_lossify
> 639       0.2093  hash_search_with_hash_value

It seems like you've uncovered a scaling limitation in the tidbitmap
logic when it has to deal with very very large numbers of pages.

I might be reading too much into the mention of tbm_lossify, but
I wonder if the problem is repeated invocations of tbm_lossify()
as the bitmap gets larger.  Maybe that function needs to be more
aggressive about how much information it deletes per call.
        regards, tom lane


Re: two index bitmap scan of a big table & hash_seq_search

From: "Sergey E. Koposov"
On Fri, 19 Aug 2011, Tom Lane wrote:
> I might be reading too much into the mention of tbm_lossify, but
> I wonder if the problem is repeated invocations of tbm_lossify()
> as the bitmap gets larger.  Maybe that function needs to be more
> aggressive about how much information it deletes per call.
Thanks for the idea, Tom.

Yes, it turns out that the problem was in lossify'ing the bitmap too
intensely. I've put elog()s around the lossification in tbm_add_tuples():

               if (tbm->nentries > tbm->maxentries)
               {
                       elog(WARNING, "lossifying %d %d", tbm->nentries,
                            tbm->maxentries);
                       tbm_lossify(tbm);
                       elog(WARNING, "lossified %d", tbm->nentries);
               }

And I saw in my log:
koposov:wsdb:2011-08-20 17:31:46 BST:21524 WARNING:  lossifying 13421773 13421772
koposov:wsdb:2011-08-20 17:31:46 BST:21524 WARNING:  lossified 13421772
emitted at a rate of ~20000 per second, i.e. it lossifies just one page per
tbm_lossify() call (and does a lot of hash_seq_search work in the process)...

After that I changed the check in tbm_lossify()
from:                if (tbm->nentries <= tbm->maxentries)
to:                if (tbm->nentries <= (0.8*tbm->maxentries))

which allowed the query to finish in 75 seconds (compared to 3 hours).
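
For illustration, here is a toy model (not the PostgreSQL code: the slot
array, the cap and the front-to-back walk are only stand-ins for the
tidbitmap hash table, maxentries and hash_seq_search) that counts how many
slots the repeated walks have to visit with the old stop condition versus a
halved target:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define NSLOTS     20000        /* buckets in the toy "hash table"        */
#define MAXENTRIES 10000        /* work_mem-style cap on live entries     */

static unsigned char slot[NSLOTS];

static long long
run(long inserts, long target)
{
    long long visited = 0;
    long nentries = 0;

    memset(slot, 0, sizeof(slot));
    srand(42);

    for (long i = 0; i < inserts; i++)
    {
        long s;

        /* "hash" a new exact page entry into a random free slot */
        do
            s = rand() % NSLOTS;
        while (slot[s]);
        slot[s] = 1;
        nentries++;

        if (nentries > MAXENTRIES)
        {
            /* "tbm_lossify": a fresh front-to-back walk each time */
            for (long j = 0; j < NSLOTS && nentries > target; j++)
            {
                visited++;
                if (slot[j])
                {
                    slot[j] = 0;        /* collapse one exact entry */
                    nentries--;
                }
            }
        }
    }
    return visited;
}

int main(void)
{
    long inserts = 100000;

    printf("stop at maxentries   : %lld slots visited\n",
           run(inserts, MAXENTRIES));
    printf("stop at maxentries/2 : %lld slots visited\n",
           run(inserts, MAXENTRIES / 2));
    return 0;
}

With the old condition the walk stops as soon as a single entry has been
freed, so it restarts on almost every insertion; with the lower target the
walks are far fewer and their cost is amortized, which is the same
qualitative difference as the 3 hours versus 75 seconds above.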

I'm not entirely sure that my fix to tbm_lossify() is the proper one, but it
looks all right.
Do you think this should be fixed?
    Sergey

*******************************************************************
Sergey E. Koposov, PhD
Institute for Astronomy, Cambridge/Sternberg Astronomical Institute
Web: http://lnfm1.sai.msu.ru/~math
E-mail: math@sai.msu.ru


Re: two index bitmap scan of a big table & hash_seq_search

From: Tom Lane
"Sergey E. Koposov" <math@sai.msu.ru> writes:
> Yes, it turns out that the problem was in lossify'ing the bitmap too
> intensely.

Yeah, I had just been coming to the same conclusion.  Your table has
about 134M pages, and if the planner estimate of 62M rows was right
(and there's no reason it shouldn't be pretty close on that) then
we're talking about a bitmap that's going to contain about one bit
set in each of about half of the pages.  The page structures are
50-some bytes apiece so a non-lossy representation would run to
3-plus GB, well beyond your work_mem limit.  So it would fill up
to work_mem and then start lossifying pages ... one at a time.
I had suspected that there might be a performance issue there,
as per the comment at line 954, but we hadn't actually seen it
reported from the field before.
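
Spelling that estimate out (the 56-byte figure below is just a stand-in for
the "50-some bytes" per page structure, not an exact sizeof):

#include <stdio.h>

int main(void)
{
    double entries     = 62654186.0;   /* planner estimate for the dec range */
    double entry_bytes = 56.0;         /* assumed size of one page structure */
    double work_mem    = 1.0 * 1024 * 1024 * 1024;  /* the reported 1GB setting */

    printf("exact bitmap: %.2f GB, work_mem: %.2f GB\n",
           entries * entry_bytes / (1024.0 * 1024 * 1024),
           work_mem / (1024.0 * 1024 * 1024));
    /* prints roughly 3.3 GB vs 1.00 GB, so the bitmap has to be lossified */
    return 0;
}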

> After that I changed the check in tbm_lossify()
> from:
>                  if (tbm->nentries <= tbm->maxentries)
> to:
>                  if (tbm->nentries <= (0.8*tbm->maxentries))
> which allowed the query to finish in 75 seconds (compared to 3 hours).

I was about to propose using tbm->maxentries/2, which is in the same
spirit but a tad cheaper to calculate.

I think that we also need to consider the possibility that tbm_lossify
finishes its loop without ever getting under maxentries --- that could
only happen with very large tables and very small work_mem, but it could
happen.  If it did, then all subsequent operations would keep on calling
tbm_lossify, and it would keep scanning the entire hashtable and
probably not accomplishing much, and taking forever to do it.  Unless
somebody has a better idea, what I think we should do then is just
artificially inflate maxentries --- that is, accept that we are not
going to fit in the originally requested work_mem, and we might as well
set a more realistic goal.
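
A compilable toy sketch of those two changes (the types are simplified
stand-ins, not PostgreSQL's actual TIDBitmap; only the control flow is the
point): stop lossifying once we are down to maxentries/2, and inflate
maxentries if even a full pass over the table cannot get under the cap.

#include <stdio.h>

typedef struct ToyBitmap
{
    long nentries;      /* live entries in the page table                  */
    long maxentries;    /* cap derived from work_mem                       */
    long nexact;        /* entries that are still exact (non-lossy) pages  */
} ToyBitmap;

/* Pretend to collapse one exact page into a lossy chunk; in this toy each
 * collapse frees at most one entry. */
static void
collapse_one_page(ToyBitmap *tbm)
{
    if (tbm->nexact > 0)
    {
        tbm->nexact--;
        tbm->nentries--;
    }
}

static void
toy_lossify(ToyBitmap *tbm)
{
    long pass_length = tbm->nentries;   /* a full pass visits every entry */

    for (long scanned = 0; scanned < pass_length; scanned++)
    {
        collapse_one_page(tbm);
        if (tbm->nentries <= tbm->maxentries / 2)
            return;                     /* freed plenty; stop the scan early */
    }

    /* A whole pass finished while still over the cap: accept that the
     * requested work_mem target is unreachable and set a more realistic
     * goal, so later calls do not rescan the table forever. */
    if (tbm->nentries > tbm->maxentries)
        tbm->maxentries = tbm->nentries;
}

int main(void)
{
    /* Over the cap but with nothing left to collapse: the fallback fires. */
    ToyBitmap tbm = { .nentries = 1200, .maxentries = 1000, .nexact = 0 };

    toy_lossify(&tbm);
    printf("nentries=%ld maxentries=%ld\n", tbm.nentries, tbm.maxentries);
    /* prints: nentries=1200 maxentries=1200 */
    return 0;
}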

> Do you think that this should be fixed ?

Yes, definitely.
        regards, tom lane