Hello Hackers,
I've recently noticed a particularly strange behaviour of one of my simple
queries:
SELECT ra,dec FROM sdssdr7.photoobjall WHERE ra BETWEEN 175 and 190 AND dec BETWEEN 17 And 23 ;
The table is quite large (1.1Tb, 10^9 rows, and has Btree indexes on
"ra","dec" columns).
The plan is a combination of two bitmap scans:
wsdb=# explain SELECT ra,dec FROM sdssdr7.photoobjall WHERE ra BETWEEN 175 and 190 AND dec BETWEEN 17 And 23 ;
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------------
BitmapHeap Scan on photoobjall (cost=1854918.62..13401835.50 rows=3266290 width=16) Recheck Cond: ((ra >=
175::doubleprecision) AND (ra <= 190::double precision) AND ("dec" >= 17::double precision) AND ("dec" <= 23::double
precision)) -> BitmapAnd (cost=1854918.62..1854918.62 rows=3266290 width=0) -> Bitmap Index Scan on
sdssdr7_photoobjall_ra_idx (cost=0.00..607217.21 rows=30530306 width=0) Index Cond: ((ra >= 175::double
precision)AND (ra <= 190::double precision)) -> Bitmap Index Scan on sdssdr7_photoobjall_dec_idx
(cost=0.00..1246068.01rows=62654186 width=0) Index Cond: (("dec" >= 17::double precision) AND ("dec" <=
23::doubleprecision))
But the funny thing I noticed is that the query after running a certain
amount of time doing I/O, starts to use 100%CPU and spend 99% the time in
hash_seq_search. Here is the oprofile of PG during that period:
--------
CPU: Intel Core/i7, speed 2.268e+06 MHz (estimated)
Counted CPU_CLK_UNHALTED events (Clock cycles when not halted) with a unit mask of 0x00 (No unit mask) count 100000
samples % symbol name
303404 99.3562 hash_seq_search
1163 0.3808 tbm_lossify
639 0.2093 hash_search_with_hash_value
....
--------
During that it very rarely tries to actually do any I/O. Every second or
so it does issue an read of the main relation(not the index). Here is
how strace looks like during that time:
read(455, "\0\0\0\0\0\0\0\0\1\0\0\0t\6\200\6\360\37\4 \0\0\0\0\200\206 \0\340\237 \0\320"..., 8192) = 8192
... repeated 100 or more times(issued every second or so) ..
lseek(455, 603635712, SEEK_SET) = 603635712
read(455, "\0\0\0\0\0\0\0\0\1\0\0\0t\6\200\6\360\37\4 \0\0\0\0\200\206 \0\340\237 \0\320"..., 8192) = 8192
... repeated a lot of times...
...etc...
So it seems that it actually reads the same thing all over again and
again.
Another funny thing is that I have the exactly the same table, but with
significantly smaller number of columns (so it occupies 180 gb
instead) but has the same indexes. Also all the rows in the table are
exactly in the same order (both tables were clustered using the same
thing). For that table the query succeeds without problems in 20 seconds
or so:
wsdb=> explain analyze SELECT ra,dec FROM sdssdr7.phototag WHERE ra BETWEEN 175 and 190 AND dec BETWEEN 17 And 23 ;
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------------------------------------------
BitmapHeap Scan on phototag (cost=1824731.39..10750473.21 rows=3204629 width=16) (actual time=21146.520..23136.512
rows=2973205loops=1) Recheck Cond: ((ra >= 175::double precision) AND (ra <= 190::double precision) AND ("dec" >=
17::doubleprecision) AND ("dec" <= 23::double precision)) -> BitmapAnd (cost=1824731.39..1824731.39 rows=3204629
width=0)(actual time=21093.906..21093.906 rows=0 loops=1) -> Bitmap Index Scan on sdssdr7_phototag2_ra_idx
(cost=0.00..613910.25rows=30866810 width=0) (actual time=6502.589..6502.589 rows=30418322 loops=1) Index
Cond:((ra >= 175::double precision) AND (ra <= 190::double precision)) -> Bitmap Index Scan on
sdssdr7_phototag2_dec_idx (cost=0.00..1209218.57 rows=60801242 width=0) (actual time=14201.455..14201.455 rows=63031352
loops=1) Index Cond: (("dec" >= 17::double precision) AND ("dec" <= 23::double precision)) Total runtime:
23295.384ms
(8 rows)
While for "wider" table it is still running the query for more than an
hour with 100%CPU an almost no I/O
Additional info:
PG version 8.4.8,
OS: debian 5.0
hardware: 2xXeon E5520, 12GB RAM, RAID50 with BBU
The tables are completely static(e.g. no write activity on them), and have
been vacuum analyzed.
The system is not busy at all (e.g. not much if any concurrent queries).
The tables have large number of columns of real/int/bigint/double
precision time.
Some config parameters: effective_cache_size | 6GB work_mem | 1GB
effective_io_concurrency | 0 shared_buffers | 4GB
Any ideas what can be wrong? Any info I can provide ?
Thanks in advance, Sergey
*******************************************************************
Sergey E. Koposov, PhD
Institute for Astronomy, Cambridge UK /Sternberg Astronomical Institute
Web: http://lnfm1.sai.msu.ru/~math
E-mail: math@sai.msu.ru