Prefetch index pages for B-Tree index scans - Mailing list pgsql-hackers
From: Claudio Freire
Subject: Prefetch index pages for B-Tree index scans
Msg-id: CAGTBQpZzf70n0PYJ=VQLd+jb3wJGo=2TXmY+SkJD6G_vjC5QNg@mail.gmail.com
Responses: Re: [PATCH] Prefetch index pages for B-Tree index scans (Claudio Freire <klaussfreire@gmail.com>)
List: pgsql-hackers
I've noticed, while running some reporting queries, that index scans fail to saturate server resources on compute-intensive queries. The problem is that, just after fetching a page, postgres starts computing before fetching the next one. This results in an I/O - compute - I/O - compute alternation that leaves both CPU and disk partially idle.

I've also noticed that earlier patches attempted to implement prefetch of index pages, yet they don't seem to have been committed, and I'm wondering why. The patches themselves were quite complex, attempting to prefetch heap tuples in addition to index tuples. I can see how that would be beneficial, but heap prefetch is notoriously more complicated, and with the advent of index-only scans, maybe index-page-only prefetching would be of use.

To test this hypothesis, I wrote the attached patch. pgbench doesn't seem to see any performance impact (expected, since pgbench uses really tiny queries that touch a single page). Using pgbench's data with a scale of 1000, I tried some other queries that were more taxing of index-only scans. This was done on my personal computer, with a really lame I/O subsystem (compared to database servers), and with 9.2.1 rather than git, but I think the results should be significant anyway. I will try to verify on RAID-ed disks, but I'm in short supply of test systems at the moment.

Pgbench's biggest index (on pgbench_accounts.aid) is, unsurprisingly, quite sequential, since it has just been created. So I tested both forward and backward index scans, to get an idea of how the patch impacts sequential and nonsequential workloads. I haven't been able to test truly random ones yet - I'm trying to load up a dump from a production database that's both big and messy enough to test with.
A few things worth noting: the patch avoids prefetching on single-page index accesses, and only issues prefetch when block numbers aren't sequentially increasing (i.e., only when scanning the index nonsequentially), since I noticed the fadvise call was counterproductive in those cases.

So, here we go. The base query I used is:

explain analyze select aid, count(*) from pgbench_accounts
where aid between 10000000 and 200000000
group by aid order by aid;

For backward scans, just order by aid desc.

The full command used is:

sudo bash -c 'echo 3 > /proc/sys/vm/drop_caches' ; \
pg_ctl start -l ${PGLOG} ; sleep 5 ; \
( sleep 10 ; \
  echo 'set effective_io_concurrency to 0;' ; \
  echo 'explain analyze select aid, count(*) from pgbench_accounts where aid between 10000000 and 200000000 group by aid order by aid;' \
) | psql -h localhost -p 5433 pgbench ; \
sleep 5 ; pg_ctl stop

The server is started and stopped every time to make sure the shared cache is empty, and the sleeps are there to keep the backlash I/O from dropping caches from influencing the benchmark.
Results:

With effective_io_concurrency set to 0 (patch disabled):

Forward:

                                    QUERY PLAN
----------------------------------------------------------------------------------
 GroupAggregate  (cost=0.00..4149039.04 rows=90257289 width=4) (actual time=47.552..31113.353 rows=90000001 loops=1)
   ->  Index Only Scan using pgbench_accounts_pkey on pgbench_accounts  (cost=0.00..2795179.71 rows=90257289 width=4) (actual time=47.542..13197.982 rows=90000001 loops=1)
         Index Cond: ((aid >= 10000000) AND (aid <= 200000000))
         Heap Fetches: 0
 Total runtime: 33648.500 ms

I/O throughput averages 60MB/s (clearly sequential)
I/O utilization averages around 30%

Backward:

                                    QUERY PLAN
----------------------------------------------------------------------------------
 GroupAggregate  (cost=0.00..4149039.04 rows=90257289 width=4) (actual time=10.279..159704.590 rows=90000001 loops=1)
   ->  Index Only Scan Backward using pgbench_accounts_pkey on pgbench_accounts  (cost=0.00..2795179.71 rows=90257289 width=4) (actual time=10.266..132853.382 rows=90000001 loops=1)
         Index Cond: ((aid >= 10000000) AND (aid <= 200000000))
         Heap Fetches: 0
 Total runtime: 163202.869 ms

I/O throughput averages 11MB/s (clearly not fully random, but not sequential either)
I/O utilization averages 68%

With effective_io_concurrency set to 1 (patch enabled):

Forward:

                                    QUERY PLAN
----------------------------------------------------------------------------------
 GroupAggregate  (cost=0.00..4149039.04 rows=90257289 width=4) (actual time=47.582..30474.222 rows=90000001 loops=1)
   ->  Index Only Scan using pgbench_accounts_pkey on pgbench_accounts  (cost=0.00..2795179.71 rows=90257289 width=4) (actual time=47.571..12208.340 rows=90000001 loops=1)
         Index Cond: ((aid >= 10000000) AND (aid <= 200000000))
         Heap Fetches: 0
 Total runtime: 32875.695 ms

I/O throughput still averages 60MB/s (the tiny performance improvement is probably just variation, although across the multiple tests I've made there is a systematic bias towards lower times, probably the effect of prefetch on the few cases where the index jumps nonsequentially)
I/O utilization remains around 35%

Backward:

                                    QUERY PLAN
----------------------------------------------------------------------------------
 GroupAggregate  (cost=0.00..4149039.04 rows=90257289 width=4) (actual time=28.190..157708.405 rows=90000001 loops=1)
   ->  Index Only Scan Backward using pgbench_accounts_pkey on pgbench_accounts  (cost=0.00..2795179.71 rows=90257289 width=4) (actual time=28.178..135282.317 rows=90000001 loops=1)
         Index Cond: ((aid >= 10000000) AND (aid <= 200000000))
         Heap Fetches: 0
 Total runtime: 160735.539 ms

I/O throughput averages 12MB/s (a small increase), and the 3-second difference seems related to it (it's consistent).
I/O utilization averages 88% (an important increase)

This last result makes me think deeper prefetching could be beneficial (it would result in read merges), but that's rather hard to implement without a skiplist of leaf pages. Maybe the backward-sequential pattern could be detected; I'll have to tinker with that.

With a hot OS cache, there's no discernible difference in timings.

For heap-including scans forward (using sum(abalance) instead of count(*)), the times are 418522.583ms with the patch and 425497.436ms without (still a very small, still consistent improvement). For heap-including scans backward, the times are 1764391.372ms with the patch (I/O at 8.4MB/s, 95% utilization) and 1794715.164ms without (7.8MB/s at 92% utilization). So, not whopping, but better.

However small the improvement (5% on backward scans), since I see no ill effect, I thought I'd post the patch.
Besides, I'm convinced that, first, with more chaotic indexes the improvement should be noticeably bigger, and second, that with more complex queries the overlap of I/O and computation should be too, especially for queries involving more than one index scan. I believe this because, if I attach an strace to the backend process, the improvement with prefetch jumps to 30% on backward scans, probably due to the increased computation window between reads.