Seq scans roadmap - Mailing list pgsql-hackers

From Heikki Linnakangas
Subject Seq scans roadmap
Date
Msg-id 46405393.507@enterprisedb.com
Whole thread Raw
Responses Re: Seq scans roadmap
Re: Seq scans roadmap
Re: Seq scans roadmap
List pgsql-hackers
Here's my roadmap for the "scan-resistant buffer cache" and 
"synchronized scans" patches.

1. Fix the current vacuum behavior of throwing dirty buffers to the 
freelist, forcing a lot of WAL flushes. Instead, use a backend-private 
ring of shared buffers that are recycled. This is what Simon's 
"scan-resistant buffer manager" did.

The theory here is that if a page is read in by vacuum, it's unlikely to 
be accessed in the near future, therefore it should be recycled. If 
vacuum doesn't dirty the page, it's best to reuse the buffer immediately 
for the next page. However, if the buffer is dirty (and not just because 
we set hint bits), we ought to delay writing it to disk until the 
corresponding WAL record has been flushed to disk.

Simon's patch used a fixed size ring of buffers that are recycled, but I 
think the ring should be dynamically sized. Start with a small ring, and 
whenever you need to do a WAL flush to write a dirty buffer, increase 
the ring size. On every full iteration through the ring, decrease its 
size to trim down an unnecessarily large ring.

This only alters the behavior of vacuums, and it's pretty safe to say it 
won't get worse than what we have now. In the future, we can use the 
buffer ring for seqscans as well; more on that on step 3.

2. Implement the list/table of last/ongoing seq scan positions. This is 
Jeff's "synchronized scans" patch. When a seq scan starts on a table 
larger than some threshold, it starts from where the previous seq scan 
is currently, or where it ended. This will synchronize the scans so that 
for two concurrent scans the total I/O is halved in the best case. There 
should be no other effect on performance.

If you have a partitioned table, or union of multiple tables or any 
other plan where multiple seq scans are performed in arbitrary order, 
this change won't change the order the partitions are scanned and won't 
therefore ensure they will be synchronized.


Now that we have both pieces of the puzzle in place, it's time to 
consider what more we can do with them:


3A. To take advantage of the "cache trail" of a previous seq scan, scan 
backwards from where the previous seq scan ended, until a you hit a 
buffer that's not in cache.

This will allow taking advantage of the buffer cache even if the table 
doesn't fit completely in RAM. That can make a big difference if the 
table size is just slightly bigger than RAM, and can avoid the nasty 
surprise when a table grows beyond RAM size and queries start taking 
minutes instead of seconds.

This should be a non-controversial change on its own from performance 
point of view. No query should get slower, and some will become faster. 
But see step 3B:

3B. Currently, sequential scans on a large table spoils the buffer cache 
by evicting other pages from the cache. In CVS HEAD, as soon as the 
table is larger than shared_buffers, the pages in the buffer won't be 
used to speed up running the same query again, and there's no reason to 
believe the pages read in would be more useful than any other page in 
the database, and in particular the pages that were in the buffer cache 
before the huge seq scan. If the table being scanned is > 5 * 
shared_buffers, the scan will evict every other page from the cache if 
there's no other activity in the database (max usage_count is 5).

If the table is much larger than shared_buffers, say 10 times as large, 
even with the change 3B to read the pages that are in cache first, using 
all shared_buffers to cache the table will only speed up the query by 
10%. We should not spoil the cache for such a small gain, and use the 
local buffer ring strategy instead. It's better to make queries that are 
slow anyway a little bit slower, than making queries that are normally 
really fast, slow.


As you may notice, 3A and 3B are at odds with each other. We can 
implement both, but you can't use both strategies in the same scan. 
Therefore we need to have decision logic of some kind to figure out 
which strategy is optimal.

A simple heuristic is to decide based on the table size:

< 0.1*shared_buffers    -> start from page 0, keep in cache (like we do now)
< 5 * shared_buffers    -> start from last read page, keep in cache> 5 * shared_buffers    -> start from last read
page,use buffer ring
 

I'm not sure about the constants, we might need to make them GUC 
variables as Simon argued, but that would be the general approach.

In the future, I'm envisioning a smarter algorithm to size the local 
buffer ring. Take all buffers with usage_count=0, plus a few with 
usage_count=1 from the clock sweep. That way if there's a lot of buffers 
in the buffer cache that are seldomly used, we'll use more buffers to 
cache the large scan, and vice versa. And no matter how large the scan, 
it wouldn't blow all buffers from the cache. But if you execute the same 
query again, the buffers left in the cache from the last scan were 
apparently useful, so we use a bigger ring this time.

I'm going to do this incrementally, and we'll see how far we get for 
8.3. We might push 3A and/or 3B to 8.4. First, I'm going to finish up 
Simon's patch (step 1), run some performance tests with vacuum, and 
submit a patch for that. Then I'll move to Jeff's patch (step 2).

Thoughts? Everyone happy with the roadmap?

--   Heikki Linnakangas  EnterpriseDB   http://www.enterprisedb.com


pgsql-hackers by date:

Previous
From: Peter Eisentraut
Date:
Subject: Re: [COMMITTERS] psqlodbc - psqlodbc: Put Autotools-generated files into subdirectory
Next
From: "Luke Lonergan"
Date:
Subject: Re: Seq scans roadmap