Re: Seq scans roadmap - Mailing list pgsql-hackers

From: Simon Riggs
Subject: Re: Seq scans roadmap
Msg-id: 1178702879.10861.29.camel@silverbirch.site
In response to: Seq scans roadmap (Heikki Linnakangas <heikki@enterprisedb.com>)
List: pgsql-hackers

On Tue, 2007-05-08 at 11:40 +0100, Heikki Linnakangas wrote:
> Here's my roadmap for the "scan-resistant buffer cache" and 
> "synchronized scans" patches.
> 
> 1. Fix the current vacuum behavior of throwing dirty buffers to the 
> freelist, forcing a lot of WAL flushes. Instead, use a backend-private 
> ring of shared buffers that are recycled. This is what Simon's 
> "scan-resistant buffer manager" did.
> 
> The theory here is that if a page is read in by vacuum, it's unlikely to 
> be accessed in the near future, therefore it should be recycled. If 
> vacuum doesn't dirty the page, it's best to reuse the buffer immediately 
> for the next page. However, if the buffer is dirty (and not just because 
> we set hint bits), we ought to delay writing it to disk until the 
> corresponding WAL record has been flushed to disk.
> 
> Simon's patch used a fixed size ring of buffers that are recycled, but I 
> think the ring should be dynamically sized. Start with a small ring, and 
> whenever you need to do a WAL flush to write a dirty buffer, increase 
> the ring size. On every full iteration through the ring, decrease its 
> size to trim down an unnecessarily large ring.
> 
> This only alters the behavior of vacuums, and it's pretty safe to say it 
> won't get worse than what we have now. 

I think that's too much code; why not just leave it as it is? Would a
dynamic buffer be substantially better? If not, why bother?
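
For reference, the dynamically sized ring described in the quoted text
could be sketched roughly as follows. This is a toy model, not code from
either patch: the class name DynamicRing, the doubling growth step, the
shrink-by-one step and the min/max bounds are all assumptions made for
illustration. Only the two rules (grow when a WAL flush is forced, shrink
after each full pass) come from the proposal.

```python
class DynamicRing:
    """Toy model of a backend-private buffer ring that resizes itself.

    Hypothetical names and constants; only the grow/shrink rules follow
    the proposal quoted above.
    """

    def __init__(self, min_size=8, max_size=256):
        self.min_size = min_size
        self.max_size = max_size
        self.size = min_size
        self.pos = 0  # index of the next slot to recycle

    def recycle_next(self, needs_wal_flush):
        """Recycle the next slot and return its index.

        needs_wal_flush is True when the buffer in that slot is dirty and
        its WAL record has not reached disk yet, so writing the buffer out
        would force a WAL flush.
        """
        slot = self.pos
        if needs_wal_flush:
            # The ring came around too quickly: grow it (assumed: double).
            self.size = min(self.size * 2, self.max_size)
        self.pos += 1
        if self.pos >= self.size:
            # Full iteration completed: trim the ring (assumed: by one).
            self.pos = 0
            self.size = max(self.size - 1, self.min_size)
        return slot
```

With min_size=4 and max_size=16, three forced flushes in a row take the
ring from 4 to 16 slots, and each later full pass without a forced flush
trims it by one slot. The extra bookkeeping is small, but whether it pays
for itself is exactly the open question.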

> In the future, we can use the 
> buffer ring for seqscans as well; more on that on step 3.

There was a clear benefit from that. You sound like you are suggesting
removing the behaviour for Seq Scans, which wouldn't make much sense?

> 2. Implement the list/table of last/ongoing seq scan positions. This is 
> Jeff's "synchronized scans" patch. When a seq scan starts on a table 
> larger than some threshold, it starts from where the previous seq scan 
> is currently, or where it ended. This will synchronize the scans so that 
> for two concurrent scans the total I/O is halved in the best case. There 
> should be no other effect on performance.
> 
> If you have a partitioned table, or union of multiple tables or any 
> other plan where multiple seq scans are performed in arbitrary order, 
> this change won't change the order the partitions are scanned and won't 
> therefore ensure they will be synchronized.
> 
> 
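
A minimal sketch of the start-position bookkeeping described in step 2,
assuming a simple per-table record of the last reported block. The names
ScanPositions, scan_order and threshold_blocks are hypothetical and are not
taken from the synchronized scans patch.

```python
def scan_order(table_blocks, start_block):
    """Yield every block number exactly once, starting at start_block
    and wrapping around to block 0 at the end of the table."""
    for i in range(table_blocks):
        yield (start_block + i) % table_blocks


class ScanPositions:
    """Remembers where the latest seq scan on each table is (or ended)."""

    def __init__(self, threshold_blocks):
        # Tables at or below this size always start at block 0 (assumed rule).
        self.threshold_blocks = threshold_blocks
        self.last_block = {}

    def start_block(self, table, table_blocks):
        """Pick the block a new seq scan on this table should start from."""
        if table_blocks <= self.threshold_blocks:
            return 0
        return self.last_block.get(table, 0) % table_blocks

    def report(self, table, block):
        """Called by a running scan to publish its current position."""
        self.last_block[table] = block
```

A second scan that starts while the first is at block 400 of a 1000-block
table would read blocks 400..999 and then 0..399, sharing I/O with the
first scan for as long as the two stay close together.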
> Now that we have both pieces of the puzzle in place, it's time to 
> consider what more we can do with them:
> 
> 
> 3A. To take advantage of the "cache trail" of a previous seq scan, scan 
> backwards from where the previous seq scan ended, until you hit a 
> buffer that's not in cache.
> 
> This will allow taking advantage of the buffer cache even if the table 
> doesn't fit completely in RAM. That can make a big difference if the 
> table size is just slightly bigger than RAM, and can avoid the nasty 
> surprise when a table grows beyond RAM size and queries start taking 
> minutes instead of seconds.
> 
> This should be a non-controversial change on its own from a performance 
> point of view. No query should get slower, and some will become faster. 
> But see step 3B:
> 
> 3B. Currently, sequential scans on a large table spoil the buffer cache 
> by evicting other pages from the cache. In CVS HEAD, as soon as the 
> table is larger than shared_buffers, the pages in the buffer won't be 
> used to speed up running the same query again, and there's no reason to 
> believe the pages read in would be more useful than any other page in 
> the database, and in particular the pages that were in the buffer cache 
> before the huge seq scan. If the table being scanned is > 5 * 
> shared_buffers, the scan will evict every other page from the cache if 
> there's no other activity in the database (max usage_count is 5).
> 
> If the table is much larger than shared_buffers, say 10 times as large, 
> even with the change 3A to read the pages that are in cache first, using 
> all shared_buffers to cache the table will only speed up the query by 
> 10%. We should not spoil the cache for such a small gain; we should use 
> the local buffer ring strategy instead. It's better to make queries that 
> are slow anyway a little bit slower than to make queries that are 
> normally really fast slow.
> 
> 
> As you may notice, 3A and 3B are at odds with each other. We can 
> implement both, but you can't use both strategies in the same scan.

Not sure I've seen any evidence of that.

Most scans will be solo and so should use the ring buffer, since there
is clear evidence that it benefits them. If there were evidence that the
two patches conflict, then we should turn off the ring buffer only when
concurrent scans are in progress (while remembering that concurrent
scans will not typically overlap as much as the synch scan tests show,
and so for much of their execution they too will be solo).
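
The 10% figure in the quoted 3B paragraph is simple arithmetic: if every
shared buffer held a page of the table, at most shared_buffers / table size
of the scan could be served from cache. A small sketch, with a hypothetical
function name and sizes expressed in blocks:

```python
def max_cached_fraction(table_blocks, shared_buffers_blocks):
    """Upper bound on the share of a seq scan that shared_buffers can
    serve, assuming the whole cache holds pages of this one table."""
    if table_blocks <= 0:
        raise ValueError("table_blocks must be positive")
    return min(1.0, shared_buffers_blocks / table_blocks)
```

For a table 10 times the size of shared_buffers the bound is 0.1, so even
in the best case 90% of the blocks still have to be read from outside
shared_buffers.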

> Therefore we need to have decision logic of some kind to figure out 
> which strategy is optimal.
> 
> A simple heuristic is to decide based on the table size:
> 
> < 0.1*shared_buffers    -> start from page 0, keep in cache (like we do now)
> < 5 * shared_buffers    -> start from last read page, keep in cache
>  > 5 * shared_buffers    -> start from last read page, use buffer ring
> 
> I'm not sure about the constants, we might need to make them GUC 
> variables as Simon argued, but that would be the general approach.

If you want to hardcode it, I'd say use the ring buffer on scans of 1000
blocks or more; otherwise we have a GUC. Sizing things to shared_buffers
isn't appropriate because of the effects of partitioning, as I argued in
my last post, and I think that argument is still valid.
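
To make the two rules easy to compare, here is a rough sketch of both. The
function names are hypothetical, sizes are in blocks, and the handling of a
table of exactly 5 * shared_buffers is an assumption, since the quoted
table leaves that case open.

```python
def size_based_strategy(table_blocks, shared_buffers_blocks):
    """The quoted heuristic: choose by table size relative to shared_buffers.

    Returns (where the scan starts, how its buffers are treated).
    """
    if table_blocks < 0.1 * shared_buffers_blocks:
        return ("page 0", "keep in cache")
    if table_blocks < 5 * shared_buffers_blocks:
        return ("last read page", "keep in cache")
    # Exactly 5 * shared_buffers is unspecified in the table; assumed here.
    return ("last read page", "buffer ring")


def fixed_threshold_uses_ring(scan_blocks, ring_threshold_blocks=1000):
    """The alternative suggested here: use the ring buffer for any scan of
    1000 blocks or more, regardless of shared_buffers. The threshold could
    equally be a GUC."""
    return scan_blocks >= ring_threshold_blocks
```

With shared_buffers of 1000 blocks, a 2000-block partition is kept in cache
under the size-based rule but uses the ring under the fixed threshold, and
a query touching many such partitions behaves very differently under the
two rules.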

> Thoughts? Everyone happy with the roadmap?

I think separating the patches is now the best way forward, though both
are good.

--  Simon Riggs              EnterpriseDB   http://www.enterprisedb.com



