Re: Seq scans roadmap - Mailing list pgsql-hackers
From | Heikki Linnakangas |
---|---|
Subject | Re: Seq scans roadmap |
Date | |
Msg-id | 46497E24.6060500@enterprisedb.com |
In response to | Re: Seq scans roadmap (Heikki Linnakangas <heikki@enterprisedb.com>) |
Responses | Re: Seq scans roadmap |
List | pgsql-hackers |
Just to keep you guys informed, I've been busy testing and pondering over different buffer ring strategies for vacuum, seqscans and copy. Here's what I'm going to do:

Use a fixed size ring. Fixed as in it doesn't change after the ring is initialized; however, different kinds of scans use differently sized rings.

I said earlier that it'd be an invasive change to see if a buffer needs a WAL flush and choose another victim if that's the case. I looked at it again and found a pretty clean way of doing that, so I took that approach for seq scans.

1. For VACUUM, use a ring of 32 buffers. 32 buffers is small enough to give the L2 cache benefits and keep cache pollution low, but at the same time it's large enough that it keeps the need to WAL flush reasonable (1/32 of what we do now).

2. For sequential scans, also use a ring of 32 buffers, but whenever a buffer in the ring would need a WAL flush to recycle, we throw it out of the buffer ring instead. On read-only scans (and scans that only update hint bits) this gives the L2 cache benefits and doesn't pollute the buffer cache. On bulk updates, it's effectively the current behavior. On scans that do some updates, it's something in between. In all cases it should be no worse than what we have now. 32 buffers should be large enough to leave a "cache trail" for Jeff's synchronized scans to work.

3. For COPY that doesn't write WAL, use the same strategy as for sequential scans. This keeps the cache pollution low and gives the L2 cache benefits.

4. For COPY that writes WAL, use a large ring of 2048-4096 buffers. We want to use a ring that can accommodate 1 WAL segment worth of data, to avoid having to do any extra WAL flushes, and the WAL segment size is 2048 pages in the default configuration.

Some alternatives I considered but rejected:

* Instead of throwing away dirtied buffers in seq scans, accumulate them in another fixed sized list. When the list gets full, do a WAL flush and put them to the shared freelist or a backend-private freelist. That would eliminate the cache pollution of bulk DELETEs and bulk UPDATEs, and it could be used for vacuum as well. I think this would be the optimal algorithm but I don't feel like inventing something that complicated at this stage anymore. Maybe for 8.4.

* Using a different sized ring for 1st and 2nd vacuum phase. Decided that it's not worth the trouble; the above is already an order of magnitude better than the current behavior.

I'm going to rerun the performance tests I ran earlier with the new patch, tidy it up a bit, and submit it in the next few days. This turned out to be an even more laborious patch to review than I thought. While the patch is short and in the end turned out to be very close to Simon's original patch, there are many different usage scenarios that need to be catered for and tested.

I still need to check the interaction with Jeff's patch. This is close enough to Simon's original patch that I believe the results of the tests Jeff ran earlier are still valid.

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com
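To make the ring behavior described above easier to follow, here is a rough toy model in Python. It is only a sketch and not the patch: the names `BufferRing`, `run_scan` and `pages_per_wal_segment` are hypothetical, the shared buffer pool is reduced to a counter, and it assumes that one WAL flush covers every buffer dirtied before it. The 8 kB page size and 16 MB WAL segment size are the stock defaults.

```python
BLCKSZ = 8192                          # default page size in bytes
WAL_SEGMENT_BYTES = 16 * 1024 * 1024   # default WAL segment size in bytes


def pages_per_wal_segment(segment_bytes=WAL_SEGMENT_BYTES, page_bytes=BLCKSZ):
    """Number of pages that add up to one WAL segment (ring size for WAL-logged COPY)."""
    return segment_bytes // page_bytes


class BufferRing:
    """Toy fixed-size buffer ring (hypothetical model, not buffer manager code)."""

    def __init__(self, size, drop_on_wal_flush):
        self.size = size                          # fixed once the ring is initialized
        self.drop_on_wal_flush = drop_on_wal_flush  # True: seq scan, False: VACUUM
        self.slots = [None] * size
        self.pos = 0
        self.unflushed = set()   # ring buffers dirtied since the last WAL flush
        self.shared_taken = 0    # buffers taken from the shared pool (cache pollution)
        self.wal_flushes = 0     # WAL flushes forced by recycling a ring buffer

    def _take_from_shared_pool(self):
        # Stand-in for the normal victim selection: hand out a fresh buffer id.
        buf = self.shared_taken
        self.shared_taken += 1
        return buf

    def get_buffer(self):
        slot = self.pos
        self.pos = (self.pos + 1) % self.size
        buf = self.slots[slot]
        if buf is None:
            # Ring not full yet: fill the slot from the shared pool.
            buf = self._take_from_shared_pool()
        elif buf in self.unflushed:
            if self.drop_on_wal_flush:
                # Seq scan strategy: throw the buffer out of the ring
                # and pick another victim instead of flushing WAL.
                self.unflushed.discard(buf)
                buf = self._take_from_shared_pool()
            else:
                # VACUUM strategy: flush WAL, then recycle the buffer.
                # Assumption: one flush covers everything dirtied so far.
                self.wal_flushes += 1
                self.unflushed.clear()
        self.slots[slot] = buf
        return buf

    def mark_dirty(self, buf):
        self.unflushed.add(buf)


def run_scan(ring, n_pages, dirties_page):
    """Read n_pages through the ring; dirties_page(page) says whether a page is modified."""
    for page in range(n_pages):
        buf = ring.get_buffer()
        if dirties_page(page):
            ring.mark_dirty(buf)
    return ring


if __name__ == "__main__":
    read_only = run_scan(BufferRing(32, True), 1000, lambda page: False)
    bulk_update = run_scan(BufferRing(32, True), 1000, lambda page: True)
    vacuum = run_scan(BufferRing(32, False), 1024, lambda page: True)
    print(read_only.shared_taken, bulk_update.shared_taken, vacuum.wal_flushes)
    print(pages_per_wal_segment())
```

Under these assumptions, a read-only scan of 1000 pages touches only 32 shared buffers, a bulk update of 1000 pages takes a new buffer for every page (the current behavior), and a VACUUM that dirties 1024 pages does 31 WAL flushes, roughly one per trip around the ring. A scan that dirties only some pages lands in between.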