Re: Seq scans roadmap - Mailing list pgsql-hackers

From: Heikki Linnakangas
Subject: Re: Seq scans roadmap
Msg-id: 46497E24.6060500@enterprisedb.com
In response to: Re: Seq scans roadmap (Heikki Linnakangas <heikki@enterprisedb.com>)
List: pgsql-hackers
Just to keep you guys informed, I've been busy testing and pondering 
over different buffer ring strategies for vacuum, seqscans and copy. 
Here's what I'm going to do:

Use a fixed-size ring. Fixed as in it doesn't change after the ring is 
initialized; however, different kinds of scans use differently sized rings.
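The fixed-size ring can be pictured with a toy model (illustrative Python, not PostgreSQL source; the names are mine, not the patch's):

```python
# Toy model of a fixed-size buffer ring: the size is chosen once per
# scan type and never changes; the scan cycles through the same slots.

class BufferRing:
    def __init__(self, nbuffers):
        self.slots = [None] * nbuffers  # buffers reserved for this scan
        self.current = 0

    def next_victim(self):
        """Return the slot index to reuse for the next page read."""
        victim = self.current
        self.current = (self.current + 1) % len(self.slots)
        return victim

# Differently sized rings for different scan types (sizes from this mail):
vacuum_ring = BufferRing(32)
seqscan_ring = BufferRing(32)
copy_wal_ring = BufferRing(2048)
```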

I said earlier that it'd be an invasive change to check whether a buffer 
needs a WAL flush and choose another victim if that's the case. I looked 
at it again and found a pretty clean way of doing it, so I took that 
approach for seq scans.

1. For VACUUM, use a ring of 32 buffers. 32 buffers is small enough to 
give the L2 cache benefits and keep cache pollution low, but at the same 
time it's large enough that it keeps the need to WAL flush reasonable 
(1/32 of what we do now).
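A back-of-the-envelope simulation (a hypothetical Python model, not the patch itself) shows where the 1/32 figure comes from: one flush covers the WAL records of all 32 buffers in the ring, so only the first recycle after each wrap-around forces a flush:

```python
def wal_flushes(npages, ring_size):
    """Count WAL flushes needed while vacuuming npages with a ring.
    Each vacuumed page is dirtied with an increasing LSN; recycling a
    buffer whose LSN is past the flushed point forces a flush, and one
    flush covers every WAL record written so far."""
    flushed_lsn = 0
    ring = [0] * ring_size              # LSN stamped on each slot
    flushes = 0
    for page in range(npages):
        slot = page % ring_size
        if ring[slot] > flushed_lsn:    # dirty, WAL not yet flushed
            flushed_lsn = page          # flush up to current insert LSN
            flushes += 1
        ring[slot] = page + 1           # dirty the buffer with a new LSN
    return flushes
```

With a ring of 32, the flush count drops by roughly a factor of 32 compared to flushing on every recycle.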

2. For sequential scans, also use a ring of 32 buffers, but whenever a 
buffer in the ring would need a WAL flush to recycle, we throw it out of 
the buffer ring instead. On read-only scans (and scans that only update 
hint bits) this gives the L2 cache benefits and doesn't pollute the 
buffer cache. On bulk updates, it's effectively the current behavior. On 
scans that do some updates, it's something in between. In all cases it 
should be no worse than what we have now. 32 buffers should be large 
enough to leave a "cache trail" for Jeff's synchronized scans to work.
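The evict-instead-of-flush rule for seq scans can be sketched like this (illustrative Python, not the actual patch; the helper names are assumptions of mine):

```python
# Sketch: when the ring's victim buffer would require a WAL flush to
# recycle, drop it from the ring and take a fresh buffer from the
# shared pool instead of flushing.

def recycle_or_evict(ring, slot, needs_wal_flush, get_shared_buffer):
    buf = ring[slot]
    if needs_wal_flush(buf):
        # Leave the dirty buffer to the normal bgwriter/checkpoint
        # path; the ring gets a clean replacement from shared buffers.
        ring[slot] = get_shared_buffer()
    return ring[slot]
```

On a read-only scan the predicate never fires and the ring is reused as-is; on a bulk update every victim is evicted, which degenerates to the current behavior, exactly as described above.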

3. For COPY that doesn't write WAL, use the same strategy as for 
sequential scans. This keeps the cache pollution low and gives the L2 
cache benefits.

4. For COPY that writes WAL, use a large ring of 2048-4096 buffers. We 
want to use a ring that can accommodate 1 WAL segment worth of data, to 
avoid having to do any extra WAL flushes, and the WAL segment size is 
2048 pages in the default configuration.
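The 2048-page figure follows directly from the default sizes (16 MB WAL segments, 8 kB pages):

```python
# Sizing arithmetic from the mail: one default WAL segment holds
# exactly 2048 default-sized pages, so a ring of 2048-4096 buffers can
# absorb a full segment's worth of data before recycling any buffer.

WAL_SEGMENT_SIZE = 16 * 1024 * 1024   # bytes, default configuration
BLCKSZ = 8 * 1024                     # bytes per page, default

pages_per_segment = WAL_SEGMENT_SIZE // BLCKSZ
```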

Some alternatives I considered but rejected:

* Instead of throwing away dirtied buffers in seq scans, accumulate them 
in another fixed-size list. When the list gets full, do a WAL flush and 
put them on the shared freelist or a backend-private freelist. That 
would eliminate the cache pollution of bulk DELETEs and bulk UPDATEs, 
and it could be used for vacuum as well. I think this would be the 
optimal algorithm, but I don't feel like inventing something that 
complicated at this stage anymore. Maybe for 8.4.
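For concreteness, the rejected alternative could look roughly like this (a hypothetical sketch, never implemented; all names are mine):

```python
# Toy sketch of the rejected idea: dirtied buffers are parked on a
# fixed-size side list; when it fills, a single WAL flush covers them
# all and they move to a freelist for reuse.

class DirtyList:
    def __init__(self, capacity, flush_wal, freelist):
        self.capacity = capacity
        self.buffers = []
        self.flush_wal = flush_wal    # callable: flush WAL once
        self.freelist = freelist      # receives reusable buffers

    def add(self, buf):
        self.buffers.append(buf)
        if len(self.buffers) == self.capacity:
            self.flush_wal()          # one flush covers all entries
            self.freelist.extend(self.buffers)
            self.buffers.clear()
```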

* Using a different sized ring for the 1st and 2nd vacuum phases. I 
decided it's not worth the trouble; the above is already an order of 
magnitude better than the current behavior.


I'm going to rerun the performance tests I ran earlier with the new 
patch, tidy it up a bit, and submit it in the next few days. This turned 
out to be an even more laborious patch to review than I thought. While 
the patch is short and in the end turned out to be very close to Simon's 
original patch, there are many different usage scenarios that need to be 
catered for and tested.

I still need to check the interaction with Jeff's patch. This is close 
enough to Simon's original patch that I believe the results of the tests 
Jeff ran earlier are still valid.

-- 
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

