Reducing spinlock acquisition within clock sweep loop - Mailing list pgsql-hackers
From | Merlin Moncure |
---|---|
Subject | Reducing spinlock acquisition within clock sweep loop |
Date | |
Msg-id | CAHyXU0x_cRPcBXqNeMye_tEZW2Dsy1T0cCDLW+EWQjH_nH6KnQ@mail.gmail.com Whole thread Raw |
Responses |
Re: Reducing spinlock acquisition within clock sweep loop
|
List | pgsql-hackers |
Background: The main sources of contention, buffer_strategy_lock, has been removed FWICT via 5d7962c6 and d72731a7. However, during the sweep each tick locks the buffer header via spinlock in order to to adjust usage_count. I believe that removing this lock is possible optimization fruit. I doubt that this is a major contention point for most workloads (over the years, I've only seen one or two cases where I suspected clock sweep was getting gummed up by buffer locks) but for systems that are already under lock duress it's hard to argue that spamming spinlocks in a tight loop makes things easier for the hardware that is protecting cachelines. In short, clock sweep heavy workloads are issuing a lot of atomic ops. Any savings here is good. Right now usage_count ranges across a small number (currently 0-5), incrementing upon allocation and decrementing on sweep. Because of this, each tick of the clock sweep spinlocks the buffer in preparation of A. the buffer being returned or B. the buffer having a non zero usage count so that it can be safely decremented. B is potential optimization fruit, I think. The idea: Instead, why not let usage_count rotate around on it's own in a clock-like fashion? the idea is that, instead of defining usage count as a fixed offset from zero, it is based on the number of times the buffer has been touched since the last sweep. In other words, each time the buffer clock crosses midnight, 'usage count' for the whole pool drops by one, implicitly. Each time a buffer is touched, it's usage_count is set to 'completePasses' + 1, but, just like today, is never allowed to increment past completePasses + 5. So, what used to be usage_count becomes defined as the distance between the usage_count and completePasses. Because of usage_count tracking rotation, calculating what used to be usage_count requires a bit of logic to arrive at the right number as rotation happens. I'm pretty sure all the rotation edge cases can be handled but I'm glossing over them for now. Clock sweep buffer acquisition currently looks like this: loop: tick(); spinlock(); in use? goto loop usage_count > 0? yes: decrement, goto loop return buffer; Proposed buffer acquisition would look like: loop: tick(); victim < last_victim? yes: atomic re-fetch completePasses (see below) usage_count > completePasses? yes: goto loop try_spinlock(); no lock? goto loop usage_count > completePasses? yes: goto loop in use? goto loop return buffer; (try_spinlock is an adorned TAS() as a opposed to a full spin). Usage count is double checked after lock acquisition in case the local copy is stale. Bad stuff: 1. Complexity. Is it worth it? Calculating proper 'usage_count' distance in the face of two rotating numbers requires some clever coding. 2. Unfortunately, since we've (for very good reasons) got a clock sweep that can be concurrently engaged by many actors, it's no longer safe to assume that completePasses can be safely examined outside of the sweep loop but must intead be checked with every tick. Doing this with another atomic read would defeat the entire purpose of the patch, but perhaps we can check it based on a heuristic: it's examined when victim is less than the last_seen victim (which is locally saved off). ...which would add up to a tiny (zero?) risk of misjudging usage_distance. The fallout, if it were to happen, would be to possibly to return a buffer that would have a usage_count = 1 with the old method. 3. Since proper usage_count, aka distance is managed by being relative to completePasses, which slightly changes the semantics due to the fact the entire buffer pool drop in terms of usage_count at once. I'm doubtful this matters though. Just spitballing here -- wondering if the idea has legs. I think this should compatible with Robert's usage_count tweaking with flags FWICT. I'm inclined to work up a POC patch unless this is unworkable for some reason. The payoff, for workloads with high sweep activity, would be drastically reduced spinlock activity and associated cache line stress. merlin
pgsql-hackers by date: