Re: Checkpoint sync pause - Mailing list pgsql-hackers

From: Jeff Janes
Subject: Re: Checkpoint sync pause
Msg-id: CAMkU=1xvJjsRusYu8WgfLRRCbocDCraV1A1PqawkJe-WZWNfEw@mail.gmail.com
In response to: Re: Checkpoint sync pause (Greg Smith <gsmith@gregsmith.com>)
Responses: Re: Checkpoint sync pause
List: pgsql-hackers
On Tue, Feb 7, 2012 at 1:22 PM, Greg Smith <gsmith@gregsmith.com> wrote:
> On 02/03/2012 11:41 PM, Jeff Janes wrote:
>>>
>>> - The steady stream of backend writes that happen between checkpoints
>>> have filled up most of the OS write cache. A look at /proc/meminfo
>>> shows around 2.5GB "Dirty:"
>>
>> "backend writes" includes bgwriter writes, right?
>
> Right.
>
>> Has using a newer kernel with dirty_background_bytes been tried, so it
>> could be set to a lower level? If so, how did it do? Or does it just
>> refuse to obey below the 5% level, as well?
>
> Trying to dip below 5% using dirty_background_bytes slows VACUUM down
> faster than it improves checkpoint latency.

Does it cause VACUUM to create latency for other processes (like the
checkpoint syncs do, by gumming up the IO for everyone), or does VACUUM
just slow down without affecting other tasks? It seems to me that just
lowering dirty_background_bytes (while not also lowering dirty_bytes)
should not cause the latter to happen, but it seems like these kernel
tunables never do exactly what they advertise.

This may not be relevant to the current situation, but I wonder if we
don't need a "vacuum_cost_page_dirty_seq" so that if the pages we are
dirtying are consecutive (or at least closely spaced) they cost less, in
anticipation that the eventual writes will be combined and thus consume
less IO resources. I would think it would be common for some regions of
a table to be intensely dirtied and some to be lightly dirtied (but
still aggregating up to a considerable amount of random IO). But the
vacuum process might also need to be made more "bursty", as even if it
generates sequential dirty pages, the IO system might write them
randomly anyway if there are too many delays interspersed.

> Since the sort of servers that have checkpoint issues are quite often
> ones that have VACUUM ones, too, that whole path doesn't seem very
> productive. The one test I haven't tried yet is whether increasing the
> size of the VACUUM ring buffer might improve how well the server
> responds to a lower write cache.

I wouldn't expect this to help. It seems like it would hurt, as it just
leaves the data dirty for even longer (however long it takes to
circumnavigate the ring buffer) before there is any possibility of it
getting written. I guess it does increase the chances that the dirty
pages will "accidentally" get written by the bgwriter rather than the
vacuum itself, but I doubt that that would be significant.

...

>> Was the sorted checkpoint with an fsync after every file (real file,
>> not VFD) one of the changes you tried?
>
> ...
>
> I haven't had very good luck with sorting checkpoints at the PostgreSQL
> relation level on server-size systems. There is a lot of sorting
> already happening at both the OS (~3GB) and BBWC (>=512MB) levels on
> this server. My own tests on my smaller test server--with a scaled
> down OS (~750MB) and BBWC (256MB) cache--haven't ever validated sorting
> as a useful technique on top of that. It's never bubbled up to being
> considered a likely win on the production one as a result.

Without sorted checkpoints (or some other, fancier method) you have to
write out the entire pool before you can do any fsyncs, or you have to
do multiple fsyncs of the same file, with at least one occurring after
the entire pool was written. With a sorted checkpoint, you can start
issuing once-only fsyncs very early in the checkpoint process. I think
that on large servers, that would be the main benefit, not the more
efficient IO itself.
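To make that concrete, here is a rough sketch of the kind of loop I have
in mind (illustrative only, not PostgreSQL's actual checkpoint code;
DirtyBlock and sorted_checkpoint are made-up names, and error handling
and checkpoint pacing are omitted): sort the dirty blocks by file, write
each file's blocks together, and fsync each file as soon as its last
block goes out, so the first fsync is issued almost immediately and each
file is fsynced exactly once.

#include <stdlib.h>
#include <unistd.h>

#define BLCKSZ 8192

typedef struct DirtyBlock
{
    int         fd;         /* already-open file descriptor */
    unsigned    blockno;    /* block number within that file */
    const char *data;       /* BLCKSZ bytes of page image */
} DirtyBlock;

static int
dirty_block_cmp(const void *a, const void *b)
{
    const DirtyBlock *da = (const DirtyBlock *) a;
    const DirtyBlock *db = (const DirtyBlock *) b;

    if (da->fd != db->fd)
        return (da->fd < db->fd) ? -1 : 1;
    if (da->blockno != db->blockno)
        return (da->blockno < db->blockno) ? -1 : 1;
    return 0;
}

/*
 * Write one checkpoint's worth of dirty blocks sorted by file, fsyncing
 * each file as soon as its last block has been written. The first fsync
 * can go out almost immediately, and each file is fsynced exactly once.
 */
static void
sorted_checkpoint(DirtyBlock *blocks, size_t nblocks)
{
    qsort(blocks, nblocks, sizeof(DirtyBlock), dirty_block_cmp);

    for (size_t i = 0; i < nblocks; i++)
    {
        (void) pwrite(blocks[i].fd, blocks[i].data, BLCKSZ,
                      (off_t) blocks[i].blockno * BLCKSZ);

        /* Last block of this file? Then issue its one and only fsync
         * now, rather than waiting for the whole pool to be written. */
        if (i + 1 == nblocks || blocks[i + 1].fd != blocks[i].fd)
            (void) fsync(blocks[i].fd);
    }
}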
(On small servers I've seen sorted checkpoints be much faster on
shutdown checkpoints, but not on natural checkpoints, and presumably
this improvement *is* due to better ordering.)

On your servers, you need big delays between fsyncs and not between
writes (as they are buffered until the fsync). But in other situations,
people need the delays between the writes. By using sorted checkpoints
with an fsync after each file, the delays between writes are naturally
delays between fsyncs as well. So I think the benefit of using sorted
checkpoints is that code to improve your situation is less likely to
degrade someone else's, without having to introduce an extra layer of
tunables.

>> What I/O are they trying to do? It seems like all your data is in
>> RAM (if not, I'm surprised you can get queries to run fast enough to
>> create this much dirty data). So they probably aren't blocking on
>> reads which are being interfered with by all the attempted writes.
>
> Reads on infrequently read data. Long tail again; even though caching
> is close to 100%, the occasional outlier client who wants some rarely
> accessed page with their personal data on it shows up. Pollute the
> write caches badly enough, and what happens to reads mixed into there
> gets very fuzzy. Depends on the exact mechanics of the I/O scheduler
> used in the kernel version deployed.

OK, but I would still think it is a minority of transactions which need
at least one of those infrequently read pages, and most do not. So a few
clients would freeze, but the rest should keep going until they either
try to execute a read themselves, or they run into a heavyweight lock
held by someone else who is blocked on a read. So if 1/1000 of all
transactions need to make a disk read, but clients are running at 100s
of TPS, then I guess after a few tens of seconds all clients will be
blocked on reads and you will see a total freeze-up. But it seems more
likely to me that they are in fact freezing on writes. Is there a way to
directly observe what they are blocking on? I wish "top" would separate
%wait into read and write.

>> The current shared_buffer allocation method (or my misunderstanding
>> of it) reminds me of the joke about the guy who walks into his
>> kitchen with a cow-pie in his hand and tells his wife "Look what I
>> almost stepped in". If you find a buffer that is usagecount=0 and
>> unpinned, but dirty, then why is it dirty? It is likely to be dirty
>> because the background writer can't keep up. And if the background
>> writer can't keep up, it is probably having trouble with writes
>> blocking. So, for Pete's sake, don't try to write it out yourself!
>> If you can't find a clean, reusable buffer in a reasonable number of
>> attempts, I guess at some point you need to punt and write one out.
>> But currently it grabs the first unpinned usagecount=0 buffer it
>> sees and writes it out if dirty, without even checking if the next
>> one might be clean.
>
> Don't forget that in the version deployed here, the background writer
> isn't running during the sync phase.

Oh, I had thought you had compiled your own custom workaround for that.
So, as far as we know, much of the problem might go away upon a new
release and an upgrade?

> I think the direction you're talking about here circles back to "why
> doesn't the BGW just put things it finds clean onto the free list?",

I wouldn't put it that way, because to me the freelist is the code
located in freelist.c. The linked list is a freelist.
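(To ground the terminology in what follows, here is a toy, hypothetical
sketch of the two shapes of "freelist" under discussion; it is not the
actual code in freelist.c, and locking, pins, partitioning, and the
buffer mapping table are all ignored.)

#include <stdbool.h>

#define NBUFFERS 1024

typedef struct BufferDesc
{
    int  usage_count;
    int  refcount;    /* pin count */
    bool dirty;
    int  freeNext;    /* next entry when the buffer sits on the list */
} BufferDesc;

static BufferDesc buffers[NBUFFERS];
static int firstFreeBuffer = -1;   /* head of the linked-list freelist */
static int nextVictimBuffer = 0;   /* the clock hand */

/* "Freelist" as a linked list: pop the head, if there is one. */
static int
get_buffer_from_list(void)
{
    int buf = firstFreeBuffer;

    if (buf >= 0)
        firstFreeBuffer = buffers[buf].freeNext;
    return buf;                    /* -1 means the list is empty */
}

/* "Freelist" as a clock sweep: decrement usage counts until an
 * unpinned, usage_count == 0 buffer comes around. */
static int
get_buffer_from_clock(void)
{
    for (;;)
    {
        int buf = nextVictimBuffer;

        nextVictimBuffer = (nextVictimBuffer + 1) % NBUFFERS;
        if (buffers[buf].refcount == 0)
        {
            if (buffers[buf].usage_count == 0)
                return buf;        /* note: it may still be dirty */
            buffers[buf].usage_count--;
        }
    }
}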
But the clock sweep is also a freelist, just implemented in a different
way. If the hypothetical BGW doesn't remove the entry from the buffer
mapping table and invalidate it when it adds a buffer to the linked
list, then we might pull a "free" buffer from the linked list and
discover it is not actually free. If we want to make it so that it does
remove the entry from the buffer mapping table (which doesn't seem like
a good idea to me), we could implement that just as well with the clock
sweep as we could with the linked list.

I think the linked list is a bit of a red herring. Many of the concepts
people discuss implementing on the linked list could just as easily be
implemented with the clock sweep. And I've seen no evidence at all that
the clock sweep is the problem. The LWLock that protects it can
obviously be a problem, but that seems to be due to the overhead of
acquiring a contended lock, not the work done under the lock. Reducing
the lock strength around this might be a good idea, but that reduction
could be done just as easily (and, as far as I can tell, more easily)
with the clock sweep as with the linked list.

> a direction which would make "nothing on the free list" a noteworthy
> event suggesting the BGW needs to run more often.

Isn't seeing a dirty, unpinned, usage_count==0 buffer in the clock sweep
just as noteworthy as seeing an empty linked list? From what I can tell,
you can't dirty a buffer without pinning it, you can't pin a buffer
without making usage_count>0, and we never decrement usage_count on a
pinned buffer. So the only way to see a dirty buffer that is unpinned
and has zero usage_count is if another normal backend saw it unpinned
and decremented the count, which would have to be a full clock sweep
ago, and the bgwriter hasn't visited it since then.

If our goal is to autotune the bgwriter_* parameters, then detecting
either an empty linked list or a dirty but otherwise usable buffer in
the clock sweep would be a good way to do that. But I think the bigger
issue is to assume that the bgwriter is already tuned as well as it can
be, and that beating on it further will not improve its morale. If the
IO write caches are all full, there is nothing the bgwriter can do about
it by running more often. In that case, we can't really do anything
about the dirty pages it is leaving around our yard. But what we can do
is not pick up those little piles of toxic waste and bring them into our
living rooms. That is, don't try to write out the dirty page in the
foreground; instead, go looking for a clean one. We can evict it without
doing a write, and hopefully we can read in the replacement either from
the OS cache, or from disk if reads are not as gummed up as writes are.

>> But I would think that pgbench can be configured to do that as well,
>> and would probably offer a wider array of other testers. Of course,
>> if they have to copy and specify 30 different -f files, maybe getting
>> dbt-2 to install and run would be easier than that. My attempts at
>> getting dbt-5 to work for me do not make me eager to jump from
>> pgbench to try other things.
>
> dbt-5 is a work in progress, known to be tricky to get going. dbt-2 is
> mature enough that it was used for this sort of role in 8.3
> development. And it's even used by other database systems for similar
> testing. It's the closest thing to an open-source standard for
> write-heavy workloads as we'll find here.

OK, thanks for the reassurance. I'll no longer be afraid to give it a
try if I get an opportunity.

Cheers,

Jeff
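P.S. Since the "go looking for a clean one" idea is easier to show than
to describe, here is a rough, self-contained sketch of what I mean. It
is toy, hypothetical code, not a patch against the real
StrategyGetBuffer; locking, pins, and the buffer mapping table are
ignored, and MAX_DIRTY_SKIPS is an arbitrary made-up limit.

#include <stdbool.h>

#define NBUFFERS        1024
#define MAX_DIRTY_SKIPS 32     /* how hard to keep looking for a clean one */

typedef struct BufferDesc
{
    int  usage_count;
    int  refcount;             /* pin count */
    bool dirty;
} BufferDesc;

static BufferDesc buffers[NBUFFERS];
static int nextVictimBuffer = 0;   /* the clock hand */

/*
 * Victim selection that prefers clean buffers: run the clock sweep as
 * usual, but when an unpinned usage_count == 0 buffer turns out to be
 * dirty, remember it and keep looking for a clean one for a while
 * before punting and handing the dirty one back to be written out.
 */
static int
get_victim_prefer_clean(void)
{
    int skipped = 0;
    int fallback = -1;         /* first dirty candidate we passed over */

    for (;;)
    {
        int buf = nextVictimBuffer;

        nextVictimBuffer = (nextVictimBuffer + 1) % NBUFFERS;

        if (buffers[buf].refcount != 0)
            continue;                      /* pinned: not a candidate */

        if (buffers[buf].usage_count > 0)
        {
            buffers[buf].usage_count--;
            continue;
        }

        if (!buffers[buf].dirty)
            return buf;                    /* clean: evict with no write */

        /* Dirty: note it, but keep looking for a clean buffer. */
        if (fallback < 0)
            fallback = buf;
        if (++skipped >= MAX_DIRTY_SKIPS)
            return fallback;               /* punt: caller must write it */
    }
}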