Re: [RFC] Should we fix postmaster to avoid slow shutdown? - Mailing list pgsql-hackers
| From | Alvaro Herrera |
| --- | --- |
| Subject | Re: [RFC] Should we fix postmaster to avoid slow shutdown? |
| Date | |
| Msg-id | 20161122203413.qbad4jrcgevkzdnk@alvherre.pgsql |
| In response to | Re: [RFC] Should we fix postmaster to avoid slow shutdown? (Robert Haas <robertmhaas@gmail.com>) |
| Responses | Re: [RFC] Should we fix postmaster to avoid slow shutdown?; Re: [RFC] Should we fix postmaster to avoid slow shutdown? |
| List | pgsql-hackers |
Robert Haas wrote:
> On Tue, Nov 22, 2016 at 1:37 PM, Alvaro Herrera
> <alvherre@2ndquadrant.com> wrote:
> >> > Yes, I am, and I disagree with you. The current decision on this point
> >> > was made ages ago, before autovacuum even existed let alone relied on
> >> > the stats for proper functioning. The tradeoff you're saying you're
> >> > okay with is "we'll shut down a few seconds faster, but you're going
> >> > to have table bloat problems later because autovacuum won't know it
> >> > needs to do anything". I wonder how many of the complaints we get
> >> > about table bloat are a consequence of people not realizing that
> >> > "pg_ctl stop -m immediate" is going to cost them.
> >>
> >> That would be useful information to have, but I bet the answer is "not
> >> that many". Most people don't shut down their database very often;
> >> they're looking for continuous uptime. It looks to me like autovacuum
> >> activity causes the statistics files to get refreshed at least once
> >> per autovacuum_naptime, which defaults to once a minute, so on the
> >> average we're talking about the loss of perhaps 30 seconds worth of
> >> statistics.
> >
> > I think you're misunderstanding how this works. Losing that file
> > doesn't lose just the final 30 seconds worth of data -- it loses
> > *everything*, and every counter goes back to zero. So it's not a few
> > parts-per-million, it loses however many millions there were.
>
> OK, that's possible, but I'm not sure. I think there are two separate
> issues here. One is whether we should nuke the stats file on
> recovery, and the other is whether we should force a final write of
> the stats file before agreeing to an immediate shutdown. It seems to
> me that the first one affects whether all of the counters go to zero,
> and the second affects whether we lose a small amount of data from
> just prior to the shutdown. Right now, we are doing the first, so the
> second is a waste. If we decide to start doing the first, we can
> independently decide whether to also do the second.

Well, the problem is that the stats data is not on disk while the system
is in operation, as far as I recall -- it's only in the collector's local
memory. On shutdown we tell it to write it down to a file, and on startup
we tell it to read it from the file and then delete it. I think the
rationale for this is to avoid leaving a file with stale data on disk
while the system is running.
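To illustrate the "every counter goes back to zero" point, a quick experiment on a throwaway cluster looks roughly like this (a sketch only: $PGDATA stands for whatever test data directory you use, and it assumes some table activity has already been recorded):

    psql -Atc "SELECT sum(n_tup_ins + n_tup_upd + n_tup_del) FROM pg_stat_user_tables;"
    # -> some non-zero total accumulated since the last stats reset

    pg_ctl -D "$PGDATA" stop -m immediate   # like a crash: the collector never writes its file
    pg_ctl -D "$PGDATA" -w start            # crash recovery discards any leftover stats files

    psql -Atc "SELECT sum(n_tup_ins + n_tup_upd + n_tup_del) FROM pg_stat_user_tables;"
    # -> back to zero, so autovacuum has nothing to base its decisions on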
> > Those writes are slow because of the concurrent activity. If all
> > backends just throw their hands in the air, no more writes come from
> > them, so the OS is going to finish the writes pretty quickly (or at
> > least empty enough of the caches so that the pgstat data fits); so
> > neither (1) nor (3) should be terribly serious. I agree that (2) is a
> > problem, but it's not a problem for everyone.
>
> If the operating system buffer cache doesn't contain much dirty data,
> then I agree. But if there is a large backlog of dirty data there, then
> it might be quite slow.

That's true, but if the system isn't crashing, then writing a bunch of
pages would make room for the pgstat data to be written to the OS, which
is enough (we request only a write, not a flush, as I recall). So we
don't need to wait for a very long period.

> > A fast shutdown is not all that fast -- it needs to write the whole
> > contents of shared buffers down to disk, which may be enormous.
> > Millions of times bigger than pgstat data. So a fast shutdown is
> > actually very slow in a large machine. An immediate shutdown, even if
> > it writes pgstat data, is still going to be much smaller in terms of
> > what is written.
>
> I agree. However, in many cases, the major cost of a fast shutdown is
> getting the dirty data already in the operating system buffers down to
> disk, not in writing out shared_buffers itself. The latter is
> probably a single-digit number of gigabytes, or maybe double-digit.
> The former might be a lot more, and the write of the pgstat file may
> back up behind it. I've seen cases where an 8kB buffered write from
> Postgres takes tens of seconds to complete because the OS buffer cache
> is already saturated with dirty data, and the stats files could easily
> be a lot more than that.

In the default config, background flushing is invoked when memory is 10%
dirty (dirty_background_ratio); foreground flushing is forced when memory
is 40% dirty (dirty_ratio). That means the pgstat process can dirty 30%
additional memory before being forced to perform flushing.
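For reference, the thresholds in question are ordinary Linux sysctls, so it is easy to check what a particular box is actually running with (10 and 40 are the usual defaults, but distributions do change them, and the *_bytes variants override the ratios when set to a non-zero value):

    sysctl vm.dirty_background_ratio vm.dirty_ratio
    sysctl vm.dirty_background_bytes vm.dirty_bytes   # non-zero values take precedence over the ratios
    grep -E '^(Dirty|Writeback):' /proc/meminfo       # dirty data currently waiting for writeback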
--
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services