Re: [RFC] Should we fix postmaster to avoid slow shutdown? - Mailing list pgsql-hackers

From Alvaro Herrera
Subject Re: [RFC] Should we fix postmaster to avoid slow shutdown?
Date
Msg-id 20161122203413.qbad4jrcgevkzdnk@alvherre.pgsql
In response to Re: [RFC] Should we fix postmaster to avoid slow shutdown?  (Robert Haas <robertmhaas@gmail.com>)
List pgsql-hackers
Robert Haas wrote:
> On Tue, Nov 22, 2016 at 1:37 PM, Alvaro Herrera
> <alvherre@2ndquadrant.com> wrote:
> >> > Yes, I am, and I disagree with you.  The current decision on this point
> >> > was made ages ago, before autovacuum even existed let alone relied on
> >> > the stats for proper functioning.  The tradeoff you're saying you're
> >> > okay with is "we'll shut down a few seconds faster, but you're going
> >> > to have table bloat problems later because autovacuum won't know it
> >> > needs to do anything".  I wonder how many of the complaints we get
> >> > about table bloat are a consequence of people not realizing that
> >> > "pg_ctl stop -m immediate" is going to cost them.
> >>
> >> That would be useful information to have, but I bet the answer is "not
> >> that many".  Most people don't shut down their database very often;
> >> they're looking for continuous uptime.  It looks to me like autovacuum
> >> activity causes the statistics files to get refreshed at least once
> >> per autovacuum_naptime, which defaults to once a minute, so on the
> >> average we're talking about the loss of perhaps 30 seconds worth of
> >> statistics.
> >
> > I think you're misunderstanding how this works.  Losing that file
> > doesn't lose just the final 30 seconds worth of data -- it loses
> *everything*, and every counter goes back to zero.  So it's not a few
> parts-per-million; it loses however many millions there were.
> 
> OK, that's possible, but I'm not sure.  I think there are two separate
> issues here.  One is whether we should nuke the stats file on
> recovery, and the other is whether we should force a final write of
> the stats file before agreeing to an immediate shutdown.  It seems to
> me that the first one affects whether all of the counters go to zero,
> and the second affects whether we lose a small amount of data from
> just prior to the shutdown.  Right now, we are doing the first, so the
> second is a waste.  If we decide to stop doing the first, we can
> independently decide whether to also do the second.

Well, the problem is that the stats data is not on disk while the system
is in operation, as far as I recall -- it lives only in the collector's
local memory.  On shutdown we tell the collector to write the data out to
a file, and on startup we tell it to read the file back in and then
delete it.  I think the rationale is to avoid leaving a file with stale
data on disk while the system is running.
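
To sketch that in rough C (a simplified illustration only -- these names
and the file path are hypothetical, not the actual pgstat.c functions):

#include <stdio.h>
#include <unistd.h>

#define STATS_FILE "pg_stat/global.stat"    /* hypothetical path */

/* Shutdown path: dump the collector's in-memory counters to disk. */
static void
write_stats_at_shutdown(const char *data, size_t len)
{
    FILE *fp = fopen(STATS_FILE, "wb");

    if (fp == NULL)
        return;                 /* stats are not critical data */
    fwrite(data, 1, len, fp);
    fclose(fp);                 /* buffered write; no fsync requested */
}

/* Startup path: read the counters back, then remove the file so a
 * stale copy is never left around while the system is running. */
static size_t
read_stats_at_startup(char *buf, size_t bufsize)
{
    FILE   *fp = fopen(STATS_FILE, "rb");
    size_t  n = 0;

    if (fp != NULL)
    {
        n = fread(buf, 1, bufsize, fp);
        fclose(fp);
        unlink(STATS_FILE);
    }
    return n;
}

An immediate shutdown skips the write step entirely, which is why all the
counters come back as zero afterwards.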

> > Those writes are slow because of the concurrent activity.  If all
> > backends just throw their hands in the air, no more writes come from
> > them, so the OS is going to finish the writes pretty quickly (or at
> > least empty enough of the caches so that the pgstat data fits); so
> > neither (1) nor (3) should be terribly serious.  I agree that (2) is a
> > problem, but it's not a problem for everyone.
> 
> If the operating system buffer cache doesn't contain much dirty data,
> then I agree.  But if there is a large backlog of dirty data there, then
> it might be quite slow.

That's true, but if the system isn't crashing, then the OS flushing out a
bunch of pages would make room for the pgstat data to be written into the
OS cache, which is enough (we request only a write, not a flush, as I
recall).  So we shouldn't need to wait very long.
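
To make the write-versus-flush distinction concrete, a minimal sketch
(hypothetical helper, not actual backend code): the data only has to be
accepted into the kernel's page cache; we never wait for it to reach
stable storage.

#include <fcntl.h>
#include <unistd.h>

/* Hand the buffer to the kernel and return; we deliberately skip
 * fsync(), so we never wait for the physical I/O to complete. */
static void
write_without_flush(const char *path, const char *buf, size_t len)
{
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0600);

    if (fd < 0)
        return;
    (void) write(fd, buf, len);     /* lands in the OS cache */
    /* no fsync(fd) here */
    close(fd);
}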

> > A fast shutdown is not all that fast -- it needs to write the whole
> > contents of shared buffers down to disk, which may be enormous.
> > Millions of times bigger than pgstat data.  So a fast shutdown is
> > actually very slow in a large machine.  An immediate shutdown, even if
> > it writes pgstat data, is still going to be much smaller in terms of
> > what is written.
> 
> I agree.  However, in many cases, the major cost of a fast shutdown is
> getting the dirty data already in the operating system buffers down to
> disk, not in writing out shared_buffers itself.  The latter is
> probably a single-digit number of gigabytes, or maybe double-digit.
> The former might be a lot more, and the write of the pgstat file may
> back up behind it.  I've seen cases where an 8kB buffered write from
> Postgres takes tens of seconds to complete because the OS buffer cache
> is already saturated with dirty data, and the stats files could easily
> be a lot more than that.

With the default Linux settings, background flushing is invoked when 10%
of memory is dirty (dirty_background_ratio); foreground flushing is forced
when 40% of memory is dirty (dirty_ratio).  That means the pgstat process
can dirty an additional 30% of memory before being forced to perform
flushing itself.
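
For reference, both thresholds can be read straight out of /proc on
Linux; a trivial check (illustrative only) looks like this:

#include <stdio.h>

/* Print the kernel's dirty-memory thresholds, as percentages. */
static int
read_vm_knob(const char *path)
{
    FILE *fp = fopen(path, "r");
    int   val = -1;

    if (fp != NULL)
    {
        if (fscanf(fp, "%d", &val) != 1)
            val = -1;
        fclose(fp);
    }
    return val;
}

int
main(void)
{
    printf("dirty_background_ratio = %d\n",
           read_vm_knob("/proc/sys/vm/dirty_background_ratio"));
    printf("dirty_ratio            = %d\n",
           read_vm_knob("/proc/sys/vm/dirty_ratio"));
    return 0;
}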

-- 
Álvaro Herrera                https://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services


