Re: [HACKERS] Re: Clarifying "server starting" messaging in pg_ctlstart without --wait - Mailing list pgsql-hackers

From Andres Freund
Subject Re: [HACKERS] Re: Clarifying "server starting" messaging in pg_ctlstart without --wait
Date
Msg-id 20170120015427.z7kv5avhju56hr7a@alap3.anarazel.de
Whole thread Raw
In response to Re: [HACKERS] Re: Clarifying "server starting" messaging in pg_ctlstart without --wait  (Stephen Frost <sfrost@snowman.net>)
Responses Re: [HACKERS] Re: Clarifying "server starting" messaging in pg_ctlstart without --wait  (Stephen Frost <sfrost@snowman.net>)
List pgsql-hackers
On 2017-01-19 20:45:57 -0500, Stephen Frost wrote:
> * Andres Freund (andres@anarazel.de) wrote:
> > On 2017-01-19 10:06:09 -0500, Stephen Frost wrote:
> > > WAL replay does do more work, generally speaking (the WAL has to be
> > > read, the checksum validated on it, and then the write has to go out,
> > > while the checkpointer just writes the page out from memory), but it's
> > > also dealing with less contention on the system (there aren't a bunch of
> > > backends hammering the disks to pull data in with reads when you're
> > > doing crash recovery...).
> > 
> > There's a huge difference though: WAL replay is single threaded, whereas
> > generating WAL is not.  
> 
> I'm aware- but *checkpointing* is still single-threaded, unless, as I
> mentioned, you end up with backends pushing out their own changes to the
> heap to make room for new pages to come in.

Sure, but buffer checkpointing isn't necessarily that large a portion of
the work done in one checkpoint cycle, in comparison to all the WAL
being generated.  Quite commonly a lot of the buffers will already have
been flushed to disk by backend and/or bgwriter, and are clean by the
time checkpointer gets to them.  So I don't think checkpointer being
single threaded necessarily means much WRT replay performance.

> > Especially if there's synchronous IO required
> > (most commonly reading in data, because more data was modified in the
> > current checkpointthan fit in shared buffers, so FPIs don't pre-fill
> > buffers), you can be significantly slower than generating the WAL.
> 
> That is an interesting point, if I'm following what you're saying
> correctly- during the replay we can end up having more pages modified
> than fit in shared buffers, which means that we have to read back in
> pages that we pushed out to implement the non-FPI WAL changes to that
> page.

Right. (And not just during replay obviously, also during the intial WAL
generation).

> I wonder if we should have a way to configure the amount of memory
> allowed to be used for WAL replay, independent of shared_buffers?

I don't quite see how that'd work, especially with HS.  We just use the
normal shared buffers code etc, and there we can't just resize the
amount of shared_buffers allocated after doing crash recovery.

> That said, I wonder if our eviction algorithm could be
> improved/changed when performing WAL replay too to reduce the chances
> that we'll have to read a page back in.

I don't think that's a that promising angle of attach. Having a separate
pre-fetching backend that parses the WAL and pre-reads everything
necessary seems more promising.

Greetings,

Andres Freund



pgsql-hackers by date:

Previous
From: Stephen Frost
Date:
Subject: Re: [HACKERS] Re: Clarifying "server starting" messaging in pg_ctlstart without --wait
Next
From: Haribabu Kommi
Date:
Subject: Re: [HACKERS] pg_hba_file_settings view patch