Re: [HACKERS] WAL logging problem in 9.4.3? - Mailing list pgsql-hackers

From Kyotaro Horiguchi
Subject Re: [HACKERS] WAL logging problem in 9.4.3?
Date
Msg-id 20190822.210606.07927021.horikyota.ntt@gmail.com
Whole thread Raw
In response to Re: [HACKERS] WAL logging problem in 9.4.3?  (Noah Misch <noah@leadboat.com>)
Responses Re: [HACKERS] WAL logging problem in 9.4.3?  (Noah Misch <noah@leadboat.com>)
List pgsql-hackers
Hello.

At Mon, 19 Aug 2019 23:03:14 -0700, Noah Misch <noah@leadboat.com> wrote in
<20190820060314.GA3086296@rfd.leadboat.com>
> On Mon, Aug 19, 2019 at 06:59:59PM +0900, Kyotaro Horiguchi wrote:
> > At Sat, 17 Aug 2019 20:52:30 -0700, Noah Misch <noah@leadboat.com> wrote in
<20190818035230.GB3021338@rfd.leadboat.com>
> > > For two-phase commit, PrepareTransaction() needs to execute pending syncs.
> > 
> > Now TwoPhaseFileHeader has two new members for (commit-time)
> > pending syncs. Pending-syncs are useless on wal-replay, but that
> > is needed for commit-prepared.
> 
> There's no need to modify TwoPhaseFileHeader or the COMMIT PREPARED sql
> command, which is far too late to be syncing new relation files.  (A crash may
> have already destroyed their data.)  PrepareTransaction(), which implements
> the PREPARE TRANSACTION command, is the right place for these syncs.
> 
> A failure in these new syncs needs to prevent the transaction from being
> marked committed.  Hence, in CommitTransaction(), these new syncs need to

Agreed.

> happen after the last step that could create assign a new relfilenode and
> before RecordTransactionCommit().  I suspect it's best to do it after
> PreCommit_on_commit_actions() and before AtEOXact_LargeObject().

I don't find an obvious problem there. Since pending deletes and
pending syncs are separately processed, I'm planning to make a
separate list for syncs from deletes.

> > > This should sync all forks, not just MAIN_FORKNUM.  Code that writes WAL for
> > > FSM_FORKNUM and VISIBILITYMAP_FORKNUM checks RelationNeedsWAL().  There may be
> > > no bug today, but it's conceptually wrong to make RelationNeedsWAL() return
> > > false due to this code, use RelationNeedsWAL() for multiple forks, and then
> > > not actually sync all forks.
> > 
> > I agree that all forks needs syncing, but FSM and VM are checking
> > RelationNeedsWAL(modified). To make sure, are you suggesting to
> > sync all forks instead of emitting WAL for them, or suggesting
> > that VM and FSM to emit WALs even when the modified
> > RelationNeedsWAL returns false (+ sync all forks)?
> 
> I hadn't thought that far.  What do you think is best?

As in the latest patch, sync ALL forks then no WALs. We could
skip syncing FSM but I'm not sure it's work doing.


> > > The https://postgr.es/m/559FA0BA.3080808@iki.fi design had another component
> > > not appearing here.  It said, "Instead, at COMMIT, we'd fsync() the relation,
> > > or if it's smaller than some threshold, WAL-log the contents of the whole file
> > > at that point."  Please write the part to WAL-log the contents of small files
> > > instead of syncing them.
> > 
> > I'm not sure the point of the behavior. I suppose that the "log"
> > is a sequence of new_page records. It also needs to be synced and
> > it is always larger than the file to be synced. I can't think of
> > an appropriate threshold without the point.
> 
> Yes, it would be a sequence of new-page records.  FlushRelationBuffers() locks
> every buffer header containing a buffer of the current database.  The belief
> has been that writing one page to xlog is cheaper than FlushRelationBuffers()
> in a busy system with large shared_buffers.

I'm at a loss.. The decision between WAL and sync is made at
commit time, when we no longer have a pin on a buffer. When
emitting WAL, opposite to the assumption, lock needs to be
re-acquired for every page to emit log_new_page. What is worse,
we may need to reload evicted buffers.  If the file has been
CopyFrom'ed, ring buffer strategy makes the situnation farther
worse. That doesn't seem cheap at all..

If there were any chance on WAL for smaller files here, it would
be on the files smaller than the ring size of bulk-write
strategy(16MB).

If we pick up every buffer page of the file instead of scanning
through all buffers, that makes things worse by conflicts on
partition locks.

Any thoughts?



# Sorry time's up today.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center



pgsql-hackers by date:

Previous
From: Konstantin Knizhnik
Date:
Subject: Re: Why overhead of SPI is so large?
Next
From: Peter Eisentraut
Date:
Subject: mingw32 floating point diff