From: Kyotaro Horiguchi
Subject: Re: [HACKERS] WAL logging problem in 9.4.3?
Msg-id: 20191125.110854.1439648354665518969.horikyota.ntt@gmail.com
In response to: Re: [HACKERS] WAL logging problem in 9.4.3?  (Noah Misch <noah@leadboat.com>)
List: pgsql-hackers
At Sat, 23 Nov 2019 16:21:36 -0500, Noah Misch <noah@leadboat.com> wrote in 
> On Wed, Nov 20, 2019 at 03:05:46PM +0900, Kyotaro Horiguchi wrote:
> > By the way, before finalizing this, I'd like to share the results of a
> > brief benchmark.
> 
> What non-default settings did you use?  Please give the output of this or a
> similar command:

Only wal_level=minimal and max_wal_senders=0.

>   select name, setting from pg_settings where setting <> boot_val;
> 
> If you run more benchmarks and weren't already using wal_buffers=16MB, I
> recommend using it.

Roger.
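
For the record, I plan to apply the settings like this for the next
run (just a sketch of one way to do it; all three require a server
restart):

  ALTER SYSTEM SET wal_level = 'minimal';
  ALTER SYSTEM SET max_wal_senders = 0;
  ALTER SYSTEM SET wal_buffers = '16MB';
  -- after restarting, confirm the non-default settings:
  SELECT name, setting FROM pg_settings WHERE setting <> boot_val;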

> > With 10 pgbench sessions.
> > pages   SYNC     WAL     
> >     1:   915 ms   301 ms
> >     3:  1634 ms   508 ms
> >     5:  1634 ms   293 ms
> >    10:  1671 ms  1043 ms
> >    17:  1600 ms   333 ms
> >    31:  1864 ms   314 ms
> >    56:  1562 ms   448 ms
> >   100:  1538 ms   394 ms
> >   177:  1697 ms  1047 ms
> >   316:  3074 ms  1788 ms
> >   562:  3306 ms  1245 ms
> >  1000:  3440 ms  2182 ms
> >  1778:  5064 ms  6464 ms  # WAL's slope becomes steep
> >  3162:  8675 ms  8165 ms
> 
> For picking a default wal_skip_threshold, it would have been more informative
> to see how this changes pgbench latency statistics.  Some people want DDL to
> be fast, but more people want DDL not to reduce the performance of concurrent
> non-DDL.  This benchmark procedure may help:
> 
> 1. Determine $DDL_COUNT, a number of DDL transactions that take about one
>    minute when done via syncs.
> 2. Start "pgbench -rP1 --progress-timestamp -T180 -c10 -j10".
> 3. Wait 10s.
> 4. Start one DDL backend that runs $DDL_COUNT transactions.
> 5. Save DDL start timestamp, DDL end timestamp, and pgbench output.
> 
> I would compare pgbench tps and latency between the seconds when DDL is and is
> not running.  As you did in earlier tests, I would repeat it using various
> page counts, with and without sync.

I understand that the "DDL" here is not pure DDL but a kind of
define-then-load, like "CREATE TABLE AS", or "CREATE TABLE" followed
by "COPY FROM".

> On Wed, Nov 20, 2019 at 05:31:43PM +0900, Kyotaro Horiguchi wrote:
> > +Prefer to do the same in future access methods.  However, two other approaches
> > +can work.  First, an access method can irreversibly transition a given fork
> > +from WAL-skipping to WAL-writing by calling FlushRelationBuffers() and
> > +smgrimmedsync().  Second, an access method can opt to write WAL
> > +unconditionally for permanent relations.  When using the second method, do not
> > +call RelationCopyStorage(), which skips WAL.
> > 
> > Even using these methods, TransactionCommit flushes out the buffers
> > and then syncs the files again. Isn't a description something like the
> > following needed?
> > 
> > ===
> > Even if an access method switches an in-transaction-created relfilenode
> > to WAL-writing, Commit(Prepare)Transaction still flushes all buffers for
> > the file and then smgrimmedsync()s the file.
> > ===
> 
> It is enough that the text says to prefer the approach that core access
> methods use.  The extra flush and sync when using a non-preferred approach
> wastes some performance, but it is otherwise harmless.

Ah, right. Agreed.

> > +    rel1 = relation_open(r1, AccessExclusiveLock);
> > +    RelationAssumeNewRelfilenode(rel1);
> > 
> > It cannot be accessed from other sessions. Theoretically it doesn't
> > need a lock, but NoLock cannot be used there since there's a path that
> > doesn't take a lock on the relation. AEL seems too strong, though, and
> > it causes unnecessary side effects. Couldn't we use weaker locks?
> 
> We could use NoLock.  I assumed we already hold AccessExclusiveLock, in which
> case this has no side effects.

I forgot that this optimization is used only in non-replication
configurations.  So I agree that AEL has no side effects there.

> On Thu, Nov 21, 2019 at 04:01:07PM +0900, Kyotaro Horiguchi wrote:
> > At Sun, 17 Nov 2019 20:54:34 -0800, Noah Misch <noah@leadboat.com> wrote in 
> > > === Defect 1: gistGetFakeLSN()
> > > 
> > > When I modified pg_regress.c to use wal_level=minimal for all suites,
> > > src/test/isolation/specs/predicate-gist.spec failed the assertion in
> > > gistGetFakeLSN().  One could reproduce the problem just by running this
> > > sequence in psql:
> > > 
> > >           begin;
> > >           create table gist_point_tbl(id int4, p point);
> > >           create index gist_pointidx on gist_point_tbl using gist(p);
> > >           insert into gist_point_tbl (id, p)
> > >           select g, point(g*10, g*10) from generate_series(1, 1000) g;
> > > 
> > > I've included a wrong-in-general hack to make the test pass.  I see two main
> > > options for fixing this:
> > > 
> > > (a) Introduce an empty WAL record that reserves an LSN and has no other
> > > effect.  Make GiST use that for permanent relations that are skipping WAL.
> > > Further optimizations are possible.  For example, we could use a backend-local
> > > counter (like the one gistGetFakeLSN() uses for temp relations) until the
> > > counter is greater a recent real LSN.  That optimization is probably too
> > > clever, though it would make the new WAL record almost never appear.
> > > 
> > > (b) Exempt GiST from most WAL skipping.  GiST index build could still skip
> > > WAL, but it would do its own smgrimmedsync() in addition to the one done at
> > > commit.  Regular GiST mutations would test RELPERSISTENCE_PERMANENT instead of
> > > RelationNeedsWal(), and we'd need some hack for index_copy_data() and possibly
> > > other AM-independent code that skips WAL.
> > > 
> > > Overall, I like the cleanliness of (a).  The main argument for (b) is that it
> > > ensures we have all the features to opt-out of WAL skipping, which could be
> > > useful for out-of-tree index access methods.  (I think we currently have the
> > > features for a tableam to do so, but not for an indexam to do so.)  Overall, I
> > > lean toward (a).  Any other ideas or preferences?
> > 
> > I don't like (b), either.
> > 
> > What we need there is just a sequence of increasing numbers for the
> > page LSN, but they must be compatible with real LSNs. Couldn't we use
> > GetXLogInsertRecPtr() in that case?
> 
> No.  If nothing is inserting WAL, GetXLogInsertRecPtr() does not increase.
> GiST pages need an increasing LSN value.

Sorry, I noticed that after the mail went out.  I agree with (a) and
will do it that way.
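
To check the behavior of (a), I suppose we can look at the page LSNs
of the GiST index with pageinspect (assuming the extension is
installed), for example:

  CREATE EXTENSION IF NOT EXISTS pageinspect;
  SELECT lsn FROM page_header(get_raw_page('gist_pointidx', 0));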

> I noticed an additional defect:
> 
> BEGIN;
> CREATE TABLE t (c) AS SELECT 1;
> CHECKPOINT; -- write and fsync the table's one page
> TRUNCATE t; -- no WAL
> COMMIT; -- no FPI, just the commit record
> 
> If we crash after the COMMIT and before the next fsync or OS-elected sync of
> the table's file, the table will stay on disk with its pre-TRUNCATE content.

The TRUNCATE replaces the relfilenode in the catalog, so the
pre-TRUNCATE content wouldn't be seen after COMMIT.  Since the file
has no pages, it is right that no FPI is emitted.  What we should make
sure of is that the empty file's metadata is synced out.  I think that
kind of failure shouldn't happen on modern file systems, but if we
don't want to rely on such behavior, we can guarantee it by turning
the zero-pages case from WAL logging into a file sync.  I'll do that
in the next version.

I'll post the next version as a single patch.

regards.

-- 
Kyotaro Horiguchi
NTT Open Source Software Center


