Re: WALInsertLock contention - Mailing list pgsql-hackers

From: Tatsuo Ishii
Subject: Re: WALInsertLock contention
Date: 2011-02-17
Msg-id: 20110217.131353.527849116109800679.t-ishii@sraoss.co.jp
In response to: WALInsertLock contention (Robert Haas <robertmhaas@gmail.com>)
Responses: Re: WALInsertLock contention
List: pgsql-hackers
> I've been thinking about the problem of $SUBJECT, and while I know
> it's too early to think seriously about any 9.2 development, I want to
> get my thoughts down in writing while they're fresh in my head.
> 
> It seems to me that there are two basic approaches to this problem.
> We could either split up the WAL stream into several streams, say one
> per database or one per tablespace or something of that sort, or we
> could keep it as a single stream but try not to do so much locking
> whilst in the process of getting it out the door.  Or we could try to
> do both, and maybe ultimately we'll need to.  However, if the second
> one is practical, it's got two major advantages: it'll probably be a
> lot less invasive, and it won't add any extra fsync traffic.  In
> thinking about how we might accomplish the goal of reducing lock
> contention, it occurred to me there's probably no need for the final
> WAL stream to reflect the exact order in which WAL is generated.
> 
> For example, suppose transaction T1 inserts a tuple into table A;
> transaction T2 inserts a tuple into table B; T1 commits; T2 commits.
> The commit records need to be in the right order, and all the actions
> that are part of a given transaction need to precede the associated
> commit record, but, for example, I don't think it would matter if you
> emitted the commit record for T1 before T2's insert into B.  Or you
> could switch the order in which you logged the inserts, since they're
> not touching the same buffers.
> 
> So here's the basic idea.  Each backend, if it so desires, is
> permitted to maintain a per-backend WAL buffer.  Per-backend WAL
> buffers live in shared memory and can be accessed by any backend, but
> the idea is that most of the time only one backend will be accessing
> them, so that the locks won't be heavily contended.  Any WAL written
> to a per-backend WAL buffer will eventually be transferred into the
> main WAL buffers, and flushed.  When a process writes to a per-backend
> WAL buffer, it writes (1) the actual WAL data and (2) the list of
> buffers affected.  Those buffers are stamped with a fake LSN that
> points back to the per-backend WAL buffer, and they can't be written
> until the WAL has been moved from the per-backend WAL buffers to the
> main WAL buffers.
> 
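> To make that a bit more concrete, here is a very rough sketch of
> what the per-backend staging area might look like.  Every name and
> size below is invented purely for illustration; none of it exists
> in the tree today.
> 
> #include <stdint.h>
> #include <stdbool.h>
> 
> #define PER_BACKEND_WAL_BYTES  (64 * 1024)   /* made-up size */
> #define MAX_TOUCHED_BUFFERS    32            /* made-up limit */
> 
> typedef uint64_t FakeableLSN;    /* plays the role of XLogRecPtr */
> 
> /* One WAL record staged in the per-backend buffer, plus the list
>  * of shared buffers it touched (item (2) above). */
> typedef struct StagedRecord
> {
>     uint32_t    offset;                        /* into 'data' below */
>     uint32_t    length;
>     int         touched[MAX_TOUCHED_BUFFERS];  /* buffer ids */
>     int         ntouched;
> } StagedRecord;
> 
> /* Per-backend staging area; lives in shared memory but is normally
>  * only touched by its owner, so its lock should be uncontended. */
> typedef struct BackendWalBuffer
> {
>     int         owner_backend_id;
>     uint32_t    used;                          /* bytes consumed */
>     char        data[PER_BACKEND_WAL_BYTES];   /* item (1) above */
> } BackendWalBuffer;
> 
> /* A "fake" LSN: high bit set, low bits name the owning backend, so
>  * anyone who sees it can find the staging buffer to force out. */
> static inline FakeableLSN
> make_fake_lsn(int backend_id)
> {
>     return ((FakeableLSN) 1 << 63) | (FakeableLSN) backend_id;
> }
> 
> static inline bool
> lsn_is_fake(FakeableLSN lsn)
> {
>     return (lsn >> 63) != 0;
> }
> 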
> So, if a buffer with a fake LSN needs to be (a) written back to the OS
> or (b) modified by a backend other than the one that owns the fake
> LSN, this triggers a flush of the per-backend WAL buffers to the main
> WAL buffers.  When this happens, all the affected buffers get stamped
> with a real LSN and the entries are discarded from the per-backend WAL
> buffers.  Such a flush would also be needed when a backend commits or
> otherwise needs an XLOG flush, or when there's no more per-backend
> buffer space.  In theory, all of this taken together should mean that
> WAL gets pushed out in larger chunks: a transaction that does three
> inserts and commits should only need to grab WALInsertLock once,
> instead of once per heap insert, once per index insert, and again for
> the commit, though it'll have to write a bigger chunk of data when it
> does get the lock.  It'll have to repeatedly grab the lock on its
> per-backend WAL buffer, but ideally that's uncontended.
> 
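> Continuing that sketch, the triggers for pushing the staged WAL out
> to the main buffers could be gathered in one place, roughly like
> this (again, invented names only):
> 
> /* Must this backend push its staged WAL into the main WAL buffers
>  * before it can proceed?  Builds on the sketch above. */
> static bool
> must_flush_staged_wal(BackendWalBuffer *bwb,
>                       FakeableLSN page_lsn, /* LSN on a shared buffer */
>                       bool buffer_being_written_out,
>                       bool other_backend_wants_buffer,
>                       bool commit_or_xlog_flush,
>                       uint32_t next_record_size)
> {
>     /* (a) a buffer carrying a fake LSN has to go to the OS, or
>      * (b) a backend other than the owner wants to modify it */
>     if (lsn_is_fake(page_lsn) &&
>         (buffer_being_written_out || other_backend_wants_buffer))
>         return true;
> 
>     /* commit, or anything else that requires an XLOG flush */
>     if (commit_or_xlog_flush)
>         return true;
> 
>     /* no room left in the per-backend staging area */
>     if (bwb->used + next_record_size > PER_BACKEND_WAL_BYTES)
>         return true;
> 
>     return false;
> }
> 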
> A further refinement would be to try to jigger things so that as a
> backend fills up per-backend WAL buffers, it somehow throws them over
> the fence to one of the background processes to write out.  For
> short-running transactions, that won't really make any difference,
> since the commit will force the per-backend buffers out to the main
> buffers anyway.  But for long-running transactions it seems like it
> could be quite useful; in essence, the task of assembling the final
> WAL stream from the WAL output of individual backends becomes a
> background activity, and ideally the background process doing the work
> is the only one touching the cache lines being shuffled around.  Of
> course, to make this work, backends would need a steady supply of
> available per-backend WAL buffers.  Maybe shared buffers could be used
> for this purpose, with the buffer header being marked in some special
> way to indicate that this is what the buffer's being used for.
> 
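> One way to picture the hand-off is a small shared queue of filled
> staging buffers that the background process drains; once more, just
> a sketch with invented names:
> 
> #define FILLED_QUEUE_SIZE 128                /* made-up */
> 
> typedef struct FilledBufferQueue
> {
>     /* a real version would protect this with a spinlock or LWLock,
>      * held only long enough to adjust head and tail */
>     int                head;
>     int                tail;
>     BackendWalBuffer  *entries[FILLED_QUEUE_SIZE];
> } FilledBufferQueue;
> 
> /* Backend side: throw a filled staging buffer over the fence and go
>  * grab a fresh one.  Returns false if the queue is full, in which
>  * case the backend just drains the buffer itself, as before. */
> static bool
> hand_off_filled_buffer(FilledBufferQueue *q, BackendWalBuffer *bwb)
> {
>     int     next = (q->head + 1) % FILLED_QUEUE_SIZE;
> 
>     if (next == q->tail)
>         return false;
>     q->entries[q->head] = bwb;
>     q->head = next;
>     /* ...and wake the background process here */
>     return true;
> }
> 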
> One not-so-good property of this algorithm is that the operation of
> moving per-backend WAL into the main WAL buffers requires relocking
> all the buffers whose fake LSNs now need to be changed to "real" LSNs.
> That could possibly be problematic from a performance standpoint, and
> there are deadlock risks to worry about too.
> 
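> In outline, the drain step I'm worried about would look something
> like this (pseudo-C with invented placeholder functions, and
> hand-waving over exactly the lock ordering that makes it hard):
> 
> /* Stand-ins for machinery that would have to be written. */
> extern FakeableLSN copy_to_main_wal(const char *data, uint32_t len);
> extern void restamp_buffer_lsn(int buffer_id, FakeableLSN lsn);
> 
> static void
> drain_backend_wal(BackendWalBuffer *bwb, StagedRecord *recs, int nrecs)
> {
>     /* Copy everything we staged into the main WAL buffers in one
>      * go; this is where WALInsertLock gets taken, once. */
>     FakeableLSN real_lsn = copy_to_main_wal(bwb->data, bwb->used);
> 
>     /* Now restamp every buffer we touched: this is the part that
>      * has to re-take all those buffer locks without deadlocking
>      * against anyone else doing the same thing. */
>     for (int i = 0; i < nrecs; i++)
>     {
>         for (int j = 0; j < recs[i].ntouched; j++)
>             restamp_buffer_lsn(recs[i].touched[j], real_lsn);
>     }
> 
>     bwb->used = 0;               /* staging area is empty again */
> }
> 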
> Any thoughts?  Other ideas?

I vaguely recall that UNISYS once posted patches to reduce WAL buffer
lock contention, which raised the CPU scalability limit from 12 to 16
or so (if my memory serves). Is your second idea somewhat related to
those patches?
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese: http://www.sraoss.co.jp

