Thread: WALInsertLock contention
I've been thinking about the problem of $SUBJECT, and while I know it's too early to think seriously about any 9.2 development, I want to get my thoughts down in writing while they're fresh in my head.

It seems to me that there are two basic approaches to this problem. We could either split up the WAL stream into several streams, say one per database or one per tablespace or something of that sort, or we could keep it as a single stream but try not to do so much locking whilst in the process of getting it out the door. Or we could try to do both, and maybe ultimately we'll need to. However, if the second one is practical, it's got two major advantages: it'll probably be a lot less invasive, and it won't add any extra fsync traffic. In thinking about how we might accomplish the goal of reducing lock contention, it occurred to me there's probably no need for the final WAL stream to reflect the exact order in which WAL is generated.

For example, suppose transaction T1 inserts a tuple into table A; transaction T2 inserts a tuple into table B; T1 commits; T2 commits. The commit records need to be in the right order, and all the actions that are part of a given transaction need to precede the associated commit record, but, for example, I don't think it would matter if you emitted the commit record for T1 before T2's insert into B. Or you could switch the order in which you logged the inserts, since they're not touching the same buffers.

So here's the basic idea. Each backend, if it so desires, is permitted to maintain a per-backend WAL buffer. Per-backend WAL buffers live in shared memory and can be accessed by any backend, but the idea is that most of the time only one backend will be accessing them, so that the locks won't be heavily contended. Any WAL written to a per-backend WAL buffer will eventually be transferred into the main WAL buffers, and flushed.
When a process writes to a per-backend WAL buffer, it writes (1) the actual WAL data and (2) the list of buffers affected. Those buffers are stamped with a fake LSN that points back to the per-backend WAL buffer, and they can't be written until the WAL has been moved from the per-backend WAL buffers to the main WAL buffers.

So, if a buffer with a fake LSN needs to be (a) written back to the OS or (b) modified by a backend other than the one that owns the fake LSN, this triggers a flush of the per-backend WAL buffers to the main WAL buffers. When this happens, all the affected buffers get stamped with a real LSN and the entries are discarded from the per-backend WAL buffers. Such a flush would also be needed when a backend commits or otherwise needs an XLOG flush, or when there's no more per-backend buffer space. In theory, all of this taken together should mean that WAL gets pushed out in larger chunks: a transaction that does three inserts and commits should only need to grab WALInsertLock once, instead of once per heap insert, once per index insert, and again for the commit, though it'll have to write a bigger chunk of data when it does get the lock. It'll have to repeatedly grab the lock on its per-backend WAL buffer, but ideally that's uncontended.

A further refinement would be to try to jigger things so that as a backend fills up per-backend WAL buffers, it somehow throws them over the fence to one of the background processes to write out. For short-running transactions, that won't really make any difference, since the commit will force the per-backend buffers out to the main buffers anyway. But for long-running transactions it seems like it could be quite useful; in essence, the task of assembling the final WAL stream from the WAL output of individual backends becomes a background activity, and ideally the background process doing the work is the only one touching the cache lines being shuffled around.
Of course, to make this work, backends would need a steady supply of available per-backend WAL buffers. Maybe shared buffers could be used for this purpose, with the buffer header being marked in some special way to indicate that this is what the buffer's being used for.

One not-so-good property of this algorithm is that the operation of moving per-backend WAL into the main WAL buffers requires relocking all the buffers whose fake LSNs now need to be changed to "real" LSNs. That could possibly be problematic from a performance standpoint, and there are deadlock risks to worry about too.

Any thoughts? Other ideas?

-- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
> I've been thinking about the problem of $SUBJECT, and while I know > it's too early to think seriously about any 9.2 development, I want to > get my thoughts down in writing while they're fresh in my head. > > It seems to me that there are two basic approaches to this problem. > We could either split up the WAL stream into several streams, say one > per database or one per tablespace or something of that sort, or we > could keep it as a single stream but try not to do so much locking > whilst in the process of getting it out the door. Or we could try to > do both, and maybe ultimately we'll need to. However, if the second > one is practical, it's got two major advantages: it'll probably be a > lot less invasive, and it won't add any extra fsync traffic. In > thinking about how we might accomplish the goal of reducing lock > contention, it occurred to me there's probably no need for the final > WAL stream to reflect the exact order in which WAL is generated. > > For example, suppose transaction T1 inserts a tuple into table A; > transaction T2 inserts a tuple into table B; T1 commits; T2 commits. > The commit records need to be in the right order, and all the actions > that are part of a given transaction need to precede the associated > commit record, but, for example, I don't think it would matter if you > emitted the commit record for T1 before T2's insert into B. Or you > could switch the order in which you logged the inserts, since they're > not touching the same buffers. > > So here's the basic idea. Each backend, if it so desires, is > permitted to maintain a per-backend WAL buffer. Per-backend WAL > buffers live in shared memory and can be accessed by any backend, but > the idea is that most of the time only one backend will be accessing > them, so that the locks won't be heavily contended. Any WAL written > to a per-backend WAL buffer will eventually be transferred into the > main WAL buffers, and flushed. 
When a process writes to a per-backend > WAL buffer, it writes (1) the actual WAL data and (2) the list of > buffers affected. Those buffers are stamped with a fake LSN that > points back to the per-backend WAL buffer, and they can't be written > until the WAL has been moved from the per-backend WAL buffers to the > main WAL buffers. > > So, if a buffer with a fake LSN needs to be (a) written back to the OS > or (b) modified by a backend other than the one that owns the fake > LSN, this triggers a flush of the per-backend WAL buffers to the main > WAL buffers. When this happens, all the affected buffers get stamped > with a real LSN and the entries are discarded from the per-backend WAL > buffers. Such a flush would also be needed when a backend commits or > otherwise needs an XLOG flush, or when there's no more per-backend > buffer space. In theory, all of this taken together should mean that > WAL gets pushed out in larger chunks: a transaction that does three > inserts and commits should only need to grab WALInsertLock once, > instead of once per heap insert, once per index insert, and again for > the commit, though it'll have to write a bigger chunk of data when it > does get the lock. It'll have to repeatedly grab the lock on its > per-backend WAL buffer, but ideally that's uncontended. > > A further refinement would be to try to jigger things so that as a > backend fills up per-backend WAL buffers, it somehow throws them over > the fence to one of the background processes to write out. For > short-running transactions, that won't really make any difference, > since the commit will force the per-backend buffers out to the main > buffers anyway. 
But for long-running transactions it seems like it > could be quite useful; in essence, the task of assembling the final > WAL stream from the WAL output of individual backends becomes a > background activity, and ideally the background process doing the work > is the only one touching the cache lines being shuffled around. Of > course, to make this work, backends would need a steady supply of > available per-backend WAL buffers. Maybe shared buffers could be used > for this purpose, with the buffer header being marked in some special > way to indicate that this is what the buffer's being used for. > > One not-so-good property of this algorithm is that the operation of > moving per-backend WAL into the main WAL buffers requires relocking > all the buffers whose fake LSNs now need to be changed to "real" LSNs. > That could possibly be problematic from a performance standpoint, and > there are deadlock risks to worry about too. > > Any thoughts? Other ideas?

I vaguely recall that UNISYS used to present patches to reduce the WAL buffer lock contention and enhanced the CPU scalability limit from 12 to 16 or so (if my memory serves). Your second idea is somewhat related to the patches? -- Tatsuo Ishii SRA OSS, Inc. Japan English: http://www.sraoss.co.jp/index_en.php Japanese: http://www.sraoss.co.jp
On Wed, Feb 16, 2011 at 11:13 PM, Tatsuo Ishii <ishii@postgresql.org> wrote: > I vaguely recall that UNISYS used to present patches to reduce the WAL > buffer lock contention and enhanced the CPU scalability limit from 12 > to 16 or so(if my memory serves). Your second idea is somewhat related > to the patches? Not sure. Do you have a link to the archives, or any idea when this discussion occurred/what the subject line was? -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
* Robert Haas (robertmhaas@gmail.com) wrote: > On Wed, Feb 16, 2011 at 11:13 PM, Tatsuo Ishii <ishii@postgresql.org> wrote: > > I vaguely recall that UNISYS used to present patches to reduce the WAL > > buffer lock contention and enhanced the CPU scalability limit from 12 > > to 16 or so(if my memory serves). Your second idea is somewhat related > > to the patches? > > Not sure. Do you have a link to the archives, or any idea when this > discussion occurred/what the subject line was? They presented at PgCon a couple of years in a row, iirc.. http://www.pgcon.org/2007/schedule/events/16.en.html I thought there was another one but I'm not finding it atm.. Thanks, Stephen
>> Not sure. Do you have a link to the archives, or any idea when this >> discussion occurred/what the subject line was? > > They presented at PgCon a couple of years in a row, iirc.. > > http://www.pgcon.org/2007/schedule/events/16.en.html

Yes, this one. On page 18, they talked about their customized version of PostgreSQL called "Postgres 8.2.4-uis":

Change WALInsertLock access
- Using SpinLockAcquire() as WALInsertLock locked most of time
- Considering a queue mechanism for WALInsertLock

I'm not sure if they brought their patches to public or not though... -- Tatsuo Ishii SRA OSS, Inc. Japan English: http://www.sraoss.co.jp/index_en.php Japanese: http://www.sraoss.co.jp
On Wed, Feb 16, 2011 at 11:02 PM, Robert Haas <robertmhaas@gmail.com> wrote: > I've been thinking about the problem of $SUBJECT, and while I know > it's too early to think seriously about any 9.2 development, I want to > get my thoughts down in writing while they're fresh in my head. > > It seems to me that there are two basic approaches to this problem. > We could either split up the WAL stream into several streams, say one > per database or one per tablespace or something of that sort, or we > could keep it as a single stream but try not to do so much locking > whilst in the process of getting it out the door. Or we could try to > do both, and maybe ultimately we'll need to. However, if the second > one is practical, it's got two major advantages: it'll probably be a > lot less invasive, and it won't add any extra fsync traffic. In > thinking about how we might accomplish the goal of reducing lock > contention, it occurred to me there's probably no need for the final > WAL stream to reflect the exact order in which WAL is generated. > > For example, suppose transaction T1 inserts a tuple into table A; > transaction T2 inserts a tuple into table B; T1 commits; T2 commits. > The commit records need to be in the right order, and all the actions > that are part of a given transaction need to precede the associated > commit record, but, for example, I don't think it would matter if you > emitted the commit record for T1 before T2's insert into B. Or you > could switch the order in which you logged the inserts, since they're > not touching the same buffers. > > So here's the basic idea. Each backend, if it so desires, is > permitted to maintain a per-backend WAL buffer. Per-backend WAL > buffers live in shared memory and can be accessed by any backend, but > the idea is that most of the time only one backend will be accessing > them, so that the locks won't be heavily contended. 
Any WAL written > to a per-backend WAL buffer will eventually be transferred into the > main WAL buffers, and flushed. When a process writes to a per-backend > WAL buffer, it writes (1) the actual WAL data and (2) the list of > buffers affected. Those buffers are stamped with a fake LSN that > points back to the per-backend WAL buffer, and they can't be written > until the WAL has been moved from the per-backend WAL buffers to the > main WAL buffers. > > So, if a buffer with a fake LSN needs to be (a) written back to the OS > or (b) modified by a backend other than the one that owns the fake > LSN, this triggers a flush of the per-backend WAL buffers to the main > WAL buffers. When this happens, all the affected buffers get stamped > with a real LSN and the entries are discarded from the per-backend WAL > buffers. Such a flush would also be needed when a backend commits or > otherwise needs an XLOG flush, or when there's no more per-backend > buffer space. In theory, all of this taken together should mean that > WAL gets pushed out in larger chunks: a transaction that does three > inserts and commits should only need to grab WALInsertLock once, > instead of once per heap insert, once per index insert, and again for > the commit, though it'll have to write a bigger chunk of data when it > does get the lock. It'll have to repeatedly grab the lock on its > per-backend WAL buffer, but ideally that's uncontended.

There's probably an obvious explanation that I'm not seeing, but if you're not delegating the work of writing the buffers out to someone else, why do you need to lock the per backend buffer at all? That is, why does it have to be in shared memory?
Suppose the following are true:

*) Writing qualifying data (non commit, non switch)
*) There is room left in whatever you are copying to

Then you could trylock WALInsertLock, and if you fail to get it, just copy the qualifying data into a private buffer and punt...otherwise just do the current behavior. When you *do* get a lock, either because you got lucky or because you had to wait anyways, you write out the data you previously staged, fixing up the LSNs as you go. Even if you do have to write it to shared memory, I think your idea is a winner -- probably a fair amount of work can get done before ultimately being forced to wait...maybe enough to change the scaling dynamics.

> A further refinement would be to try to jigger things so that as a > backend fills up per-backend WAL buffers, it somehow throws them over > the fence to one of the background processes to write out. For > short-running transactions, that won't really make any difference, > since the commit will force the per-backend buffers out to the main > buffers anyway. But for long-running transactions it seems like it > could be quite useful; in essence, the task of assembling the final > WAL stream from the WAL output of individual backends becomes a > background activity, and ideally the background process doing the work > is the only one touching the cache lines being shuffled around. Of > course, to make this work, backends would need a steady supply of > available per-backend WAL buffers. Maybe shared buffers could be used > for this purpose, with the buffer header being marked in some special > way to indicate that this is what the buffer's being used for.

That seems complicated -- plus I think the key is to distribute as much of the work as possible. Why would the forward lateral to the background processor not require a similar lock to WALInsertLock? merlin
On Wed, Jun 8, 2011 at 1:59 AM, Merlin Moncure <mmoncure@gmail.com> wrote: > There's probably an obvious explanation that I'm not seeing, ... Yep. :-) > but if > you're not delegating the work of writing the buffers out to someone > else, why do you need to lock the per backend buffer at all? That is, > why does it have to be in shared memory? Suppose that if the > following are true: > *) Writing qualifying data (non commit, non switch) > *) There is room left in whatever you are copying to > you could trylock WalInsertLock, and if failing to get it, just copy > qualifying data into a private buffer and punt if the following are > true...otherwise just do the current behavior. And here it is: Writing a buffer requires a write & fsync of WAL through the buffer LSN. If the WAL for the buffers were completely inaccessible to other backends, then those buffers would be pinned in shared memory. Which would make things very difficult at buffer eviction time, or for checkpoints. At any rate, even if it were possible to make it work, it'd be a misplaced optimization. It isn't touching shared memory - or even touching the LWLock - that's expensive; it's the LWLock contention that kills you, either because stuff blocks, or just because the CPUs burn a lot of cycles fighting over cache lines. An LWLock that is typically taken by only one backend at a time is pretty cheap. I suppose I couldn't afford to be so blasé if we were trying to scale to 2048-core systems where even inserting a memory barrier is expensive enough to worry about, but we've got a ways to go before we need to start worrying about that. [...snip...] >> A further refinement would be to try to jigger things so that as a >> backend fills up per-backend WAL buffers, it somehow throws them over >> the fence to one of the background processes to write out. 
For >> short-running transactions, that won't really make any difference, >> since the commit will force the per-backend buffers out to the main >> buffers anyway. But for long-running transactions it seems like it >> could be quite useful; in essence, the task of assembling the final >> WAL stream from the WAL output of individual backends becomes a >> background activity, and ideally the background process doing the work >> is the only one touching the cache lines being shuffled around. Of >> course, to make this work, backends would need a steady supply of >> available per-backend WAL buffers. Maybe shared buffers could be used >> for this purpose, with the buffer header being marked in some special >> way to indicate that this is what the buffer's being used for. > > That seems complicated -- plus I think the key is to distribute as > much of the work as possible. Why would the forward lateral to the > background processor not require a similar lock to WalInsertLock? Well, that's the problem. It would. Now, in an ideal world, you might still hope to get some benefit: only the background writer would typically be writing to the real WAL stream, so that's not contended. And the contention between the background writer and the individual backends is only two-way. There's no single point where you have every process on the system piling on to a single lock. But I'm not sure we can really make it work well enough to do more than nibble around at the edges of the problem. Consider: INSERT INTO foo VALUES (1,2,3); This is going to generate XLOG_HEAP_INSERT followed by XLOG_XACT_COMMIT. And now it wants to flush WAL. So now you're pretty much forced to have it go perform the serialization operation itself, and you're right back in contention soup. 
Batching two records together and inserting them in one operation is presumably going to be more efficient than inserting them one at a time, but not all that much more efficient; and there are bookkeeping and memory bandwidth costs to get there. If we are dealing with long-running transactions, or asynchronous commit, then this approach might have legs -- but I suspect that in real life most transactions are small, and the default configuration is synchronous_commit=on. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Wed, Jun 8, 2011 at 7:44 AM, Robert Haas <robertmhaas@gmail.com> wrote: > On Wed, Jun 8, 2011 at 1:59 AM, Merlin Moncure <mmoncure@gmail.com> wrote: >> There's probably an obvious explanation that I'm not seeing, ... > > Yep. :-) > >> but if >> you're not delegating the work of writing the buffers out to someone >> else, why do you need to lock the per backend buffer at all? That is, >> why does it have to be in shared memory? Suppose that if the >> following are true: >> *) Writing qualifying data (non commit, non switch) >> *) There is room left in whatever you are copying to >> you could trylock WalInsertLock, and if failing to get it, just copy >> qualifying data into a private buffer and punt if the following are >> true...otherwise just do the current behavior. > > And here it is: Writing a buffer requires a write & fsync of WAL > through the buffer LSN. If the WAL for the buffers were completely > inaccessible to other backends, then those buffers would be pinned in > shared memory. Which would make things very difficult at buffer > eviction time, or for checkpoints.

Well, (bear with me here) I'm not giving up that easy. Pinning a judiciously small number of buffers in shared memory so you can reduce congestion on the insert lock might be an acceptable trade-off in high contention scenarios...in fact I assumed that was the whole point of your original idea, which I still think has tremendous potential. Obviously, you wouldn't want more than a very small percentage of shared buffers overall (say 1-10% max) to be pinned in this way. The trylock is an attempt to cap the downside case so that you aren't unnecessarily pinning buffers in say, long running i/o bound transactions where insert lock contention is low. Maybe you could experiment with very small private insert buffer sizes (say 64 kb) that would hopefully provide some of the benefits (if there are in fact any) and mitigate potential costs.
Another tweak you could make is that, once you've trylocked and failed to acquire within a transaction, you always punt from there on in until you fill up or need to block per ordering requirements. Or maybe the whole thing doesn't help at all...just trying to understand the problem better. > At any rate, even if it were possible to make it work, it'd be a > misplaced optimization. It isn't touching shared memory - or even > touching the LWLock - that's expensive; it's the LWLock contention > that kills you, either because stuff blocks, or just because the CPUs > burn a lot of cycles fighting over cache lines. An LWLock that is > typically taken by only one backend at a time is pretty cheap. I > suppose I couldn't afford to be so blasé if we were trying to scale to > 2048-core systems where even inserting a memory barrier is expensive > enough to worry about, but we've got a ways to go before we need to > start worrying about that. Right -- although it isn't so much of an optimization (although you still want to do everything reasonable to keep work under the lock as light as possible, and shm->shm copy is going to be slower than mem->shm) as a simplification trade-off. You don't have to worry about deadlocks messing around with your per backend buffers during your internal 'flush', and it's generally just easier messing around with private memory (less code, less locking, less everything). One point i'm missing though. Getting back to your original idea, how does writing to shmem prevent you from having to keep buffers pinned? I'm reading your comment here: "Those buffers are stamped with a fake LSN that points back to the per-backend WAL buffer, and they can't be written until the WAL has been moved from the per-backend WAL buffers to the main WAL buffers." That suggests to me that you have to keep them pinned anyways. I'm still a bit fuzzy on how the per-backend buffers being in shm conveys any advantage.
IOW, (trying not to be obtuse) under what circumstances would backend A want to read from or (especially) write to backend B's wal buffer? merlin
On Wed, Jun 8, 2011 at 10:18 AM, Merlin Moncure <mmoncure@gmail.com> wrote: > One point i'm missing though. Getting back to your original idea, how > does writing to shmem prevent you from having to keep buffers pinned? > I'm reading your comment here: > "Those buffers are stamped with a fake LSN that > points back to the per-backend WAL buffer, and they can't be written > until the WAL has been moved from the per-backend WAL buffers to the > main WAL buffers." > > That suggests to me that you have to keep them pinned anyways. I'm > still a bit fuzzy on how the per-backend buffers being in shm conveys > any advantage. IOW, (trying not to be obtuse) under what > circumstances would backend A want to read from or (especially) write > to backend B's wal buffer? If backend A needs to evict a buffer with a fake LSN, it can go find the WAL that needs to be serialized, do that, flush WAL, and then evict the buffer. IOW, backend A's private WAL buffer will not be completely private. Only A will write to the buffer, but we don't know who will remove WAL from the buffer and insert it into the main stream. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Jun 8, 2011, at 10:15 AM, Robert Haas wrote: >> That suggests to me that you have to keep them pinned anyways. I'm >> still a bit fuzzy on how the per-backend buffers being in shm conveys >> any advantage. IOW, (trying not to be obtuse) under what >> circumstances would backend A want to read from or (especially) write >> to backend B's wal buffer? > > If backend A needs to evict a buffer with a fake LSN, it can go find > the WAL that needs to be serialized, do that, flush WAL, and then > evict the buffer. Isn't the only time that you'd need to evict if you ran out of buffers? If the buffer was truly private, would that still be an issue? Perhaps the only way to make that work is multiple WAL streams, as was originally suggested... -- Jim C. Nasby, Database Architect jim@nasby.net 512.569.9461 (cell) http://jim.nasby.net
On Wed, Jun 8, 2011 at 6:49 PM, Jim Nasby <jim@nasby.net> wrote: >> If backend A needs to evict a buffer with a fake LSN, it can go find >> the WAL that needs to be serialized, do that, flush WAL, and then >> evict the buffer. > > Isn't the only time that you'd need to evict if you ran out of buffers? Sure, but that happens all the time. See pg_stat_bgwriter.buffers_backend. > If the buffer was truly private, would that still be an issue? I'm not sure if you mean make the buffer private or make the WAL storage arena private, but I'm pretty well convinced that neither one can work. > Perhaps the only way to make that work is multiple WAL streams, as was originally suggested... Maybe... but I hope not. I just found an academic paper on this subject, about which I will post shortly. -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Wed, Jun 8, 2011 at 10:21 PM, Robert Haas <robertmhaas@gmail.com> wrote: > On Wed, Jun 8, 2011 at 6:49 PM, Jim Nasby <jim@nasby.net> wrote: >>> If backend A needs to evict a buffer with a fake LSN, it can go find >>> the WAL that needs to be serialized, do that, flush WAL, and then >>> evict the buffer. >> >> Isn't the only time that you'd need to evict if you ran out of buffers? > > Sure, but that happens all the time. See pg_stat_bgwriter.buffers_backend. > >> If the buffer was truly private, would that still be an issue? > > I'm not sure if you mean make the buffer private or make the WAL > storage arena private, but I'm pretty well convinced that neither one > can work. You're probably right. I think though there is enough hypothetical upside to the private buffer case that it should be attempted just to see what breaks. The major tricky bit is dealing with the new pin/unpin mechanics. I'd like to give it the 'college try'. (being typically vain and attention seeking, this is right up my alley) :-D. >> Perhaps the only way to make that work is multiple WAL streams, as was originally suggested... If this were an easy way out, all high-performance file systems would have multiple journals which you could write to concurrently (which they don't afaik). > Maybe... but I hope not. I just found an academic paper on this > subject, about which I will post shortly. I'm thinking that as long as your transactions have to be rigidly ordered you have a fundamental bottleneck you can't really work around. One way to maybe get around this is to try and work out on the fly if transaction 'A' is functionally independent of transaction 'B' -- maybe then you could try and write them concurrently to pre-allocated space on the log, or to separate logs maintained for that purpose. Good luck with that...even if you could somehow get it to work, you would still have degenerate cases (like, 99% of real world cases) to contend with.
Another thing you could try is to keep separate logs for rigidly ordered data (commits, xlog switch, etc) and non rigidly ordered data (everything else). On the non rigidly ordered side, you can pre-allocate log space and write to it. This is more or less a third potential route (#1 and #2 being the shared/private buffer approaches) of leveraging the fact that some of the data does not have to be rigidly ordered. Ultimately though even that could only get you so far, because it incurs other costs and even contention on the lock for inserting the commit records could start to bottleneck you. If something useful squirts out of academia, I'd sure like to see it :-). merlin
On Wed, Jun 8, 2011 at 11:20 PM, Merlin Moncure <mmoncure@gmail.com> wrote: > You're probably right. I think though there is enough hypothetical > upside to the private buffer case that it should be attempted just to > see what breaks. The major tricky bit is dealing with the new > pin/unpin mechanics. I'd like to give it the 'college try'. (being > typically vain and attention seeking, this is right up my alley) :-D. Well, I think it's fairly clear what will break: - If you make the data-file buffer completely private, then what will happen when some other backend needs to read or write that buffer? - If you make the XLOG spool private, you will not be able to checkpoint. But I just work here. Feel free to hit your head on that brick wall all you like. If you manage to make a hole (in the wall, not your head), I'll be as happy as anyone to climb through...! -- Robert Haas EnterpriseDB: http://www.enterprisedb.com The Enterprise PostgreSQL Company
On Wed, Jun 8, 2011 at 11:27 PM, Robert Haas <robertmhaas@gmail.com> wrote: > On Wed, Jun 8, 2011 at 11:20 PM, Merlin Moncure <mmoncure@gmail.com> wrote: >> You're probably right. I think though there is enough hypothetical >> upside to the private buffer case that it should be attempted just to >> see what breaks. The major tricky bit is dealing with the new >> pin/unpin mechanics. I'd like to give it the 'college try'. (being >> typically vain and attention seeking, this is right up my alley) :-D. > > Well, I think it's fairly clear what will break: > > - If you make the data-file buffer completely private, then what will > happen when some other backend needs to read or write that buffer? The private wal buffer? The whole point (maybe impossible) is to try and engineer it so that the other backends *never* have to read or write it -- from their point of view, it hasn't happened yet (even though it has been written into some heap buffers). Since all data action on ongoing transactions can happen at any time, moving WAL inserts into the private buffer delays their entry into the log so you can avoid taking locks for pre-commit heap activity. Doing this allows those backends to pretend they actually did write data out to the log without breaking the 'wal before data' rule, which is effected by keeping the pin on pages with your magic LSN (which I'm starting to wonder if it should be a flag like BM_DEFERRED_WAL). We essentially are moving xlog activity as far ahead in time as possible (although in a very limited time space) in order to combine locks and hopefully gain efficiency. It all comes down to which rules you can bend and which you can break. The heap pages that have been marked this way may or may not have to be off limits from the backend other than the one that did the marking, and if they have to be off limits logically, there may be no realistic path to make them so. I just don't know...I'm learning as I go.
At the end of the day, it's all coming off as pretty fragile if it even works, but it's fun to think about. Anyways, I'm inclined to experiment. > - If you make the XLOG spool private, you will not be able to checkpoint. Correct -- but I don't think this problem is intractable, and is really a secondary issue vs making sure the wal/heap/mvcc/backend interactions 'work'. The intent here is to spool only a relatively small amount of uncommitted transaction data for a short period of time, like 5-10 seconds. Maybe you bite the bullet and tell everyone to flush private WAL at checkpoint time via signal or something. Maybe you bend some of the rules on checkpoints. merlin
On Wed, Jun 8, 2011 at 11:30 PM, Merlin Moncure <mmoncure@gmail.com> wrote: > The heap pages that have been marked this way may or may not have to > be off limits from the backend other than the one that did the > marking, and if they have to be off limits logically, there may be no > realistic path to make them so. After some more thought, plus a bit of off-list coaching from Haas, I see now the whole approach is basically a non-starter due to the above. Heap pages *are* off limits, because once deferred they can't be scribbled on and committed by other transactions -- that would violate the 'wal before data' rule. To make it 'work', you'd have to implement shared memory machinery to do cooperative flushing as suggested upthread (complex, nasty) or simply block on deferred pages...which would be a deadlock factory. Oh well. :( merlin