Thread: Re: [HACKERS] Getting rid of AtEOXact Buffers (was Re: [Testperf-general] Re: First set of OSDL Shared Memscalability results, some wierdness ...)

Seeing as I've missed the last N messages... I'll just reply to this
one, rather than each of them in turn...

Tom Lane <tgl@sss.pgh.pa.us> wrote on 16.10.2004, 18:54:17:
> I wrote:
> > Josh Berkus  writes:
> >> First off, two test runs with OProfile are available at:
> >> http://khack.osdl.org/stp/298124/
> >> http://khack.osdl.org/stp/298121/
>
> > Hmm.  The stuff above 1% in the first of these is
>
> > Counted CPU_CLK_UNHALTED events (clocks processor is not halted) with a unit mask of 0x00 (No unit mask) count 100000
> > samples  %        app name                 symbol name
> > ...
> > 920369    2.1332  postgres                 AtEOXact_Buffers
> > ...
>
> > In the second test AtEOXact_Buffers is much lower (down around 0.57
> > percent) but the other suspects are similar.  Since the only difference
> > in parameters is shared_buffers (36000 vs 9000), it does look like we
> > are approaching the point where AtEOXact_Buffers is a problem, but so
> > far it's only a 2% drag.

Yes... as soon as you first mentioned AtEOXact_Buffers, I realised I'd
seen it near the top of the oprofile results on previous tests.

Although you don't say this, I presume you're acting on the thought that
a 2% drag would soon become a much larger contention point with more
users and/or smaller transactions - since these things are highly
non-linear.

>
> It occurs to me that given the 8.0 resource manager mechanism, we could
> in fact dispense with AtEOXact_Buffers, or perhaps better turn it into a
> no-op unless #ifdef USE_ASSERT_CHECKING.  We'd just get rid of the
> special case for transaction termination in resowner.c and let the
> resource owner be responsible for releasing locked buffers always.  The
> OSDL results suggest that this won't matter much at the level of 10000
> or so shared buffers, but for 100000 or more buffers the linear scan in
> AtEOXact_Buffers is going to become a problem.
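
For concreteness, the change might end up looking something like the sketch below (the names and shape are illustrative only, not the actual bufmgr.c code):

#include "postgres.h"

/*
 * Illustrative sketch: the end-of-transaction scan survives only as an
 * assert-build cross-check, while the actual release of pins and locks is
 * left entirely to the resource owner mechanism.
 */
#ifdef USE_ASSERT_CHECKING
static void
AtEOXact_Buffers_Check(int nbuffers, const long *private_refcount)
{
    int     i;

    for (i = 0; i < nbuffers; i++)
    {
        /* the resource owner should have dropped every pin by now */
        Assert(private_refcount[i] == 0);
    }
}
#endif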

If the resource owner is always responsible for releasing locked
buffers, who releases the locks if the backend crashes? Do we need some
additional code in bgwriter (or?) to clean up buffer locks?

>
> We could also get rid of the linear search in UnlockBuffers().  The only
> thing it's for anymore is to release a BM_PIN_COUNT_WAITER flag, and
> since a backend could not be doing more than one of those at a time,
> we don't really need an array of flags for that, only a single variable.
> This does not show in the OSDL results, which I presume means that their
> test case is not exercising transaction aborts; but I think we need to
> zap both routines to make the world safe for large shared_buffers
> values.  (See also
> http://archives.postgresql.org/pgsql-performance/2004-10/msg00218.php)

Yes, that's important.
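
For what it's worth, the single-variable approach might look roughly like this (illustrative only; the helper that clears the flag under the buffer header lock is hypothetical):

/* Illustrative sketch, not the real bufmgr.c.  Instead of scanning a
 * per-buffer flags array at abort time, remember the one buffer (if any)
 * on which this backend set BM_PIN_COUNT_WAITER, and clear just that. */
extern void ClearPinCountWaiterFlag(int buf_id);    /* hypothetical helper */

static int  PinCountWaitBuf = -1;   /* -1 means "not currently waiting" */

static void
UnlockBuffers_Sketch(void)
{
    if (PinCountWaitBuf >= 0)
    {
        ClearPinCountWaiterFlag(PinCountWaitBuf);
        PinCountWaitBuf = -1;
    }
    /* no O(NBuffers) scan required */
}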

> Any objection to doing this for 8.0?
>

As you say, these issues definitely kick in at around 100000
shared_buffers - and there are a good few people out there running with
800MB of shared_buffers already.

Could I also suggest that we adopt your earlier suggestion of raising
the bgwriter parameters as a permanent measure - i.e. changing the
defaults in postgresql.conf? That way, StrategyDirtyBufferList won't
immediately show itself as a problem when using the default parameter
set. It would be a shame to remove one obstacle only to leave another
one following so close behind. [...and that also argues against an
earlier thought to introduce more fine-grained values for the
bgwriter's parameters, ISTM]
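
i.e. shipping something along these lines in postgresql.conf (the numbers are purely illustrative and would need benchmarking, not a tested recommendation):

# illustrative values only
bgwriter_delay = 200        # ms between bgwriter rounds
bgwriter_percent = 2        # max % of dirty buffers written per round
bgwriter_maxpages = 200     # cap on pages written per round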

Also, what will Vacuum delay do to the O(N) effect of
FlushRelationBuffers when called by VACUUM? Will the locks be held for
longer?

I think we should also do some tests with a VACUUM running in the
background. That isn't part of the DBT-2 set-up, but perhaps we might
argue *it should be* for the PostgreSQL version?

Dare we hope for a scalability increase in 8.0 after all....

Best Regards,

Simon Riggs

<simon@2ndquadrant.com> writes:
> If the resource owner is always responsible for releasing locked
> buffers, who releases the locks if the backend crashes?

The ensuing system reset takes care of that.

            regards, tom lane

From: Jan Wieck <JanWieck@Yahoo.com>

On 10/17/2004 3:40 PM, simon@2ndquadrant.com wrote:

> [...]
>
> If the resource owner is always responsible for releasing locked
> buffers, who releases the locks if the backend crashes? Do we need some
> additional code in bgwriter (or?) to clean up buffer locks?

If the backend crashes, the postmaster (assuming a possibly corrupted
shared memory) restarts the whole lot ... so why bother?

> [...]
>
> Could I also suggest that we adopt your earlier suggestion of raising
> the bgwriter parameters as a permanent measure - i.e. changing the
> defaults in postgresql.conf? That way, StrategyDirtyBufferList won't
> immediately show itself as a problem when using the default parameter
> set. It would be a shame to remove one obstacle only to leave another
> one following so close behind. [...and that also argues against an
> earlier thought to introduce more fine-grained values for the
> bgwriter's parameters, ISTM]

I realized that StrategyDirtyBufferList currently wastes a lot of time by
first scanning over all the buffers that haven't even been hit since
its last call and weren't dirty last time either (and thus are at
the beginning of the list and can't be dirty anyway). If we had a
way to give it a smart "point in the list to start scanning" ...
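
In rough outline, something like the following (purely illustrative, with hypothetical types, not the real freelist.c):

#include <stdbool.h>

typedef struct                  /* hypothetical LRU entry, for illustration */
{
    int     buf_id;
    bool    recently_hit;       /* touched since the previous scan? */
    bool    dirty;
} LruEntry;

static int  lru_scan_start = 0; /* where the interesting part began last time */

/* Skip the clean, untouched prefix of the LRU list and remember where the
 * next scan should start. */
static int
collect_dirty_sketch(LruEntry *lru, int nentries, int *dirty_out)
{
    int     i;
    int     ndirty = 0;
    int     new_start = nentries;

    for (i = lru_scan_start; i < nentries; i++)
    {
        if (!lru[i].recently_hit && !lru[i].dirty)
            continue;           /* still boring, skippable next time too */
        if (new_start == nentries)
            new_start = i;      /* first interesting entry: rescan from here */
        if (lru[i].dirty)
            dirty_out[ndirty++] = lru[i].buf_id;
    }
    lru_scan_start = new_start;
    return ndirty;
}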


>
> Also, what will Vacuum delay do to the O(N) effect of
> FlushRelationBuffers when called by VACUUM? Will the locks be held for
> longer?

Vacuum only naps at the points where it checks for interrupts, and at
that time it isn't supposed to hold any critical locks.


Jan

--
#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.                                  #
#================================================== JanWieck@Yahoo.com #

Jan Wieck <JanWieck@Yahoo.com> writes:
> I realized that StrategyDirtyBufferList currently wastes a lot of time by
> first scanning over all the buffers that haven't even been hit since
> its last call and weren't dirty last time either (and thus are at
> the beginning of the list and can't be dirty anyway). If we had a
> way to give it a smart "point in the list to start scanning" ...

I don't think it's true that they *can't* be dirty.

(1) Buffers are marked dirty when released, whereas they are moved to
the fronts of the lists when acquired.

(2) The cntxDirty bit can be set asynchronously to any BufMgrLock'd
operation.

But it sure seems like we are doing more work than we really need to.

One idea I had was for the bgwriter to collect all the dirty pages up to
say halfway on the LRU lists, and then write *all* of these, not just
the first N, over as many rounds as are needed.  Then go back and call
StrategyDirtyBufferList again to get a new list.  (We don't want it to
write every dirty buffer this way, because the ones near the front of
the list are likely to be dirtied again right away.  But certainly we
could write more than 1% of the dirty buffers without getting into the
area of the recently-used buffers.)
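
In outline (illustrative only; the two helpers are made up, not existing bufmgr functions):

extern int  CollectDirtyUpToMidpoint(int *buf_ids, int max);    /* hypothetical */
extern void FlushOneBuffer(int buf_id);                         /* hypothetical */

#define MAX_PENDING 8192            /* arbitrary for the sketch */

static int  pending[MAX_PENDING];   /* dirty buffers from the "old" half of the LRU */
static int  npending = 0;
static int  nextpending = 0;

/* One bgwriter round: drain the previously collected list a batch at a
 * time; only when it is exhausted do we rescan up to the LRU midpoint. */
static void
bgwriter_round_sketch(int batch_size)
{
    if (nextpending >= npending)
    {
        npending = CollectDirtyUpToMidpoint(pending, MAX_PENDING);
        nextpending = 0;
    }

    while (nextpending < npending && batch_size-- > 0)
        FlushOneBuffer(pending[nextpending++]);
}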

There isn't any particularly good reason for this to share code with
checkpoint-style BufferSync, btw.  BufferSync could just as easily scan
the buffers linearly, since it doesn't matter what order it writes them
in.  So we could change StrategyDirtyBufferList to stop as soon as it's
halfway up the LRU lists, which would save at least a few cycles.
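
i.e. the checkpoint path could be as simple as this (again illustrative; BufferIsDirty is a stand-in, FlushOneBuffer as in the sketch above):

#include <stdbool.h>

extern int  NBuffers;                   /* total number of shared buffers */
extern bool BufferIsDirty(int buf_id);  /* hypothetical check */
extern void FlushOneBuffer(int buf_id); /* hypothetical, as above */

static void
buffer_sync_all_sketch(void)
{
    int     i;

    /* checkpoint sync doesn't care about write order, so a plain walk of
     * the buffer array is enough, with no LRU list involved */
    for (i = 0; i < NBuffers; i++)
    {
        if (BufferIsDirty(i))
            FlushOneBuffer(i);
    }
}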

            regards, tom lane

From: Jan Wieck <JanWieck@Yahoo.com>

Trying to think a little out of the box: how "common" is it in modern
operating systems to be able to swap out shared memory?

Maybe we're not using the ARC algorithm correctly after all. The ARC
algorithm does not consider the second-level OS buffer cache in its
design. Maybe the total size of the ARC cache directory should not be 2x
the size of what is configured as the shared buffer cache, but rather 2x
the size of the effective cache size (in 8k pages). If we assume that
the job of the T1 queue is better done by the OS buffers anyway (and
this is what this discussion seems to point out), we shouldn't hold T1
pages in shared buffers (or only very few of them, and evict them ASAP).
We just account for them and assume that the OS still has cached the
pages we find in our T1 directory. I think that with the right setting
for effective cache size, this is a fair assumption. The T2 queue
represents the frequently used blocks. If our implementation keeps the
used and unused portions of the buffer pool from shifting around, the OS
will swap out the currently unused portions of the shared buffer cache
and use that memory for OS buffers.
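
As a back-of-the-envelope illustration of the directory sizes involved (the values and the per-entry cost are assumptions for the example, not measurements):

#include <stdio.h>

int
main(void)
{
    long    shared_buffers = 10000;         /* example: 10000 x 8k = 80 MB */
    long    effective_cache_size = 100000;  /* example: 100000 x 8k = 800 MB */
    long    entry_bytes = 48;               /* assumed per-directory-entry cost */

    /* today: the ARC directory covers 2 x shared_buffers entries */
    printf("current:  %ld entries, ~%ld kB\n",
           2 * shared_buffers, 2 * shared_buffers * entry_bytes / 1024);

    /* proposal: cover 2 x effective_cache_size entries instead */
    printf("proposed: %ld entries, ~%ld kB\n",
           2 * effective_cache_size, 2 * effective_cache_size * entry_bytes / 1024);

    return 0;
}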

To verify this theory, it would be interesting to see what the ARC
strategy thinks a good T2 size would be after a long DBT run with a
"large" buffer cache. Enabling the strategy debug message and running
the postmaster with -d1 will show that. In theory, that size should be
somewhere near the sweet spot.


Jan

--
#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.                                  #
#================================================== JanWieck@Yahoo.com #