From: Tom Lane
Subject: Re: [Testperf-general] Re: First set of OSDL Shared Mem scalability results, some weirdness ...
Msg-id: 10202.1097858894@sss.pgh.pa.us
In response to: Re: [Testperf-general] Re: First set of OSDL Shared Mem scalability results, some weirdness ... ("Simon Riggs" <simon@2ndquadrant.com>)
Responses: Re: [Testperf-general] Re: First set of OSDL Shared Mem scalability results, some weirdness ... (Bruce Momjian <pgman@candle.pha.pa.us>)
           Re: [Testperf-general] Re: First set of OSDL Shared Mem scalability results, some weirdness ... (Josh Berkus <josh@agliodbs.com>)
List: pgsql-performance

"Simon Riggs" <simon@2ndquadrant.com> writes:
> Speculating wildly because I don't know that portion of the code this might
> be:
> CONJECTURE 1: the act of searching for a block in cache is an O(n)
> operation, not an O(1) or O(log n) operation

I'm not sure how this meme got into circulation, but I've seen a couple
of people recently either conjecturing or asserting that.  Let me remind
people of the actual facts:

1. We use a hashtable to keep track of which blocks are currently in
shared buffers.  Either a cache hit or a cache miss should be O(1),
because the hashtable size is scaled proportionally to shared_buffers,
and so the number of hash entries examined should remain constant.
(A sketch of what this means in code follows point 2.)

2. There are some allegedly-not-performance-critical operations that do
scan through all the buffers, and therefore are O(N) in shared_buffers.
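
In sketch form, point 1 amounts to the following (names simplified and
paraphrased, not the actual bufmgr source):

    typedef struct
    {
        RelFileNode rnode;          /* which relation */
        BlockNumber blockNum;       /* which block within it */
    } BufferTag;

    typedef struct
    {
        BufferTag   key;            /* hash key: the block's identity */
        int         id;             /* slot number in shared buffers */
    } LookupEnt;

    static int
    lookup_buffer(BufferTag *tag)
    {
        LookupEnt  *ent;

        /*
         * Expected O(1): the table is sized in proportion to
         * shared_buffers, so bucket chains stay short no matter how
         * big NBuffers gets.
         */
        ent = (LookupEnt *) hash_search(SharedBufHash, (void *) tag,
                                        HASH_FIND, NULL);
        return ent ? ent->id : -1;  /* -1 = miss: caller must read the block */
    }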

I just eyeballed all the operations in the second category, and came up
with this list of O(N) operations and their call points:

AtEOXact_Buffers
    transaction commit or abort
UnlockBuffers
    transaction abort, backend exit
StrategyDirtyBufferList
    background writer's idle loop
FlushRelationBuffers
    VACUUM
    DROP TABLE, DROP INDEX
    TRUNCATE, CLUSTER, REINDEX
    ALTER TABLE SET TABLESPACE
DropRelFileNodeBuffers
    TRUNCATE (only for ON COMMIT TRUNC temp tables)
    REINDEX (inplace case only)
    smgr_internal_unlink (ie, the tail end of DROP TABLE/INDEX)
DropBuffers
    DROP DATABASE

The fact that the first two are called during transaction commit/abort
is mildly alarming.  The constant factors are going to be very tiny
though, because what these routines actually do is scan backend-local
status arrays looking for locked buffers, which they're not going to
find very many of.  For instance AtEOXact_Buffers looks like

    int         i;

    for (i = 0; i < NBuffers; i++)
    {
        if (PrivateRefCount[i] != 0)
        {
            /*
             * some code that should never be executed at all in the
             * commit case, and not that much in the abort case either
             */
        }
    }

I suppose with hundreds of thousands of shared buffers this might get to
the point of being noticeable, but I've never seen it show up at all in
profiling with more-normal buffer counts.  Not sure if it's worth
devising a more complex data structure to aid in finding locked buffers.
(To some extent this code is intended to be belt-and-suspenders stuff
for catching omissions elsewhere, and so a more complex data structure
that could have its own bugs is not especially attractive.)
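
For illustration only, such a structure might look like this (entirely
hypothetical, and the bookkeeping for dropping a pin is elided):

    /*
     * Hypothetical sketch, not proposed code: remember which buffers
     * this backend has pinned, so the end-of-transaction check scans
     * O(#pins) entries instead of all NBuffers.  Shrinking the list
     * when a pin count drops back to zero is elided here.
     */
    static int  PinnedBufs[100];    /* arbitrary cap, for the sketch */
    static int  NPinned = 0;

    static void
    note_pin(int buf_id)
    {
        if (PrivateRefCount[buf_id]++ == 0)
            PinnedBufs[NPinned++] = buf_id;     /* first pin: track it */
    }

    static void
    at_eoxact_check(void)
    {
        int         i;

        for (i = 0; i < NPinned; i++)
        {
            if (PrivateRefCount[PinnedBufs[i]] != 0)
            {
                /* same complain-and-release logic as today */
            }
        }
        NPinned = 0;
    }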

The one that's bothering me at the moment is StrategyDirtyBufferList,
which is a new overhead in 8.0.  It wouldn't directly affect foreground
query performance, but indirectly it would hurt by causing the bgwriter
to suck more CPU cycles than one would like (and it holds the BufMgrLock
while it's doing it, too :-().  One easy way you could see whether this
is an issue in the OSDL test is to see what happens if you double all
three bgwriter parameters (delay, percent, maxpages).  This should
result in about the same net I/O demand from the bgwriter, but
StrategyDirtyBufferList will be executed half as often.
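
In postgresql.conf terms the experiment would look about like this; the
"was" values are the 8.0 defaults, which I'm assuming here rather than
whatever the test run is actually using:

    # Sketch of the experiment: double all three bgwriter settings.
    bgwriter_delay    = 400     # ms between rounds (was 200)
    bgwriter_percent  = 2       # % of dirty buffers written per round (was 1)
    bgwriter_maxpages = 200     # max pages written per round (was 100)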

I doubt that the other ones are issues.  We could improve them by
devising a way to quickly find all buffers for a given relation, but
I am just about sure that complicating the buffer management to do so
would be a net loss for normal workloads.

> For the record, what I think we need is dynamically resizable
> shared_buffers, not a-priori knowledge of what you should set
> shared_buffers to.

This isn't likely to happen because the SysV shared memory API isn't
conducive to it.  Absent some amazingly convincing demonstration that
we have to have it, the effort of making it happen in a portable way
isn't going to get spent.
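
For anyone who hasn't fought with that API, the problem in miniature (a
sketch, not our actual shmem code):

    #include <stddef.h>
    #include <sys/ipc.h>
    #include <sys/shm.h>

    /*
     * A SysV segment's size is fixed at shmget() time, and there is
     * no portable resize call.
     */
    void *
    create_segment(size_t request_size)
    {
        int         shmid = shmget(IPC_PRIVATE, request_size,
                                   IPC_CREAT | 0600);
        void       *base;

        if (shmid < 0)
            return NULL;
        base = shmat(shmid, NULL, 0);
        if (base == (void *) -1)
            return NULL;
        /*
         * To "resize" you must shmdt(), destroy the segment with
         * shmctl(shmid, IPC_RMID, NULL), create a bigger one, and have
         * every backend re-attach --- hence the reluctance.
         */
        return base;
    }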

> I've been thinking about implementing a scheme that helps you decide how
> big the shared_buffers SHOULD BE, by making the LRU list bigger than the
> cache itself, so you'd be able to see whether there is beneficial effect in
> increasing shared_buffers.

ARC already keeps such a list --- couldn't you learn what you want to
know from the existing data structure?  It'd be fairly cool if we could
put out warnings "you ought to increase shared_buffers" analogous to the
existing facility for noting excessive checkpointing.
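
Something like the following, perhaps --- purely a sketch, with made-up
counter names, not fields the ARC code actually has today:

    /*
     * Hypothetical: ghost_hits and total_lookups are made-up counters.
     * The idea is that frequent hits on the ghost list (pages we
     * recently evicted) mean the cache is too small.
     */
    if (total_lookups > 0 &&
        (double) ghost_hits / total_lookups > 0.05)
        elog(LOG, "consider increasing shared_buffers: "
             "%d of the last %d lookups found their page "
             "was only recently evicted",
             ghost_hits, total_lookups);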

            regards, tom lane
