Thread: First set of OSDL Shared Mem scalability results, some wierdness ...
Folks,

I'm hoping that some of you can shed some light on this.

I've been trying to peg the "sweet spot" for shared memory using OSDL's
equipment. With Jan's new ARC patch, I was expecting that the desired amount
of shared_buffers would be greatly increased. This has not turned out to be
the case.

The first test series was using OSDL's DBT2 (OLTP) test, with 150
"warehouses". All tests were run on a 4-way Pentium III 700MHz, 3.8GB RAM
system hooked up to a rather high-end storage device (14 spindles). Tests
were on PostgreSQL 8.0b3, Linux 2.6.7.

Here's a top-level summary:

shared_buffers   % RAM   NOTPM20*
1000             0.2%    1287
23000            5%      1507
46000            10%     1481
69000            15%     1382
92000            20%     1375
115000           25%     1380
138000           30%     1344

* = New Order Transactions Per Minute, last 20 Minutes
    Higher is better. The maximum possible is 1800.

As you can see, the "sweet spot" appears to be between 5% and 10% of RAM,
which is if anything *lower* than recommendations for 7.4!

This result is so surprising that I want people to take a look at it and
tell me if there's something wrong with the tests or some bottlenecking
factor that I've not seen.

In order above:
http://khack.osdl.org/stp/297959/
http://khack.osdl.org/stp/297960/
http://khack.osdl.org/stp/297961/
http://khack.osdl.org/stp/297962/
http://khack.osdl.org/stp/297963/
http://khack.osdl.org/stp/297964/
http://khack.osdl.org/stp/297965/

Please note that many of the graphs in these reports are broken. For one
thing, some aren't recorded (flat lines) and the CPU usage graph has
mislabeled lines.

--
--Josh

Josh Berkus
Aglio Database Solutions
San Francisco
I have an idea that makes some assumptions about internals that I think are
correct.

When you have a huge number of buffers in a list that has to be traversed to
look for things in cache, e.g. 100k, you will generate an almost equivalent
number of cache line misses on the processor to jump through all those
buffers. As I understand it (and I haven't looked so I could be wrong), the
buffer cache is searched by traversing it sequentially. OTOH, it seems
reasonable to me that the OS disk cache may actually be using a tree
structure that would generate vastly fewer cache misses by comparison to
find a buffer.

This could mean a substantial linear search cost as a function of the number
of buffers, big enough to rise above the noise floor when you have hundreds
of thousands of buffers. Cache misses start to really add up when a code
path generates many, many thousands of them, and differences in the access
path between the buffer cache and disk cache would be reflected when you
have that many buffers. I've seen these types of unexpected performance
anomalies before that got traced back to code patterns and cache efficiency
and gotten integer factors improvements by making some seemingly irrelevant
code changes.

So I guess my question would be 1) are my assumptions about the internals
correct, and 2) if they are, is there a way to optimize searching the buffer
cache so that a search doesn't iterate over a really long buffer list that
is bottlenecked on cache line replacement.

My random thought of the day,

j. andrew rogers
Josh Berkus <josh@agliodbs.com> writes:
> Here's a top-level summary:

> shared_buffers   % RAM   NOTPM20*
> 1000             0.2%    1287
> 23000            5%      1507
> 46000            10%     1481
> 69000            15%     1382
> 92000            20%     1375
> 115000           25%     1380
> 138000           30%     1344

> As you can see, the "sweet spot" appears to be between 5% and 10% of RAM,
> which is if anything *lower* than recommendations for 7.4!

This doesn't actually surprise me a lot. There are a number of aspects of
Postgres that will get slower the more buffers there are.

One thing that I hadn't focused on till just now, which is a new overhead in
8.0, is that StrategyDirtyBufferList() scans the *entire* buffer list *every
time it's called*, which is to say once per bgwriter loop. And to add insult
to injury, it's doing that with the BufMgrLock held (not that it's got any
choice).

We could alleviate this by changing the API between this function and
BufferSync, such that StrategyDirtyBufferList can stop as soon as it's found
all the buffers that are going to be written in this bgwriter cycle ... but
AFAICS that means abandoning the "bgwriter_percent" knob since you'd never
really know how many dirty pages there were altogether.

BTW, what is the actual size of the test database (disk footprint wise) and
how much of that do you think is heavily accessed during the run? It's
possible that the test conditions are such that adjusting shared_buffers
isn't going to mean anything anyway.

			regards, tom lane
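To make the API change Tom sketches above concrete, here is a minimal
illustration (hypothetical structure and function names, not the actual
bufmgr/freelist code) of a dirty-buffer collection that stops once it has
gathered the pages the bgwriter will write this cycle, rather than scanning
the whole buffer list; as Tom notes, stopping early means the total dirty-page
count is no longer known, which is what a "bgwriter_percent"-style knob needs.

    /* Hypothetical sketch (not PostgreSQL source): collect at most 'maxpages'
     * dirty buffer IDs and stop, instead of scanning the entire buffer list
     * on every bgwriter cycle. */
    #include <stdbool.h>

    typedef struct
    {
        int  buf_id;
        bool dirty;
    } BufHdr;

    static int
    collect_dirty_buffers(const BufHdr *bufs, int nbuffers,
                          int *dirty_ids, int maxpages)
    {
        int found = 0;

        /* Early exit: the loop no longer visits all nbuffers entries ... */
        for (int i = 0; i < nbuffers && found < maxpages; i++)
        {
            if (bufs[i].dirty)
                dirty_ids[found++] = bufs[i].buf_id;
        }

        /* ... which is exactly why the total number of dirty pages (needed
         * for a percentage-based knob) can no longer be reported. */
        return found;
    }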
"J. Andrew Rogers" <jrogers@neopolitan.com> writes: > As I understand it (and I haven't looked so I could be wrong), the > buffer cache is searched by traversing it sequentially. You really should look first. The main-line code paths use hashed lookups. There are some cases that do linear searches through the buffer headers or the CDB lists; in theory those are supposed to be non-performance-critical cases, though I am suspicious that some are not (see other response). In any case, those structures are considerably more compact than the buffers proper, and I doubt that cache misses per se are the killer factor. This does raise a question for Josh though, which is "where's the oprofile results?" If we do have major problems at the level of cache misses then oprofile would be able to prove it. regards, tom lane
On Fri, Oct 08, 2004 at 06:32:32PM -0400, Tom Lane wrote:
> This does raise a question for Josh though, which is "where's the
> oprofile results?" If we do have major problems at the level of cache
> misses then oprofile would be able to prove it.

Or cachegrind. I've found it to be really effective at pinpointing cache
misses in the past (one CPU-intensive routine was sped up by 30% just by
avoiding a memory clear). :-)

/* Steinar */
--
Homepage: http://www.sesse.net/
Tom,

> This does raise a question for Josh though, which is "where's the
> oprofile results?" If we do have major problems at the level of cache
> misses then oprofile would be able to prove it.

Missing, I'm afraid. OSDL has been having technical issues with STP all
week. Hopefully the next test run will have them.

--
--Josh

Josh Berkus
Aglio Database Solutions
San Francisco
Tom,

> BTW, what is the actual size of the test database (disk footprint wise)
> and how much of that do you think is heavily accessed during the run?
> It's possible that the test conditions are such that adjusting
> shared_buffers isn't going to mean anything anyway.

The raw data is 32GB, but a lot of the activity is incremental, that is
inserts and updates to recent inserts. Still, according to Mark, most of the
data does get queried in the course of filling orders.

--
--Josh

Josh Berkus
Aglio Database Solutions
San Francisco
From: Christopher Browne
Subject: Re: First set of OSDL Shared Mem scalability results, some wierdness ...
josh@agliodbs.com (Josh Berkus) wrote:
> I've been trying to peg the "sweet spot" for shared memory using
> OSDL's equipment. With Jan's new ARC patch, I was expecting that
> the desired amount of shared_buffers would be greatly increased. This
> has not turned out to be the case.

That doesn't surprise me.

My primary expectation would be that ARC would be able to make small buffers
much more effective alongside vacuums and seq scans than they used to be.
That does not establish anything about the value of increasing the size of
buffer caches...

> This result is so surprising that I want people to take a look at it
> and tell me if there's something wrong with the tests or some
> bottlenecking factor that I've not seen.

I'm aware of two conspicuous scenarios where ARC would be expected to
_substantially_ improve performance:

 1. When it allows a VACUUM not to throw useful data out of
    the shared cache in that VACUUM now only 'chews' on one
    page of the cache;

 2. When it allows a Seq Scan to not push useful data out of
    the shared cache, for much the same reason.

I don't imagine either scenario is prominent in the OSDL tests.

Increasing the number of cache buffers _is_ likely to lead to some
slowdowns:

- Data that passes through the cache also passes through kernel
  cache, so it's recorded twice, and read twice...

- The more cache pages there are, the more work is needed for
  PostgreSQL to manage them. That will notably happen anywhere
  that there is a need to scan the cache.

- If there are any inefficiencies in how the OS kernel manages shared
  memory, as their size scales, well, that will obviously cause a
  slowdown.
--
If this was helpful, <http://svcs.affero.net/rm.php?r=cbbrowne> rate me
http://www.ntlug.org/~cbbrowne/internet.html
"One World. One Web. One Program." -- MICROS~1 hype
"Ein Volk, ein Reich, ein Fuehrer" -- Nazi hype
(One people, one country, one leader)
Christopher Browne wrote:
> Increasing the number of cache buffers _is_ likely to lead to some
> slowdowns:
>
> - Data that passes through the cache also passes through kernel
>   cache, so it's recorded twice, and read twice...

Even worse, memory that's used for the PG cache is memory that's not
available to the kernel's page cache. Even if the overall memory usage in
the system isn't enough to cause some paging to disk, most modern kernels
will adjust the page/disk cache size dynamically to fit the memory demands
of the system, which in this case means it'll be smaller if running programs
need more memory for their own use.

This is why I sometimes wonder whether or not it would be a win to use
mmap() to access the data and index files -- doing so under a truly modern
OS would surely at the very least save a buffer copy (from the page/disk
cache to program memory) because the OS could instead directly map the
buffer cache pages into the program's memory space.

Since PG often has to have multiple files open at the same time, and in a
production database many of those files will be rather large, PG would have
to limit the size of the mmap()ed region on 32-bit platforms, which means
that things like the order of mmap() operations to access various parts of
the file can become just as important in the mmap()ed case as it is in the
read()/write() case (if not more so!). I would imagine that the use of
mmap() on a 64-bit platform would be a much, much larger win because PG
would most likely be able to mmap() entire files and let the OS work out how
to order disk reads and writes.

The biggest problem as I see it is that (I think) mmap() would have to be
made to cooperate with malloc() for virtual address space. I suspect issues
like this have already been worked out by others, however...

--
Kevin Brown
kevin@sysexperts.com
Christopher Browne wrote:
> josh@agliodbs.com (Josh Berkus) wrote:
>
>> This result is so surprising that I want people to take a look at it
>> and tell me if there's something wrong with the tests or some
>> bottlenecking factor that I've not seen.
>
> I'm aware of two conspicuous scenarios where ARC would be expected to
> _substantially_ improve performance:
>
>  1. When it allows a VACUUM not to throw useful data out of
>     the shared cache in that VACUUM now only 'chews' on one
>     page of the cache;

Right. Josh, I assume you didn't run these tests with pg_autovacuum running,
which might be interesting.

Also, how do these numbers compare to 7.4? They may not be what you
expected, but they might still be an improvement.

Matthew
Kevin Brown <kevin@sysexperts.com> writes:
> This is why I sometimes wonder whether or not it would be a win to use
> mmap() to access the data and index files --

mmap() is Right Out because it does not afford us sufficient control over
when changes to the in-memory data will propagate to disk. The
address-space-management problems you describe are also a nasty headache,
but that one is the showstopper.

			regards, tom lane
Tom Lane wrote:
> Kevin Brown <kevin@sysexperts.com> writes:
> > This is why I sometimes wonder whether or not it would be a win to use
> > mmap() to access the data and index files --
>
> mmap() is Right Out because it does not afford us sufficient control
> over when changes to the in-memory data will propagate to disk. The
> address-space-management problems you describe are also a nasty
> headache, but that one is the showstopper.

Huh? Surely fsync() or fdatasync() of the file descriptor associated with
the mmap()ed region at the appropriate times would accomplish much of this?

I'm particularly confused since PG's entire approach to disk I/O is
predicated on the notion that the OS, and not PG, is the best arbiter of
when data hits the disk. Otherwise it would be using raw partitions for the
highest-speed data store, yes?

Also, there isn't any particular requirement to use mmap() for everything --
you can use traditional open/write/close calls for the WAL and mmap() for
the data/index files (but it wouldn't surprise me if this would require some
extensive code changes).

That said, if it's typical for many changes to be made to a page internally
before PG needs to commit that page to disk, then your argument makes sense,
and that's especially true if we simply cannot have the page written to disk
in a partially-modified state (something I can easily see being an issue for
the WAL -- would the same hold true of the index/data files?).

--
Kevin Brown
kevin@sysexperts.com
I wrote:
> That said, if it's typical for many changes to be made to a page
> internally before PG needs to commit that page to disk, then your
> argument makes sense, and that's especially true if we simply cannot
> have the page written to disk in a partially-modified state (something
> I can easily see being an issue for the WAL -- would the same hold
> true of the index/data files?).

Also, even if multiple changes would be made to the page, with the page
being valid for a disk write only after all such changes are made, the use
of mmap() (in conjunction with an internal buffer that would then be copied
to the mmap()ed memory space at the appropriate time) would potentially save
a system call over the use of write() (even if write() were used to write
out multiple pages). However, there is so much lower-hanging fruit than this
that an mmap() implementation almost certainly isn't worth pursuing for this
alone.

So: it seems to me that mmap() is worth pursuing only if most internal
buffers tend to be written to only once, or if it's acceptable for a
partially modified data/index page to be written to disk (which I suppose
could be true for data/index pages in the face of a rock-solid WAL).

--
Kevin Brown
kevin@sysexperts.com
Kevin Brown <kevin@sysexperts.com> writes:
> Tom Lane wrote:
>> mmap() is Right Out because it does not afford us sufficient control
>> over when changes to the in-memory data will propagate to disk.

> ... that's especially true if we simply cannot
> have the page written to disk in a partially-modified state (something
> I can easily see being an issue for the WAL -- would the same hold
> true of the index/data files?).

You're almost there. Remember the fundamental WAL rule: log entries must hit
disk before the data changes they describe. That means that we need not only
a way of forcing changes to disk (fsync) but a way of being sure that
changes have *not* gone to disk yet. In the existing implementation we get
that by just not issuing write() for a given page until we know that the
relevant WAL log entries are fsync'd down to disk. (BTW, this is what the
LSN field on every page is for: it tells the buffer manager the latest WAL
offset that has to be flushed before it can safely write the page.)

mmap provides msync which is comparable to fsync, but AFAICS it provides no
way to prevent an in-memory change from reaching disk too soon. This would
mean that WAL entries would have to be written *and flushed* before we could
make the data change at all, which would convert multiple updates of a
single page into a series of write-and-wait-for-WAL-fsync steps. Not good.
fsync'ing WAL once per transaction is bad enough, once per atomic action is
intolerable.

There is another reason for doing things this way. Consider a backend that
goes haywire and scribbles all over shared memory before crashing. When the
postmaster sees the abnormal child termination, it forcibly kills the other
active backends and discards shared memory altogether. This gives us fairly
good odds that the crash did not affect any data on disk. It's not perfect
of course, since another backend might have been in process of issuing a
write() when the disaster happens, but it's pretty good; and I think that
that isolation has a lot to do with PG's good reputation for not corrupting
data in crashes. If we had a large fraction of the address space mmap'd then
this sort of crash would be just about guaranteed to propagate corruption
into the on-disk files.

			regards, tom lane
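A compact sketch of the ordering rule Tom spells out (hypothetical names; the
real logic lives in the buffer manager and the WAL-flush code) makes the
point explicit: the buffer manager must be able to hold a dirty page back
until WAL has been flushed past that page's LSN, which is exactly the control
that mmap() does not give you.

    /* Hypothetical sketch of the WAL-before-data rule: a dirty page may only
     * be written once WAL is flushed at least up to the page's LSN. */
    #include <stdbool.h>
    #include <stdint.h>

    typedef uint64_t XLogRecPtr;

    extern XLogRecPtr flushed_wal_lsn;               /* how far WAL is on disk */
    extern void xlog_flush(XLogRecPtr upto);         /* fsync WAL through 'upto' */
    extern void write_page_to_datafile(int buf_id);  /* plain write(), no fsync */

    typedef struct
    {
        int        buf_id;
        XLogRecPtr page_lsn;   /* newest WAL record describing this page */
        bool       dirty;
    } BufferDesc;

    static void
    flush_buffer(BufferDesc *buf)
    {
        if (!buf->dirty)
            return;

        /* WAL rule: log entries must hit disk before the data change does. */
        if (buf->page_lsn > flushed_wal_lsn)
            xlog_flush(buf->page_lsn);

        /* Only now is it safe for the change to reach the data file. With
         * write() we control this moment; with an mmap'd data file the
         * kernel could have pushed the page out at any earlier time. */
        write_page_to_datafile(buf->buf_id);
        buf->dirty = false;
    }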
Josh Berkus wrote:
> Folks,
>
> I'm hoping that some of you can shed some light on this.
>
> I've been trying to peg the "sweet spot" for shared memory using OSDL's
> equipment. With Jan's new ARC patch, I was expecting that the desired
> amount of shared_buffers would be greatly increased. This has not turned
> out to be the case.
>
> The first test series was using OSDL's DBT2 (OLTP) test, with 150
> "warehouses". All tests were run on a 4-way Pentium III 700MHz, 3.8GB RAM
> system hooked up to a rather high-end storage device (14 spindles). Tests
> were on PostgreSQL 8.0b3, Linux 2.6.7.

I'd like to see these tests run using the CPU affinity capability, in order
to oblige a backend not to change CPU during its lifetime; this could
drastically increase the cache hit rate.

Regards
Gaetano Mendola
On Fri, 8 Oct 2004, Josh Berkus wrote:
> As you can see, the "sweet spot" appears to be between 5% and 10% of RAM,
> which is if anything *lower* than recommendations for 7.4!

What recommendation is that? To have shared buffers be about 10% of the RAM
sounds familiar to me. What was recommended for 7.4? In the past we used to
say that the worst value is 50%, since then the same things might be cached
both by pg and the OS disk cache.

Why do we expect the shared buffer size sweet spot to change because of the
new ARC stuff? And why would it make it better to have bigger shared mem?
Wouldn't it be the opposite: now that we don't invalidate as much of the
cache for vacuums and seq. scans, we can do as good caching as before but
with less shared buffers.

That said, testing and getting some numbers of good sizes for shared mem is
good.

--
/Dennis Björklund
On 10/8/2004 10:10 PM, Christopher Browne wrote:
> josh@agliodbs.com (Josh Berkus) wrote:
>> I've been trying to peg the "sweet spot" for shared memory using
>> OSDL's equipment. With Jan's new ARC patch, I was expecting that
>> the desired amount of shared_buffers would be greatly increased. This
>> has not turned out to be the case.
>
> That doesn't surprise me.

Neither does it surprise me.

> My primary expectation would be that ARC would be able to make small
> buffers much more effective alongside vacuums and seq scans than they
> used to be. That does not establish anything about the value of
> increasing the size of buffer caches...

The primary goal of ARC is to prevent total cache eviction caused by
sequential scans. Which means it is designed to avoid the catastrophic
impact of a pg_dump or other, similar access in parallel to the OLTP
traffic. It would be much more interesting to see how a pg_dump started
halfway into a 2-hour measurement interval affects the response times.

One also has to take a closer look at the data of the DBT2. What amount of
that 32GB is high-frequently accessed, and therefore a good thing to live in
the PG shared cache? A cache significantly larger than that doesn't make
sense to me, under any cache strategy.

Jan

>> This result is so surprising that I want people to take a look at it
>> and tell me if there's something wrong with the tests or some
>> bottlenecking factor that I've not seen.
>
> I'm aware of two conspicuous scenarios where ARC would be expected to
> _substantially_ improve performance:
>
>  1. When it allows a VACUUM not to throw useful data out of
>     the shared cache in that VACUUM now only 'chews' on one
>     page of the cache;
>
>  2. When it allows a Seq Scan to not push useful data out of
>     the shared cache, for much the same reason.
>
> I don't imagine either scenario is prominent in the OSDL tests.
>
> Increasing the number of cache buffers _is_ likely to lead to some
> slowdowns:
>
> - Data that passes through the cache also passes through kernel
>   cache, so it's recorded twice, and read twice...
>
> - The more cache pages there are, the more work is needed for
>   PostgreSQL to manage them. That will notably happen anywhere
>   that there is a need to scan the cache.
>
> - If there are any inefficiencies in how the OS kernel manages shared
>   memory, as their size scales, well, that will obviously cause a
>   slowdown.

--
#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.                                  #
#================================================== JanWieck@Yahoo.com #
On 10/9/2004 7:20 AM, Kevin Brown wrote:
> Christopher Browne wrote:
>> Increasing the number of cache buffers _is_ likely to lead to some
>> slowdowns:
>>
>> - Data that passes through the cache also passes through kernel
>>   cache, so it's recorded twice, and read twice...
>
> Even worse, memory that's used for the PG cache is memory that's not
> available to the kernel's page cache. Even if the overall memory

Which underlines my previous statement, that a PG shared cache much larger
than the high-frequently accessed data portion of the DB is
counterproductive. Double buffering (kernel disk buffer plus shared buffer)
only makes sense for data that would otherwise cause excessive memory copies
in and out of the shared buffer. After that, it only lowers the memory
available for disk buffers.

Jan

> usage in the system isn't enough to cause some paging to disk, most
> modern kernels will adjust the page/disk cache size dynamically to fit
> the memory demands of the system, which in this case means it'll be
> smaller if running programs need more memory for their own use.
>
> This is why I sometimes wonder whether or not it would be a win to use
> mmap() to access the data and index files -- doing so under a truly
> modern OS would surely at the very least save a buffer copy (from the
> page/disk cache to program memory) because the OS could instead
> directly map the buffer cache pages into the program's memory space.
>
> Since PG often has to have multiple files open at the same time, and
> in a production database many of those files will be rather large, PG
> would have to limit the size of the mmap()ed region on 32-bit
> platforms, which means that things like the order of mmap() operations
> to access various parts of the file can become just as important in
> the mmap()ed case as it is in the read()/write() case (if not more
> so!). I would imagine that the use of mmap() on a 64-bit platform
> would be a much, much larger win because PG would most likely be able
> to mmap() entire files and let the OS work out how to order disk reads
> and writes.
>
> The biggest problem as I see it is that (I think) mmap() would have to
> be made to cooperate with malloc() for virtual address space. I
> suspect issues like this have already been worked out by others,
> however...

--
#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.                                  #
#================================================== JanWieck@Yahoo.com #
Jan Wieck <JanWieck@Yahoo.com> writes:

> On 10/8/2004 10:10 PM, Christopher Browne wrote:
>
> > josh@agliodbs.com (Josh Berkus) wrote:
> >> I've been trying to peg the "sweet spot" for shared memory using
> >> OSDL's equipment. With Jan's new ARC patch, I was expecting that
> >> the desired amount of shared_buffers would be greatly increased. This
> >> has not turned out to be the case.
> > That doesn't surprise me.
>
> Neither does it surprise me.

There's been some speculation that having shared_buffers be about 50% of
your RAM is pessimal, as it guarantees the OS cache is merely doubling up on
all the buffers postgres is keeping. I wonder whether there's a second sweet
spot where the postgres cache is closer to the total amount of RAM.

That configuration would have disadvantages for servers running other jobs
besides postgres. And I was led to believe earlier that postgres starts each
backend with a fairly fresh slate as far as the ARC algorithm goes, so it
wouldn't work well for a postgres server that had lots of short to
moderate-life sessions. But if it were even close it could be interesting.

Reading the data with O_DIRECT and having a single global cache could be
interesting experiments. I know there are arguments against each of these,
but ...

I'm still pulling for an mmap approach to eliminate postgres's buffer cache
entirely in the long term, but it seems like slim odds now. But one way or
the other, having two layers of buffering seems like a waste.

--
greg
On 10/13/2004 11:52 PM, Greg Stark wrote:
> Jan Wieck <JanWieck@Yahoo.com> writes:
>
>> On 10/8/2004 10:10 PM, Christopher Browne wrote:
>>
>> > josh@agliodbs.com (Josh Berkus) wrote:
>> >> I've been trying to peg the "sweet spot" for shared memory using
>> >> OSDL's equipment. With Jan's new ARC patch, I was expecting that
>> >> the desired amount of shared_buffers would be greatly increased. This
>> >> has not turned out to be the case.
>> > That doesn't surprise me.
>>
>> Neither does it surprise me.
>
> There's been some speculation that having shared_buffers be about 50% of
> your RAM is pessimal, as it guarantees the OS cache is merely doubling up
> on all the buffers postgres is keeping. I wonder whether there's a second
> sweet spot where the postgres cache is closer to the total amount of RAM.

Which would require that shared memory is not allowed to be swapped out
(and swapping it out is allowed in Linux by default, IIRC), so as not to
completely distort the entire test.

Jan

--
#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.                                  #
#================================================== JanWieck@Yahoo.com #
Jan Wieck <JanWieck@Yahoo.com> writes:

> Which would require that shared memory is not allowed to be swapped out
> (and swapping it out is allowed in Linux by default, IIRC), so as not to
> completely distort the entire test.

Well, if it's getting swapped out then it's clearly not being used
effectively.

There are APIs to bar swapping out pages, and the tests could be run without
swap. I suggested it only as an experiment though; there are lots of details
between here and having it be a good configuration for production use.

--
greg
On 10/14/2004 12:22 AM, Greg Stark wrote:
> Jan Wieck <JanWieck@Yahoo.com> writes:
>
>> Which would require that shared memory is not allowed to be swapped out
>> (and swapping it out is allowed in Linux by default, IIRC), so as not to
>> completely distort the entire test.
>
> Well, if it's getting swapped out then it's clearly not being used
> effectively.

Is it really that easy if 3 different cache algorithms (PG cache, kernel
buffers and swapping) are competing for the same chips?

Jan

> There are APIs to bar swapping out pages, and the tests could be run
> without swap. I suggested it only as an experiment though; there are lots
> of details between here and having it be a good configuration for
> production use.

--
#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.                                  #
#================================================== JanWieck@Yahoo.com #
Tom Lane wrote:
> Kevin Brown <kevin@sysexperts.com> writes:
> > Tom Lane wrote:
> >> mmap() is Right Out because it does not afford us sufficient control
> >> over when changes to the in-memory data will propagate to disk.
>
> > ... that's especially true if we simply cannot
> > have the page written to disk in a partially-modified state (something
> > I can easily see being an issue for the WAL -- would the same hold
> > true of the index/data files?).
>
> You're almost there. Remember the fundamental WAL rule: log entries
> must hit disk before the data changes they describe. That means that we
> need not only a way of forcing changes to disk (fsync) but a way of
> being sure that changes have *not* gone to disk yet. In the existing
> implementation we get that by just not issuing write() for a given page
> until we know that the relevant WAL log entries are fsync'd down to
> disk. (BTW, this is what the LSN field on every page is for: it tells
> the buffer manager the latest WAL offset that has to be flushed before
> it can safely write the page.)
>
> mmap provides msync which is comparable to fsync, but AFAICS it
> provides no way to prevent an in-memory change from reaching disk too
> soon. This would mean that WAL entries would have to be written *and
> flushed* before we could make the data change at all, which would
> convert multiple updates of a single page into a series of write-and-
> wait-for-WAL-fsync steps. Not good. fsync'ing WAL once per transaction
> is bad enough, once per atomic action is intolerable.

Hmm... something just occurred to me about this.

Would a hybrid approach be possible? That is, use mmap() to handle reads,
and use write() to handle writes?

Any code that wishes to write to a page would have to recognize that it's
doing so and fetch a copy from the storage manager (or something), which
would look to see if the page already exists as a writeable buffer. If it
doesn't, it creates it by allocating the memory and then copying the page
from the mmap()ed area to the new buffer, and returning it. If it does, it
just returns a pointer to the buffer. There would obviously have to be some
bookkeeping involved: the storage manager would have to know how to map a
mmap()ed page back to a writeable buffer and vice-versa, so that once it
decides to write the buffer it can determine which page in the original file
the buffer corresponds to (so it can do the appropriate seek()).

In a write-heavy database, you'll end up with a lot of memory copy
operations, but with the scheme we currently use you get that anyway (it
just happens in kernel code instead of user code), so I don't see that as
much of a loss, if any. Where you win is in a read-heavy database: you end
up being able to read directly from the pages in the kernel's page cache and
thus save a memory copy from kernel space to user space, not to mention the
context switch that happens due to issuing the read().

Obviously you'd want to mmap() the file read-only in order to prevent the
issues you mention regarding an errant backend, and then reopen the file
read-write for the purpose of writing to it. In fact, you could decouple the
two: mmap() the file, then close the file -- the mmap()ed region will remain
mapped. Then, as long as the file remains mapped, you need to open the file
again only when you want to write to it.

--
Kevin Brown
kevin@sysexperts.com
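As an illustration of the hybrid Kevin proposes, here is a sketch with
hypothetical helper names (it deliberately ignores the mmap-vs-write
synchronization caveats Tom raises in the next message): the data file is
mapped read-only so reads come straight from the kernel page cache, while
writes still go through an ordinary file descriptor so their timing stays
under the program's control.

    /* Sketch of a hybrid scheme: mmap() a data file read-only for reads, but
     * perform all writes through pwrite(), so the WAL-before-data rule can
     * still be enforced. Hypothetical, not PostgreSQL code. */
    #include <fcntl.h>
    #include <sys/mman.h>
    #include <unistd.h>

    #define BLCKSZ 8192

    typedef struct
    {
        int         fd;     /* kept open read-write for pwrite() */
        const char *map;    /* read-only mapping of the whole file */
        size_t      len;
    } DataFile;

    static int
    datafile_open(DataFile *df, const char *path, size_t len)
    {
        df->fd = open(path, O_RDWR);
        if (df->fd < 0)
            return -1;
        df->len = len;
        df->map = mmap(NULL, len, PROT_READ, MAP_SHARED, df->fd, 0);
        return df->map == MAP_FAILED ? -1 : 0;
    }

    /* Reads come straight from the kernel page cache: no private copy and
     * no read() system call. */
    static const char *
    datafile_read_page(DataFile *df, size_t blockno)
    {
        return df->map + blockno * BLCKSZ;
    }

    /* Writes push a private, fully modified page image back with pwrite(),
     * so the moment the change reaches the file stays under our control. */
    static ssize_t
    datafile_write_page(DataFile *df, size_t blockno, const char *page)
    {
        return pwrite(df->fd, page, BLCKSZ, (off_t) (blockno * BLCKSZ));
    }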
First off, I'd like to get involved with these tests - pressure of other
work only has prevented me. Here's my take on the results so far:

I think taking the ratio of the memory allocated to shared_buffers against
the total memory available on the server is completely fallacious. That is
why the results cannot be explained - IMHO the ratio has no real theoretical
basis.

The important ratio for me is the amount of shared_buffers against the total
size of the database in the benchmark test. Every database workload has a
differing percentage of the total database size that represents the "working
set", or the memory that can be beneficially cached. For the tests that
DBT-2 is performing, I say that there are only so many blocks that are worth
the trouble of caching. If you cache more than this, you are wasting your
time.

For me, these tests don't show that there is a "sweet spot" that you should
set your shared_buffers to, only that for that specific test, you have
located the correct size for shared_buffers. For me, it would be an
incorrect inference that this could then be interpreted as the percentage of
the available RAM where the "sweet spot" lies for all workloads.

The theoretical basis for my comments is this: DBT-2 is essentially a static
workload. That means, for a long test, we can work out with reasonable
certainty the probability that a block will be requested, for every single
block in the database. Given a particular size of cache, you can work out
what your overall cache hit ratio is and therefore what your speedup is
compared with retrieving every single block from disk (the no-cache
scenario). If you draw a graph of speedup (y) against cache size as a % of
total database size, the graph looks like an upside-down "L" - i.e. the
graph rises steeply as you give it more memory, then turns sharply at a
particular point, after which it flattens out. The "turning point" is the
"sweet spot" we all seek - the optimum amount of cache memory to allocate -
but this spot depends upon the workload and database size, not on available
RAM on the system under test.

Clearly, the presence of the OS disk cache complicates this. Since we have
two caches both allocated from the same pot of memory, it should be clear
that if we overallocate one cache beyond its optimum effectiveness, while
the second cache is still in its "more is better" stage, then we will get
reduced performance. That seems to be the case here. I wouldn't accept that
a fixed ratio between the two caches exists for ALL, or even the majority
of, workloads - though clearly broad-brush workloads such as "OLTP" and
"Data Warehousing" do have similar-ish requirements.

As an example, let's look at an application with two tables: SmallTab has
10,000 rows of 100 bytes each (so the table is ~1 MB) - one row per photo in
a photo gallery web site. LargeTab has large objects within it and has
10,000 photos, average size 10 MB (so the table is ~100GB). Assuming all
photos are requested randomly, you can see that an optimum cache size for
this workload is 1MB RAM, 100GB disk. Trying to up the cache doesn't have
much effect on the probability that a photo (from LargeTab) will be in
cache, unless you have a large % of 100GB of RAM, when you do start to make
gains. (Please don't be picky about indexes, catalog, block size etc.) That
clearly has absolutely nothing at all to do with the RAM of the system on
which it is running.

I think Jan has said this also in far fewer words, but I'll leave that to
Jan to agree/disagree...
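Expressed as a formula (notation added here, not Simon's): for a static
workload, only the cache hit ratio h(C) at cache size C matters, where

    T_{eff}(C) = h(C)\, t_{cache} + \bigl(1 - h(C)\bigr)\, t_{disk},
    \qquad \mathrm{speedup}(C) = \frac{T_{eff}(0)}{T_{eff}(C)}
                               = \frac{t_{disk}}{T_{eff}(C)}.

Because t_disk is orders of magnitude larger than t_cache, the speedup climbs
steeply while h(C) is still growing and flattens once the hot working set
fits, giving the upside-down-"L" curve described above; nothing in the
expression involves the total RAM of the system under test.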
I say this: ARC in 8.0 PostgreSQL allows us to sensibly allocate as large a
shared_buffers cache as is required by the database workload, and this
should not be constrained to a small percentage of server RAM.

Best Regards,

Simon Riggs

> -----Original Message-----
> From: pgsql-performance-owner@postgresql.org
> [mailto:pgsql-performance-owner@postgresql.org] On Behalf Of Josh Berkus
> Sent: 08 October 2004 22:43
> To: pgsql-performance@postgresql.org
> Cc: testperf-general@pgfoundry.org
> Subject: [PERFORM] First set of OSDL Shared Mem scalability results,
> some wierdness ...
>
> Folks,
>
> I'm hoping that some of you can shed some light on this.
>
> I've been trying to peg the "sweet spot" for shared memory using OSDL's
> equipment. With Jan's new ARC patch, I was expecting that the desired
> amount of shared_buffers would be greatly increased. This has not
> turned out to be the case.
>
> The first test series was using OSDL's DBT2 (OLTP) test, with 150
> "warehouses". All tests were run on a 4-way Pentium III 700MHz, 3.8GB RAM
> system hooked up to a rather high-end storage device (14 spindles).
> Tests were on PostgreSQL 8.0b3, Linux 2.6.7.
>
> Here's a top-level summary:
>
> shared_buffers   % RAM   NOTPM20*
> 1000             0.2%    1287
> 23000            5%      1507
> 46000            10%     1481
> 69000            15%     1382
> 92000            20%     1375
> 115000           25%     1380
> 138000           30%     1344
>
> * = New Order Transactions Per Minute, last 20 Minutes
>     Higher is better. The maximum possible is 1800.
>
> As you can see, the "sweet spot" appears to be between 5% and 10% of RAM,
> which is if anything *lower* than recommendations for 7.4!
>
> This result is so surprising that I want people to take a look at
> it and tell me if there's something wrong with the tests or some
> bottlenecking factor that I've not seen.
>
> In order above:
> http://khack.osdl.org/stp/297959/
> http://khack.osdl.org/stp/297960/
> http://khack.osdl.org/stp/297961/
> http://khack.osdl.org/stp/297962/
> http://khack.osdl.org/stp/297963/
> http://khack.osdl.org/stp/297964/
> http://khack.osdl.org/stp/297965/
>
> Please note that many of the graphs in these reports are broken. For one
> thing, some aren't recorded (flat lines) and the CPU usage graph has
> mislabeled lines.
>
> --
> --Josh
>
> Josh Berkus
> Aglio Database Solutions
> San Francisco
>
> ---------------------------(end of broadcast)---------------------------
> TIP 8: explain analyze is your friend
Simon,

<lots of good stuff clipped>

> If you draw a graph of speedup (y) against cache size as a
> % of total database size, the graph looks like an upside-down "L" - i.e.
> the graph rises steeply as you give it more memory, then turns sharply at
> a particular point, after which it flattens out. The "turning point" is
> the "sweet spot" we all seek - the optimum amount of cache memory to
> allocate - but this spot depends upon the workload and database size, not
> on available RAM on the system under test.

Hmmm ... how do you explain, then, the "camel hump" nature of the real
performance? That is, when we allocated even a few MB more than the
"optimum" ~190MB, overall performance started to drop quickly. The result is
that allocating 2x optimum RAM is nearly as bad as allocating too little
(e.g. 8MB).

The only explanation I've heard of this so far is that there is a
significant loss of efficiency with larger caches. Or do you see the loss of
200MB out of 3500MB would actually affect the kernel cache that much?

Anyway, one test of your theory that I can run immediately is to run the
exact same workload on a bigger, faster server and see if the desired
quantity of shared_buffers is roughly the same. I'm hoping that you're wrong
-- not because I don't find your argument persuasive, but because if you're
right it leaves us without any reasonable ability to recommend shared_buffer
settings.

--
--Josh

Josh Berkus
Aglio Database Solutions
San Francisco
On Thu, 2004-10-14 at 16:57 -0700, Josh Berkus wrote:
> Simon,
>
> <lots of good stuff clipped>
>
> > If you draw a graph of speedup (y) against cache size as a
> > % of total database size, the graph looks like an upside-down "L" - i.e.
> > the graph rises steeply as you give it more memory, then turns sharply
> > at a particular point, after which it flattens out. The "turning point"
> > is the "sweet spot" we all seek - the optimum amount of cache memory to
> > allocate - but this spot depends upon the workload and database size,
> > not on available RAM on the system under test.
>
> Hmmm ... how do you explain, then, the "camel hump" nature of the real
> performance? That is, when we allocated even a few MB more than the
> "optimum" ~190MB, overall performance started to drop quickly. The result
> is that allocating 2x optimum RAM is nearly as bad as allocating too
> little (e.g. 8MB).
>
> The only explanation I've heard of this so far is that there is a
> significant loss of efficiency with larger caches. Or do you see the loss
> of 200MB out of 3500MB would actually affect the kernel cache that much?

In a past life there seemed to be a sweet spot around the application's
working set. Performance went up until you got just a little larger than
the cache needed to hold the working set and then went down. Most of the
time a nice-looking hump. It seems to have to do with the additional pages
not increasing your hit ratio but increasing the amount of work to get a hit
in cache. This seemed to be independent of the actual database software
being used. (I observed this running Oracle, Informix, Sybase and Ingres.)

> Anyway, one test of your theory that I can run immediately is to run the
> exact same workload on a bigger, faster server and see if the desired
> quantity of shared_buffers is roughly the same. I'm hoping that you're
> wrong -- not because I don't find your argument persuasive, but because
> if you're right it leaves us without any reasonable ability to recommend
> shared_buffer settings.

--
Timothy D. Witham - Chief Technology Officer - wookie@osdl.org
Open Source Development Lab Inc - A non-profit corporation
12725 SW Millikan Way - Suite 400 - Beaverton OR, 97005
(503)-626-2455 x11 (office) (503)-702-2871 (cell)
(503)-626-2436 (fax)
From: Christopher Browne
Subject: Re: First set of OSDL Shared Mem scalability results, some wierdness ...
Quoth simon@2ndquadrant.com ("Simon Riggs"):
> I say this: ARC in 8.0 PostgreSQL allows us to sensibly allocate as
> large a shared_buffers cache as is required by the database
> workload, and this should not be constrained to a small percentage
> of server RAM.

I don't think that this particularly follows from "what ARC does."

"What ARC does" is to prevent certain conspicuous patterns of sequential
accesses from essentially trashing the contents of the cache.

If a particular benchmark does not include conspicuous vacuums or sequential
scans on large tables, then there is little reason to expect ARC to have a
noticeable impact on performance.

It _could_ be that this implies that ARC allows you to get some use out of a
larger shared cache, as it won't get blown away by vacuums and Seq Scans.
But it is _not_ obvious that this is a necessary truth.

_Other_ truths we know about are:

 a) If you increase the shared cache, that means more data that is
    represented in both the shared cache and the OS buffer cache,
    which seems rather a waste;

 b) The larger the shared cache, the more pages there are for the
    backend to rummage through before it looks to the filesystem,
    and therefore the more expensive cache misses get. Cache hits
    get more expensive, too. Searching through memory is not
    costless.
--
(format nil "~S@~S" "cbbrowne" "acm.org")
http://linuxfinances.info/info/linuxdistributions.html
"The X-Files are too optimistic. The truth is *not* out there..."
-- Anthony Ord <nws@rollingthunder.co.uk>
Kevin Brown <kevin@sysexperts.com> writes:
> Hmm... something just occurred to me about this.

> Would a hybrid approach be possible? That is, use mmap() to handle
> reads, and use write() to handle writes?

Nope. Have you read the specs regarding mmap-vs-stdio synchronization?
Basically it says that there are no guarantees whatsoever if you try this.
The SUS text is a bit weaselly ("the application must ensure correct
synchronization") but the HPUX mmap man page, among others, lays it on the
line:

     It is also unspecified whether write references to a memory region
     mapped with MAP_SHARED are visible to processes reading the file and
     whether writes to a file are visible to processes that have mapped the
     modified portion of that file, except for the effect of msync().

It might work on particular OSes but I think depending on such behavior
would be folly...

			regards, tom lane
From: "Simon Riggs"
Subject: Re: [Testperf-general] Re: First set of OSDL Shared Mem scalability results, some wierdness ...
> Timothy D. Witham
> On Thu, 2004-10-14 at 16:57 -0700, Josh Berkus wrote:
> > Simon,
> >
> > <lots of good stuff clipped>
> >
> > > If you draw a graph of speedup (y) against cache size as a
> > > % of total database size, the graph looks like an upside-down "L" -
> > > i.e. the graph rises steeply as you give it more memory, then turns
> > > sharply at a particular point, after which it flattens out. The
> > > "turning point" is the "sweet spot" we all seek - the optimum amount
> > > of cache memory to allocate - but this spot depends upon the workload
> > > and database size, not on available RAM on the system under test.
> >
> > Hmmm ... how do you explain, then, the "camel hump" nature of the real
> > performance? That is, when we allocated even a few MB more than the
> > "optimum" ~190MB, overall performance started to drop quickly. The
> > result is that allocating 2x optimum RAM is nearly as bad as allocating
> > too little (e.g. 8MB).

Two ways of explaining this:

1. Once you've hit the optimum size of shared_buffers, you may not yet have
hit the optimum size of the OS cache. If that is true, every extra block
given to shared_buffers is wasted, yet detracts from the beneficial effect
of the OS cache. I don't see how the small drop in size of the OS cache
could have the effect you have measured, so I suggest that this possible
explanation doesn't fit the results well.

2. There is some algorithmic effect within PostgreSQL that makes larger
shared_buffers much worse than smaller ones. Imagine that each extra block
we hold in cache has the positive benefit from caching, minus a postulated
negative drag effect. With that model we would get: once the optimal size of
the cache has been reached, the positive benefit tails off to almost zero
and we are just left with the situation that each new block added to
shared_buffers acts as a further drag on performance. That model would fit
the results, so we can begin to look at what the drag effect might be.

Speculating wildly because I don't know that portion of the code, this might
be:

CONJECTURE 1: the act of searching for a block in cache is an O(n)
operation, not an O(1) or O(log n) operation - so searching a larger cache
has an additional slowing effect on the application, via a buffer cache lock
that is held while the cache is searched - larger caches are locked for
longer than smaller caches, so this causes additional contention in the
system, which then slows down performance.

The effect might show up by examining the oprofile results for the test
cases. What we would be looking for is something that is being called more
frequently with larger shared_buffers - this could be anything... but my
guess is the oprofile results won't be similar and could lead us to a better
understanding.

> > The only explanation I've heard of this so far is that there is a
> > significant loss of efficiency with larger caches. Or do you see the
> > loss of 200MB out of 3500MB would actually affect the kernel cache
> > that much?
>
> In a past life there seemed to be a sweet spot around the application's
> working set. Performance went up until you got just a little larger than
> the cache needed to hold the working set and then went down. Most of the
> time a nice-looking hump. It seems to have to do with the additional
> pages not increasing your hit ratio but increasing the amount of work to
> get a hit in cache. This seemed to be independent of the actual database
> software being used. (I observed this running Oracle, Informix, Sybase
> and Ingres.)

Good, our experiences seem to be similar.

> > Anyway, one test of your theory that I can run immediately is to run
> > the exact same workload on a bigger, faster server and see if the
> > desired quantity of shared_buffers is roughly the same.

I agree that you could test this by running on a bigger or smaller server,
i.e. one with more or less RAM. Running on a faster/slower server at the
same time might alter the results and confuse the situation.

> > I'm hoping that you're wrong -- not because I don't find your argument
> > persuasive, but because if you're right it leaves us without any
> > reasonable ability to recommend shared_buffer settings.

For the record, what I think we need is dynamically resizable
shared_buffers, not a-priori knowledge of what you should set shared_buffers
to. I've been thinking about implementing a scheme that helps you decide how
big the shared_buffers SHOULD BE, by making the LRU list bigger than the
cache itself, so you'd be able to see whether there is beneficial effect in
increasing shared_buffers.

... remember that this applies to other databases too, and with those we
find that they have dynamically resizable memory.

Having said all that, there are still a great many other performance tests
to run so that we CAN recommend other settings, such as the optimizer cost
parameters, bg writer defaults etc.

Best Regards,

Simon Riggs
2nd Quadrant
Tom Lane wrote:
> Kevin Brown <kevin@sysexperts.com> writes:
> > Hmm... something just occurred to me about this.
>
> > Would a hybrid approach be possible? That is, use mmap() to handle
> > reads, and use write() to handle writes?
>
> Nope. Have you read the specs regarding mmap-vs-stdio synchronization?
> Basically it says that there are no guarantees whatsoever if you try
> this. The SUS text is a bit weaselly ("the application must ensure
> correct synchronization") but the HPUX mmap man page, among others,
> lays it on the line:
>
>     It is also unspecified whether write references to a memory region
>     mapped with MAP_SHARED are visible to processes reading the file and
>     whether writes to a file are visible to processes that have mapped the
>     modified portion of that file, except for the effect of msync().
>
> It might work on particular OSes but I think depending on such behavior
> would be folly...

Yeah, and at this point it can't be considered portable in any real way
because of this. Thanks for the perspective. I should have expected the
general specification to be quite broken in this regard, not to mention
certain implementations. :-)

Good thing there's a lot of lower-hanging fruit than this...

--
Kevin Brown
kevin@sysexperts.com
Tom Lane wrote:
> Kevin Brown <kevin@sysexperts.com> writes:
>
>> Hmm... something just occurred to me about this.
>>
>> Would a hybrid approach be possible? That is, use mmap() to handle
>> reads, and use write() to handle writes?
>
> Nope. Have you read the specs regarding mmap-vs-stdio synchronization?
> Basically it says that there are no guarantees whatsoever if you try
> this. The SUS text is a bit weaselly ("the application must ensure
> correct synchronization") but the HPUX mmap man page, among others,
> lays it on the line:
>
>     It is also unspecified whether write references to a memory region
>     mapped with MAP_SHARED are visible to processes reading the file and
>     whether writes to a file are visible to processes that have mapped the
>     modified portion of that file, except for the effect of msync().
>
> It might work on particular OSes but I think depending on such behavior
> would be folly...

We have some anecdotal experience along these lines: there was a set of
kernel bugs in Solaris 2.6 or 7 related to this as well. We had several
kernel panics and it took a bit to chase down, but the basic feedback was
"oops, we're screwed". I've forgotten most of the details right now; the
basic problem was a file being read+written via mmap and read()/write() at
(essentially) the same time from the same pid. It would panic the system
quite reliably. I believe the bugs related to this have been resolved in
Solaris, but it was unpleasant to chase that problem down...

--
Alan
From: Tom Lane
Subject: Re: [Testperf-general] Re: First set of OSDL Shared Mem scalability results, some wierdness ...
"Simon Riggs" <simon@2ndquadrant.com> writes: > Speculating wildly because I don't know that portion of the code this might > be: > CONJECTURE 1: the act of searching for a block in cache is an O(n) > operation, not an O(1) or O(log n) operation I'm not sure how this meme got into circulation, but I've seen a couple of people recently either conjecturing or asserting that. Let me remind people of the actual facts: 1. We use a hashtable to keep track of which blocks are currently in shared buffers. Either a cache hit or a cache miss should be O(1), because the hashtable size is scaled proportionally to shared_buffers, and so the number of hash entries examined should remain constant. 2. There are some allegedly-not-performance-critical operations that do scan through all the buffers, and therefore are O(N) in shared_buffers. I just eyeballed all the latter, and came up with this list of O(N) operations and their call points: AtEOXact_Buffers transaction commit or abort UnlockBuffers transaction abort, backend exit StrategyDirtyBufferList background writer's idle loop FlushRelationBuffers VACUUM DROP TABLE, DROP INDEX TRUNCATE, CLUSTER, REINDEX ALTER TABLE SET TABLESPACE DropRelFileNodeBuffers TRUNCATE (only for ON COMMIT TRUNC temp tables) REINDEX (inplace case only) smgr_internal_unlink (ie, the tail end of DROP TABLE/INDEX) DropBuffers DROP DATABASE The fact that the first two are called during transaction commit/abort is mildly alarming. The constant factors are going to be very tiny though, because what these routines actually do is scan backend-local status arrays looking for locked buffers, which they're not going to find very many of. For instance AtEOXact_Buffers looks like int i; for (i = 0; i < NBuffers; i++) { if (PrivateRefCount[i] != 0) { // some code that should never be executed at all in the commit // case, and not that much in the abort case either } } I suppose with hundreds of thousands of shared buffers this might get to the point of being noticeable, but I've never seen it show up at all in profiling with more-normal buffer counts. Not sure if it's worth devising a more complex data structure to aid in finding locked buffers. (To some extent this code is intended to be belt-and-suspenders stuff for catching omissions elsewhere, and so a more complex data structure that could have its own bugs is not especially attractive.) The one that's bothering me at the moment is StrategyDirtyBufferList, which is a new overhead in 8.0. It wouldn't directly affect foreground query performance, but indirectly it would hurt by causing the bgwriter to suck more CPU cycles than one would like (and it holds the BufMgrLock while it's doing it, too :-(). One easy way you could see whether this is an issue in the OSDL test is to see what happens if you double all three bgwriter parameters (delay, percent, maxpages). This should result in about the same net I/O demand from the bgwriter, but StrategyDirtyBufferList will be executed half as often. I doubt that the other ones are issues. We could improve them by devising a way to quickly find all buffers for a given relation, but I am just about sure that complicating the buffer management to do so would be a net loss for normal workloads. > For the record, what I think we need is dynamically resizable > shared_buffers, not a-priori knowledge of what you should set > shared_buffers to. This isn't likely to happen because the SysV shared memory API isn't conducive to it. 
Absent some amazingly convincing demonstration that we have to have it, the effort of making it happen in a portable way isn't going to get spent. > I've been thinking about implementing a scheme that helps you decide how > big the shared_buffers SHOULD BE, by making the LRU list bigger than the > cache itself, so you'd be able to see whether there is beneficial effect in > increasing shared_buffers. ARC already keeps such a list --- couldn't you learn what you want to know from the existing data structure? It'd be fairly cool if we could put out warnings "you ought to increase shared_buffers" analogous to the existing facility for noting excessive checkpointing. regards, tom lane
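For reference, Tom's suggested experiment amounts to a postgresql.conf change
along these lines. The parameter names are the 8.0 GUCs named above; the
baseline numbers shown are assumed defaults and should be replaced with
whatever the OSDL runs actually used:

    # Double all three, keeping net bgwriter I/O roughly constant while
    # running StrategyDirtyBufferList half as often (illustrative values).
    bgwriter_delay    = 400     # was 200 (milliseconds between rounds)
    bgwriter_percent  = 2       # was 1
    bgwriter_maxpages = 200     # was 100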
Tom Lane wrote:
> > I've been thinking about implementing a scheme that helps you decide how
> > big the shared_buffers SHOULD BE, by making the LRU list bigger than the
> > cache itself, so you'd be able to see whether there is beneficial effect
> > in increasing shared_buffers.
>
> ARC already keeps such a list --- couldn't you learn what you want to
> know from the existing data structure? It'd be fairly cool if we could
> put out warnings "you ought to increase shared_buffers" analogous to the
> existing facility for noting excessive checkpointing.

Agreed. ARC already keeps a list of buffers it had to push out recently so
if it needs them again soon it knows its sizing of recent/frequent might be
off (I think). Anyway, such a log report would be super-cool, say if you
pushed out a buffer and needed it very soon, and the ARC buffers are already
at their maximum for that buffer pool.

--
Bruce Momjian                        | http://candle.pha.pa.us
pgman@candle.pha.pa.us               | (610) 359-1001
+ If your life is a hard drive,      | 13 Roberts Road
+ Christ can be your backup.         | Newtown Square, Pennsylvania 19073
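A rough sketch of the kind of check Bruce and Tom are describing (purely
hypothetical names; the real ARC directory bookkeeping is more involved)
would, on a cache miss, consult the list of recently pushed-out buffers and
emit a hint when a block comes back "very soon" while the cache is already
at its maximum:

    /* Hypothetical sketch: warn when a recently evicted block is requested
     * again soon, suggesting shared_buffers may be too small. */
    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    typedef struct
    {
        uint32_t rel_oid;
        uint32_t block_num;
        uint64_t evicted_at;    /* buffer-access counter at eviction time */
    } EvictedEntry;

    extern bool     ghost_list_lookup(uint32_t rel_oid, uint32_t block_num,
                                      EvictedEntry *entry);
    extern uint64_t access_counter;
    extern bool     cache_is_full;

    #define SOON_THRESHOLD 10000   /* "needed it very soon", in accesses */

    static void
    note_cache_miss(uint32_t rel_oid, uint32_t block_num)
    {
        EvictedEntry e;

        if (cache_is_full &&
            ghost_list_lookup(rel_oid, block_num, &e) &&
            access_counter - e.evicted_at < SOON_THRESHOLD)
        {
            fprintf(stderr,
                    "HINT: block (%u,%u) was evicted only %llu accesses ago; "
                    "consider increasing shared_buffers\n",
                    (unsigned) rel_oid, (unsigned) block_num,
                    (unsigned long long) (access_counter - e.evicted_at));
        }
    }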
From: "Simon Riggs"
Subject: Re: [Testperf-general] Re: First set of OSDL Shared Mem scalability results, some wierdness ...
> Bruce Momjian > Tom Lane wrote: > > > I've been thinking about implementing a scheme that helps you > decide how > > > big the shared_buffers SHOULD BE, by making the LRU list > bigger than the > > > cache itself, so you'd be able to see whether there is > beneficial effect in > > > increasing shared_buffers. > > > > ARC already keeps such a list --- couldn't you learn what you want to > > know from the existing data structure? It'd be fairly cool if we could > > put out warnings "you ought to increase shared_buffers" analogous to the > > existing facility for noting excessive checkpointing. First off, many thanks for taking the time to provide the real detail on the code. That gives us some much needed direction in interpreting the oprofile output. > > Agreed. ARC already keeps a list of buffers it had to push out recently > so if it needs them again soon it knows its sizing of recent/frequent > might be off (I think). Anyway, such a log report would be super-cool, > say if you pushed out a buffer and needed it very soon, and the ARC > buffers are already at their maximum for that buffer pool. > OK, I guess I hadn't realised we were half-way there. The "increase shared_buffers" warning would be useful, but it would be much cooler to have some guidance as to how big to set it, especially since this requires a restart of the server. What I had in mind was a way of keeping track of how the buffer cache hit ratio would look at various sizes of shared_buffers, for example 50%, 80%, 120%, 150%, 200% and 400% say. That way you'd stand a chance of plotting the curve and thereby assessing how much memory could be allocated. I've got a few ideas, but I need to check out the code first. I'll investigate both simple/complex options as an 8.1 feature. Best Regards, Simon Riggs
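For what it's worth, a very rough sketch of what Simon describes might look like the following -- a recency list longer than the real cache, used to count how many references *would* have been hits at several hypothetical shared_buffers sizes. It simulates a plain LRU list, as Simon's wording suggests (ARC's recent/frequent split is ignored), and every name in it is invented for illustration:

```c
/*
 * Very rough sketch, invented names throughout -- not PostgreSQL code.
 * Keep a recency-ordered list of block tags that is larger than the real
 * cache, and count how many references would have been hits at several
 * hypothetical shared_buffers sizes (classic LRU stack-distance counting).
 */
#include <string.h>

#define TRACKED_TAGS 4096               /* assumption: ~4x the real cache */

typedef struct { unsigned rel; unsigned blocknum; } BlockTag;

static BlockTag recency[TRACKED_TAGS];  /* recency[0] = most recently used */
static int      nrecency = 0;

static const double size_factor[] = {0.5, 0.8, 1.2, 1.5, 2.0, 4.0};
static long         would_hit[6];
static long         total_refs;

static void
note_reference(BlockTag tag, int shared_buffers)
{
    int pos, i;

    total_refs++;

    /* find the tag's current recency rank (linear scan for clarity only) */
    for (pos = 0; pos < nrecency; pos++)
        if (recency[pos].rel == tag.rel && recency[pos].blocknum == tag.blocknum)
            break;

    if (pos < nrecency)
    {
        /* an LRU cache holding 'size' pages would have hit iff pos < size */
        for (i = 0; i < 6; i++)
            if (pos < (int) (size_factor[i] * shared_buffers))
                would_hit[i]++;
        /* drop the old copy at 'pos' by shifting entries 0..pos-1 up one slot */
        memmove(&recency[1], &recency[0], pos * sizeof(BlockTag));
    }
    else
    {
        /* miss: push everything down, dropping the oldest entry if full */
        int keep = (nrecency < TRACKED_TAGS) ? nrecency++ : TRACKED_TAGS - 1;

        memmove(&recency[1], &recency[0], keep * sizeof(BlockTag));
    }
    recency[0] = tag;       /* newly referenced tag becomes most recent */
}
```

Reporting would_hit[i] / total_refs for each factor at intervals would give exactly the kind of curve Simon wants to plot, at the price of extra bookkeeping on every buffer lookup.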
Re: [Testperf-general] Re: First set of OSDL Shared Mem scalability results, some wierdness ...
From: Josh Berkus
People: > First off, many thanks for taking the time to provide the real detail on > the code. > > That gives us some much needed direction in interpreting the oprofile > output. I have some OProfile output; however, it covers only 2 of the 20 tests I ran recently, and I need to get those sorted out. --Josh -- --Josh Josh Berkus Aglio Database Solutions San Francisco
Re: [Testperf-general] Re: First set of OSDL Shared Mem scalability results, some wierdness ...
From: Josh Berkus
Tom, Simon: First off, two test runs with OProfile are available at: http://khack.osdl.org/stp/298124/ http://khack.osdl.org/stp/298121/ > AtEOXact_Buffers > transaction commit or abort > UnlockBuffers > transaction abort, backend exit Actually, this might explain the "hump" shape of the curve for this test. DBT2 is an OLTP test, which means that (at this scale level) it's attempting to do approximately 30 COMMITs per second as well as one ROLLBACK every 3 seconds. When I get the tests on DBT3 running, if we see a more gentle dropoff on overallocated memory, it would indicate that the above may be a factor. -- --Josh Josh Berkus Aglio Database Solutions San Francisco
> this. The SUS text is a bit weaselly ("the application must ensure > correct synchronization") but the HPUX mmap man page, among others, > lays it on the line: > > It is also unspecified whether write references to a memory region > mapped with MAP_SHARED are visible to processes reading the file > and > whether writes to a file are visible to processes that have > mapped the > modified portion of that file, except for the effect of msync(). > > It might work on particular OSes but I think depending on such behavior > would be folly... Agreed. Only OSes with a coherent file system buffer cache should ever use mmap(2). In order for this to work on HPUX, msync(2) would need to be used. -sc -- Sean Chittenden
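To spell out what that means in practice, here is a bare-bones illustration (the file name and 8 KB page size are placeholders) of the extra step a MAP_SHARED writer needs on such platforms before other processes reading the file -- rather than the mapping -- are guaranteed to see the change:

```c
/*
 * Minimal illustration of the point above: on a platform without a unified
 * buffer cache, a writer must msync() a MAP_SHARED region before the change
 * is guaranteed to be visible via ordinary file reads.
 */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int
main(void)
{
    const size_t page_sz = 8192;
    int fd = open("datafile", O_RDWR);

    if (fd < 0)
        return 1;

    char *page = mmap(NULL, page_sz, PROT_READ | PROT_WRITE,
                      MAP_SHARED, fd, 0);
    if (page == MAP_FAILED)
        return 1;

    memcpy(page, "modified tuple data", 19);    /* modify the mapped page */

    /* Without this, SUS does not promise the write is visible via read(2). */
    if (msync(page, page_sz, MS_SYNC) != 0)
        perror("msync");

    munmap(page, page_sz);
    close(fd);
    return 0;
}
```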
Re: [Testperf-general] Re: First set of OSDL Shared Mem scalability results, some wierdness ...
From: Tom Lane
Josh Berkus <josh@agliodbs.com> writes:
> First off, two test runs with OProfile are available at:
> http://khack.osdl.org/stp/298124/
> http://khack.osdl.org/stp/298121/

Hmm. The stuff above 1% in the first of these is

Counted CPU_CLK_UNHALTED events (clocks processor is not halted) with a unit mask of 0x00 (No unit mask) count 100000
samples  %        app name   symbol name
8522858  19.7539  vmlinux    default_idle
3510225   8.1359  vmlinux    recalc_sigpending_tsk
1874601   4.3449  vmlinux    .text.lock.signal
1653816   3.8331  postgres   SearchCatCache
1080908   2.5053  postgres   AllocSetAlloc
 920369   2.1332  postgres   AtEOXact_Buffers
 806218   1.8686  postgres   OpernameGetCandidates
 803125   1.8614  postgres   StrategyDirtyBufferList
 746123   1.7293  vmlinux    __copy_from_user_ll
 651978   1.5111  vmlinux    __copy_to_user_ll
 640511   1.4845  postgres   XLogInsert
 630797   1.4620  vmlinux    rm_from_queue
 607833   1.4088  vmlinux    next_thread
 436682   1.0121  postgres   LWLockAcquire
 419672   0.9727  postgres   yyparse

In the second test AtEOXact_Buffers is much lower (down around 0.57 percent) but the other suspects are similar. Since the only difference in parameters is shared_buffers (36000 vs 9000), it does look like we are approaching the point where AtEOXact_Buffers is a problem, but so far it's only a 2% drag.

I suspect the reason recalc_sigpending_tsk is so high is that the original coding of PG_TRY involved saving and restoring the signal mask, which led to a whole lot of sigsetmask-type kernel calls. Is this test with beta3, or something older?

Another interesting item here is the costs of __copy_from_user_ll/__copy_to_user_ll:

36000 buffers:
 746123   1.7293  vmlinux    __copy_from_user_ll
 651978   1.5111  vmlinux    __copy_to_user_ll

9000 buffers:
 866414   2.0810  vmlinux    __copy_from_user_ll
 852620   2.0479  vmlinux    __copy_to_user_ll

Presumably the higher costs for 9000 buffers reflect an increased amount of shuffling of data between kernel and user space. So 36000 is not enough to make the working set totally memory-resident, but even if we drove this cost to zero we'd only be buying a couple percent.

			regards, tom lane
Re: [Testperf-general] Re: First set of OSDL Shared Mem scalability results, some wierdness ...
From: Josh Berkus
Tom, > I suspect the reason recalc_sigpending_tsk is so high is that the > original coding of PG_TRY involved saving and restoring the signal mask, > which led to a whole lot of sigsetmask-type kernel calls. Is this test > with beta3, or something older? Beta3, *without* Gavin or Neil's Futex patch. -- --Josh Josh Berkus Aglio Database Solutions San Francisco
Re: [Testperf-general] Re: First set of OSDL Shared Mem scalability results, some wierdness ...
From: Tom Lane
Josh Berkus <josh@agliodbs.com> writes: >> I suspect the reason recalc_sigpending_tsk is so high is that the >> original coding of PG_TRY involved saving and restoring the signal mask, >> which led to a whole lot of sigsetmask-type kernel calls. Is this test >> with beta3, or something older? > Beta3, *without* Gavin or Neil's Futex patch. Hmm, in that case the cost deserves some further investigation. Can we find out just what that routine does and where it's being called from? regards, tom lane
Re: [Testperf-general] Re: First set of OSDL Shared Mem scalability results, some wierdness ...
From: Mark Wong
On Fri, Oct 15, 2004 at 05:27:29PM -0400, Tom Lane wrote: > Josh Berkus <josh@agliodbs.com> writes: > >> I suspect the reason recalc_sigpending_tsk is so high is that the > >> original coding of PG_TRY involved saving and restoring the signal mask, > >> which led to a whole lot of sigsetmask-type kernel calls. Is this test > >> with beta3, or something older? > > > Beta3, *without* Gavin or Neil's Futex patch. > > Hmm, in that case the cost deserves some further investigation. Can we > find out just what that routine does and where it's being called from? > There's a call-graph feature with oprofile as of version 0.8 with the opstack tool, but I'm having a terrible time figuring out why the output isn't doing the graphing part. Otherwise, I'd have that available already... Mark
Re: [Testperf-general] Re: First set of OSDL Shared Mem scalability results, some wierdness ...
From: Tom Lane
Mark Wong <markw@osdl.org> writes: > On Fri, Oct 15, 2004 at 05:27:29PM -0400, Tom Lane wrote: >> Hmm, in that case the cost deserves some further investigation. Can we >> find out just what that routine does and where it's being called from? > There's a call-graph feature with oprofile as of version 0.8 with > the opstack tool, but I'm having a terrible time figuring out why the > output isn't doing the graphing part. Otherwise, I'd have that > available already... I was wondering if this might be associated with do_sigaction. do_sigaction is only 0.23 percent of the runtime according to the oprofile results: http://khack.osdl.org/stp/298124/oprofile/DBT_2_Profile-all.oprofile.txt but the profile results for the same run: http://khack.osdl.org/stp/298124/profile/DBT_2_Profile-tick.sort show do_sigaction very high and recalc_sigpending_tsk nowhere at all. Something funny there. regards, tom lane
Re: [Testperf-general] Re: First set of OSDL Shared Mem scalability results, some wierdness ...
From: Mark Wong
On Fri, Oct 15, 2004 at 05:44:34PM -0400, Tom Lane wrote: > Mark Wong <markw@osdl.org> writes: > > On Fri, Oct 15, 2004 at 05:27:29PM -0400, Tom Lane wrote: > >> Hmm, in that case the cost deserves some further investigation. Can we > >> find out just what that routine does and where it's being called from? > > > There's a call-graph feature with oprofile as of version 0.8 with > > the opstack tool, but I'm having a terrible time figuring out why the > > output isn't doing the graphing part. Otherwise, I'd have that > > available already... > > I was wondering if this might be associated with do_sigaction. > do_sigaction is only 0.23 percent of the runtime according to the > oprofile results: > http://khack.osdl.org/stp/298124/oprofile/DBT_2_Profile-all.oprofile.txt > but the profile results for the same run: > http://khack.osdl.org/stp/298124/profile/DBT_2_Profile-tick.sort > show do_sigaction very high and recalc_sigpending_tsk nowhere at all. > Something funny there. > I have always attributed those kinds of differences to how readprofile and oprofile collect their data, though I admit I don't exactly understand it. Is anyone familiar with the differences between the two? Mark
Getting rid of AtEOXact_Buffers (was Re: [Testperf-general] Re: First set of OSDL Shared Mem scalability results, some wierdness ...)
From: Tom Lane
I wrote:
> Josh Berkus <josh@agliodbs.com> writes:
>> First off, two test runs with OProfile are available at:
>> http://khack.osdl.org/stp/298124/
>> http://khack.osdl.org/stp/298121/

> Hmm. The stuff above 1% in the first of these is

> Counted CPU_CLK_UNHALTED events (clocks processor is not halted) with a unit mask of 0x00 (No unit mask) count 100000
> samples  %        app name   symbol name
> ...
> 920369   2.1332   postgres   AtEOXact_Buffers
> ...

> In the second test AtEOXact_Buffers is much lower (down around 0.57 percent) but the other suspects are similar. Since the only difference in parameters is shared_buffers (36000 vs 9000), it does look like we are approaching the point where AtEOXact_Buffers is a problem, but so far it's only a 2% drag.

It occurs to me that given the 8.0 resource manager mechanism, we could in fact dispense with AtEOXact_Buffers, or perhaps better turn it into a no-op unless #ifdef USE_ASSERT_CHECKING. We'd just get rid of the special case for transaction termination in resowner.c and let the resource owner be responsible for releasing locked buffers always. The OSDL results suggest that this won't matter much at the level of 10000 or so shared buffers, but for 100000 or more buffers the linear scan in AtEOXact_Buffers is going to become a problem.

We could also get rid of the linear search in UnlockBuffers(). The only thing it's for anymore is to release a BM_PIN_COUNT_WAITER flag, and since a backend could not be doing more than one of those at a time, we don't really need an array of flags for that, only a single variable. This does not show in the OSDL results, which I presume means that their test case is not exercising transaction aborts; but I think we need to zap both routines to make the world safe for large shared_buffers values. (See also http://archives.postgresql.org/pgsql-performance/2004-10/msg00218.php)

Any objection to doing this for 8.0?

			regards, tom lane
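A sketch of the UnlockBuffers() half of that proposal might look like the following (names invented here, not the committed change): remember the single buffer, if any, on which this backend set the pin-count-waiter flag, so abort-time cleanup touches one buffer instead of looping over all of them.

```c
/*
 * Illustration only -- invented names, not the actual patch.  A backend can
 * be waiting on at most one buffer's pin count, so a single variable is
 * enough; abort cleanup then clears one flag instead of scanning NBuffers.
 */
#define NO_WAIT_BUFFER (-1)

static int pin_count_wait_buf = NO_WAIT_BUFFER;     /* backend-local */

/* Record the buffer before sleeping until its pin count drops. */
static void
start_pin_count_wait(int buf_id)
{
    pin_count_wait_buf = buf_id;
}

/* Abort/exit cleanup: O(1) instead of a loop over every shared buffer. */
static void
unlock_buffers_sketch(void)
{
    if (pin_count_wait_buf != NO_WAIT_BUFFER)
    {
        /* clear the BM_PIN_COUNT_WAITER flag on that buffer's header here */
        pin_count_wait_buf = NO_WAIT_BUFFER;
    }
}
```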
Re: Getting rid of AtEOXact_Buffers (was Re: [Testperf-general] Re: First set of OSDL Shared Mem scalability results, some wierdness ...)
From: Josh Berkus
Tom, > We could also get rid of the linear search in UnlockBuffers(). The only > thing it's for anymore is to release a BM_PIN_COUNT_WAITER flag, and > since a backend could not be doing more than one of those at a time, > we don't really need an array of flags for that, only a single variable. > This does not show in the OSDL results, which I presume means that their > test case is not exercising transaction aborts; In the test, one out of every 100 new order transactions is aborted (about 1 out of 150 transactions overall). -- --Josh Josh Berkus Aglio Database Solutions San Francisco
Re: Getting rid of AtEOXact_Buffers (was Re: [Testperf-general] Re: First set of OSDL Shared Mem scalability results, some wierdness ...)
From: Tom Lane
Josh Berkus <josh@agliodbs.com> writes: >> This does not show in the OSDL results, which I presume means that their >> test case is not exercising transaction aborts; > In the test, one out of every 100 new order transactions is aborted (about 1 > out of 150 transactions overall). Okay, but that just ensures that any bottlenecks in xact abort will be down in the noise in this test case ... In any case, those changes are in CVS now if you want to try them. regards, tom lane
Re: Getting rid of AtEOXact_Buffers (was Re: [Testperf-general] Re: First set of OSDL Shared Mem scalability results, some wierdness ...)
From: Josh Berkus
Tom, > In any case, those changes are in CVS now if you want to try them. OK. Will have to wait until OSDL gives me a dedicated testing machine sometime mon/tues/wed. -- --Josh Josh Berkus Aglio Database Solutions San Francisco
On 10/14/2004 6:36 PM, Simon Riggs wrote: > [...] > I think Jan has said this also in far fewer words, but I'll leave that to > Jan to agree/disagree... I do agree. The total DB size has as little to do with the optimum shared buffer cache size as the total available RAM of the machine does. After reading your comments it appears clearer to me: all those tests really show is the amount of frequently accessed data in this particular database population and workload combination. > > I say this: ARC in 8.0 PostgreSQL allows us to sensibly allocate as large a > shared_buffers cache as is required by the database workload, and this > should not be constrained to a small percentage of server RAM. Right. Jan -- #======================================================================# # It's easier to get forgiveness for being wrong than for being right. # # Let's break this rule - forgive me. # #================================================== JanWieck@Yahoo.com #
On 10/14/2004 8:10 PM, Christopher Browne wrote:
> Quoth simon@2ndquadrant.com ("Simon Riggs"):
>> I say this: ARC in 8.0 PostgreSQL allows us to sensibly allocate as
>> large a shared_buffers cache as is required by the database
>> workload, and this should not be constrained to a small percentage
>> of server RAM.
>
> I don't think that this particularly follows from "what ARC does."

The combination of ARC together with the background writer is supposed to allow us to allocate the optimum even if that is large. The former implementation of the LRU without a background writer would just hang the server for a long time during a checkpoint, which is absolutely unacceptable for any OLTP system.

Jan

> "What ARC does" is to prevent certain conspicuous patterns of
> sequential accesses from essentially trashing the contents of the
> cache.
>
> If a particular benchmark does not include conspicuous vacuums or
> sequential scans on large tables, then there is little reason to
> expect ARC to have a noticeable impact on performance.
>
> It _could_ be that this implies that ARC allows you to get some use
> out of a larger shared cache, as it won't get blown away by vacuums
> and Seq Scans. But it is _not_ obvious that this is a necessary
> truth.
>
> _Other_ truths we know about are:
>
> a) If you increase the shared cache, that means more data that is
>    represented in both the shared cache and the OS buffer cache,
>    which seems rather a waste;
>
> b) The larger the shared cache, the more pages there are for the
>    backend to rummage through before it looks to the filesystem,
>    and therefore the more expensive cache misses get. Cache hits
>    get more expensive, too. Searching through memory is not
>    costless.

--
#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.                                  #
#================================================== JanWieck@Yahoo.com #
Re: [Testperf-general] Re: First set of OSDL Shared Mem scalability results, some wierdness ...
From: Josh Berkus
Simon, > I agree that you could test this by running on a bigger or smaller server, > i.e. one with more or less RAM. Running on a faster/slower server at the > same time might alter the results and confuse the situation. Unfortunately, a faster server is the only option I have that also has more RAM. If I double the RAM and double the processors at the same time, what would you expect to happen to the shared_buffers curve? -- --Josh Josh Berkus Aglio Database Solutions San Francisco
Simon, Folks, I've put links to all of my OSDL-STP test results up on the TestPerf project: http://pgfoundry.org/forum/forum.php?thread_id=164&forum_id=160 Share & Enjoy! -- --Josh Josh Berkus Aglio Database Solutions San Francisco
On Sat, 9 Oct 2004, Tom Lane wrote: > mmap provides msync which is comparable to fsync, but AFAICS it > provides no way to prevent an in-memory change from reaching disk too > soon. This would mean that WAL entries would have to be written *and > flushed* before we could make the data change at all, which would > convert multiple updates of a single page into a series of write-and- > wait-for-WAL-fsync steps. Not good. fsync'ing WAL once per transaction > is bad enough, once per atomic action is intolerable. Back when I was working out how to do this, I reckoned that you could use mmap by keeping a write queue for each modified page. Reading, you'd have to read the datum from the page and then check the write queue for that page to see if that datum had been updated, using the new value if it's there. Writing, you'd add the modified datum to the write queue, but not apply the write queue to the page until you'd had confirmation that the corresponding transaction log entry had been written. So multiple writes are no big deal; they just all queue up in the write queue, and at any time you can apply as much of the write queue to the page itself as the current log entry will allow. There are several different strategies available for mapping and unmapping the pages, and in fact there might need to be several available to get the best performance out of different systems. Most OSes do not seem to be optimized for having thousands or tens of thousands of small mappings (certainly NetBSD isn't), but I've never done any performance tests to see what kind of strategies might work well or not. cjs -- Curt Sampson <cjs@cynic.net> +81 90 7737 2974 http://www.NetBSD.org Make up enjoying your city life...produced by BIC CAMERA
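To make the idea concrete, here is a rough, simplified sketch of such a per-page write queue (all names, sizes, and the XLOG-position type are invented for illustration; locking, overflow handling, and page replacement are ignored):

```c
/*
 * Rough sketch of the scheme described above, invented names throughout.
 * Each modified page keeps a queue of (offset, length, data, required-WAL
 * position) entries; the queue is applied to the mmap'ed page only once
 * WAL has been flushed at least that far.
 */
#include <string.h>

typedef unsigned long XLogPos;          /* stand-in for an XLOG position */

#define MAX_QUEUED 32
#define MAX_DATUM  64

typedef struct
{
    size_t   offset;                    /* where in the page */
    size_t   len;
    char     data[MAX_DATUM];
    XLogPos  wal_needed;                /* WAL must be flushed to here first */
} QueuedWrite;

typedef struct
{
    char        *page;                  /* the mmap'ed 8 KB page */
    QueuedWrite  queue[MAX_QUEUED];
    int          nqueued;
} PageWriteQueue;

/* Writing: record the change but don't touch the mapped page yet. */
static void
queue_write(PageWriteQueue *q, size_t off, const void *data, size_t len,
            XLogPos wal_needed)
{
    QueuedWrite *w = &q->queue[q->nqueued++];   /* overflow handling omitted */

    w->offset = off;
    w->len = len;
    w->wal_needed = wal_needed;
    memcpy(w->data, data, len);
}

/* Reading: the newest queued value wins over the page contents. */
static void
read_datum(const PageWriteQueue *q, size_t off, void *out, size_t len)
{
    int i;

    memcpy(out, q->page + off, len);
    for (i = 0; i < q->nqueued; i++)
        if (q->queue[i].offset == off && q->queue[i].len == len)
            memcpy(out, q->queue[i].data, len);
}

/* Apply as much of the queue as the current WAL flush point allows. */
static void
apply_queue(PageWriteQueue *q, XLogPos wal_flushed_to)
{
    int i, kept = 0;

    for (i = 0; i < q->nqueued; i++)
    {
        if (q->queue[i].wal_needed <= wal_flushed_to)
            memcpy(q->page + q->queue[i].offset,
                   q->queue[i].data, q->queue[i].len);
        else
            q->queue[kept++] = q->queue[i];
    }
    q->nqueued = kept;
}
```

The point of the structure shows up in apply_queue(): nothing reaches the mmap'ed page -- and hence, potentially, the disk -- until WAL has been flushed past the position recorded with the queued write, which is the ordering guarantee Tom worries mmap alone cannot provide.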
Curt Sampson <cjs@cynic.net> writes: > Back when I was working out how to do this, I reckoned that you could > use mmap by keeping a write queue for each modified page. Reading, > you'd have to read the datum from the page and then check the write > queue for that page to see if that datum had been updated, using the > new value if it's there. Writing, you'd add the modified datum to the > write queue, but not apply the write queue to the page until you'd had > confirmation that the corresponding transaction log entry had been > written. So multiple writes are no big deal; they just all queue up in > the write queue, and at any time you can apply as much of the write > queue to the page itself as the current log entry will allow. Seems to me the overhead of any such scheme would swamp the savings from avoiding kernel/userspace copies ... the locking issues alone would be painful. regards, tom lane
On Sat, 23 Oct 2004, Tom Lane wrote: > Seems to me the overhead of any such scheme would swamp the savings from > avoiding kernel/userspace copies ... Well, one really can't know without testing, but memory copies are extremely expensive if they go outside of the cache. > the locking issues alone would be painful. I don't see why they would be any more painful than the current locking issues. In fact, I don't see any reason to add more locking than we already use when updating pages. cjs -- Curt Sampson <cjs@cynic.net> +81 90 7737 2974 http://www.NetBSD.org Make up enjoying your city life...produced by BIC CAMERA
Curt Sampson <cjs@cynic.net> writes: > On Sat, 23 Oct 2004, Tom Lane wrote: >> Seems to me the overhead of any such scheme would swamp the savings from >> avoiding kernel/userspace copies ... > Well, one really can't know without testing, but memory copies are > extremely expensive if they go outside of the cache. Sure, but what about all the copying from write queue to page? >> the locking issues alone would be painful. > I don't see why they would be any more painful than the current locking > issues. Because there are more locks --- the write queue data structure will need to be locked separately from the page. (Even with a separate write queue per page, there will need to be a shared data structure that allows you to allocate and find write queues, and that thing will be a subject of contention. See BufMgrLock, which is not held while actively twiddling the contents of pages, but is a serious cause of contention anyway.) regards, tom lane
On Sun, 24 Oct 2004, Tom Lane wrote: > > Well, one really can't know without testing, but memory copies are > > extremely expensive if they go outside of the cache. > > Sure, but what about all the copying from write queue to page? There's a pretty big difference between few-hundred-bytes-on-write and eight-kilobytes-with-every-read memory copy. As for the queue allocation, again, I have no data to back this up, but I don't think it would be as bad as BufMgrLock. Not every page will have a write queue, and a "hot" page is only going to get one once. (If a page has a write queue, you might as well leave it with the page after flushing it, and get rid of it only when the page leaves memory.) I see the OS issues related to mapping that much memory as a much bigger potential problem. cjs -- Curt Sampson <cjs@cynic.net> +81 90 7737 2974 http://www.NetBSD.org Make up enjoying your city life...produced by BIC CAMERA
Curt Sampson <cjs@cynic.net> writes: > I see the OS issues related to mapping that much memory as a much bigger > potential problem. I see potential problems everywhere I look ;-) Considering that the available numbers suggest we could win just a few percent (and that's assuming that all this extra mechanism has zero cost), I can't believe that the project is worth spending manpower on. There is a lot of much more attractive fruit hanging at lower levels. The bitmap-indexing stuff that was recently being discussed, for instance, would certainly take less effort than this; it would create no new portability issues; and at least for the queries where it helps, it could offer integer-multiple speedups, not percentage points. My engineering professors taught me that you put large effort where you have a chance at large rewards. Converting PG to mmap doesn't seem to meet that test, even if I believed it would work. regards, tom lane
On Sun, 24 Oct 2004, Tom Lane wrote: > Considering that the available numbers suggest we could win just a few > percent... I must confess that I was completely unaware of these "numbers." Where do I find them? cjs -- Curt Sampson <cjs@cynic.net> +81 90 7737 2974 http://www.NetBSD.org Make up enjoying your city life...produced by BIC CAMERA
Curt Sampson <cjs@cynic.net> writes: > On Sun, 24 Oct 2004, Tom Lane wrote: >> Considering that the available numbers suggest we could win just a few >> percent... > I must confess that I was completely unaware of these "numbers." Where > do I find them? The only numbers I've seen that directly bear on the question is the oprofile results that Josh recently put up for the DBT-3 benchmark, which showed the kernel copy-to-userspace and copy-from-userspace subroutines eating a percent or two apiece of the total runtime. I don't have the URL at hand but it was posted just a few days ago. (Now that covers all such copies and not only our datafile reads/writes, but it's probably fair to assume that the datafile I/O is the bulk of it.) This is, of course, only one benchmark ... but lacking any measurements in opposition, I'm inclined to believe it. regards, tom lane
I wrote: > I don't have the URL at hand but it was posted just a few days ago. ... actually, it was the beginning of this here thread ... regards, tom lane