Thread: Performance TODO items

Performance TODO items

From: Bruce Momjian
I have thought of a few new TODO performance items:

1)  Someone at O'Reilly suggested that we order our duplicate index
entries by tid so if we are hitting the heap for lots of duplicates, the
hits will be on sequential pages.  Seems like a nice idea.

2)  After Tatsuo's report of running 1000 backends on pgbench and from a
Solaris report, I think we should have a queue of backends waiting for a
spinlock, rather than having them sleep and try again.  This is
particularly important for multi-processor machines.

3)  I am reading the Solaris Internals book and there is mention of a
"free behind" capability with large sequential scans.  When a large
sequential scan happens that would wipe out all the old cache entries,
the kernel detects this and places its previous pages first on the free
list.  For our code, if we do a sequential scan of a table that is
larger than our buffer cache size, I think we should detect this and do
the same.  See http://techdocs.postgresql.org for my performance paper
for an example.
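
Roughly what I have in mind, sketched with made-up names (the freelist
routines are stand-ins, not our actual bufmgr code):

/*
 * Rough free-behind sketch (hypothetical names, not our bufmgr code).
 * Once a sequential scan has read more pages than fit in the buffer
 * pool, caching the rest can only evict useful pages, so the scan's
 * buffers go to the front of the free list for immediate reuse.
 */
#include <stdbool.h>

#define NBUFFERS 1024                  /* size of the shared buffer pool */

typedef struct ScanState
{
    long pages_read;                   /* sequential pages read so far */
    bool free_behind;                  /* true once the scan exceeds the cache */
} ScanState;

/* Assumed freelist routines; stand-ins for the real buffer manager. */
extern void put_on_freelist_head(int buf_id);
extern void put_on_freelist_tail(int buf_id);

/* Called once per heap page fetched by the sequential scan. */
void
scan_note_page(ScanState *scan)
{
    scan->pages_read++;
    if (!scan->free_behind && scan->pages_read > NBUFFERS)
        scan->free_behind = true;
}

/* Called when the scan is finished with a buffer. */
void
scan_release_buffer(const ScanState *scan, int buf_id)
{
    if (scan->free_behind)
        put_on_freelist_head(buf_id);  /* reuse this buffer first */
    else
        put_on_freelist_tail(buf_id);  /* normal LRU release */
}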

New TODO entries are:
    * Order duplicate index entries by tid
    * Add queue of backends waiting for spinlock
    * Add free-behind capability for large sequential scans
 

I will modify them with any comments people have.

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026


Re: Performance TODO items

From: Bruce Momjian
> > New TODO entries are:
> > 
> >     * Order duplicate index entries by tid
> 
> In other words - add tid to index key: very old idea.

I was thinking during index creation, it would be nice to order them by
tid, but not do lots of work to keep it that way.

> >     * Add queue of backends waiting for spinlock
> 
> We shouldn't mix two different approaches for different
> kinds of short-time internal locks - in one case we need a
> light lmgr (when we're going to keep a lock long enough, e.g. for IO)
> and in other cases we'd better proceed with POSIX mutexes
> or semaphores instead of spinlocks. Queueing backends waiting
> for a spinlock sounds like nonsense - how are you going to protect
> such a queue? With spinlocks? -:)

Yes, I guess so but hopefully we can spin waiting for the queue lock
rather than sleep.  We could use POSIX spinlocks/semaphores now but we
don't because of performance, right?

Should we be spinning waiting for spinlock on multi-cpu machines?  Is
that the answer?

--  Bruce Momjian                        |  http://candle.pha.pa.us pgman@candle.pha.pa.us               |  (610)
853-3000+  If your life is a hard drive,     |  830 Blythe Avenue +  Christ can be your backup.        |  Drexel Hill,
Pennsylvania19026
 


RE: Performance TODO items

From: "Mikheev, Vadim"
> New TODO entries are:
> 
>     * Order duplicate index entries by tid

In other words - add tid to index key: very old idea.

>     * Add queue of backends waiting for spinlock

We shouldn't mix two different approaches for different
kinds of short-time internal locks - in one case we need a
light lmgr (when we're going to keep a lock long enough, e.g. for IO)
and in other cases we'd better proceed with POSIX mutexes
or semaphores instead of spinlocks. Queueing backends waiting
for a spinlock sounds like nonsense - how are you going to protect
such a queue? With spinlocks? -:)

Vadim


RE: Performance TODO items

From: "Mikheev, Vadim"
> > >     * Order duplicate index entries by tid
> > 
> > In other words - add tid to index key: very old idea.
> 
> I was thinking during index creation, it would be nice to
> order them by tid, but not do lots of work to keep it that way.

I hear this "not do lots of work" so often from you -:)
Days of simplicity are gone, Bruce. To continue, this project
requires more and more complex solutions.

> > >     * Add queue of backends waiting for spinlock
> > 
> > We shouldn't mix two different approaches for different
> > kinds of short-time internal locks - in one case we need a
> > light lmgr (when we're going to keep a lock long enough, e.g. for IO)
> > and in other cases we'd better proceed with POSIX mutexes
> > or semaphores instead of spinlocks. Queueing backends waiting
> > for a spinlock sounds like nonsense - how are you going to protect
> > such a queue? With spinlocks? -:)
> 
> Yes, I guess so but hopefully we can spin waiting for the queue lock
> rather than sleep. We could use POSIX spinlocks/semaphores now but we
> don't because of performance, right?

No. As long as no one proved with test that mutexes are bad for
performance...
Funny, such test would require ~ 1 day of work.

> Should we be spinning waiting for spinlock on multi-cpu machines?  Is
> that the answer?

What do you mean?

Vadim


Re: Performance TODO items

From: Bruce Momjian
> > > >     * Order duplicate index entries by tid
> > > 
> > > In other words - add tid to index key: very old idea.
> > 
> > I was thinking during index creation, it would be nice to
> > order them by tid, but not do lots of work to keep it that way.
> 
> I hear this "not do lots of work" so often from you -:)
> Days of simplicity are gone, Bruce. To continue, this project
> requires more and more complex solutions.

Yep.  I can dream.  :-)

> > > >     * Add queue of backends waiting for spinlock
> > > 
> > > We shouldn't mix two different approaches for different
> > > kinds of short-time internal locks - in one case we need a
> > > light lmgr (when we're going to keep a lock long enough, e.g. for IO)
> > > and in other cases we'd better proceed with POSIX mutexes
> > > or semaphores instead of spinlocks. Queueing backends waiting
> > > for a spinlock sounds like nonsense - how are you going to protect
> > > such a queue? With spinlocks? -:)
> > 
> > Yes, I guess so but hopefully we can spin waiting for the queue lock
> > rather than sleep. We could use POSIX spinlocks/semaphores now but we
> > don't because of performance, right?
> 
> No. As long as no one proved with test that mutexes are bad for
> performance...
> Funny, such test would require ~ 1 day of work.

Good question.  I know the number of function calls to spinlock stuff is
huge.  Seems real semaphores may be a big win on multi-cpu boxes.

> > Should we be spinning waiting for spinlock on multi-cpu machines?  Is
> > that the answer?
> 
> What do you mean?

On a single-cpu machine, if you can't get the spinlock, you should just
sleep and let the process who had it continue.  On a multi-cpu machine,
you perhaps should just keep trying, hoping that the process who holds
it finishes soon.  I wonder if we should add some kind of test at
postmaster startup to see whether we have multiple CPUs, and change
from spinlock sleeping to spinlock spin-waiting accordingly.

Add to this the fact that we can't sleep for less than one clock tick
(10ms) on most machines, and the spinlock stuff needs work.
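
Just to illustrate what I am thinking of, here is a rough sketch of an
adaptive acquire loop.  The names are made up and the GCC test-and-set
builtin stands in for our platform-specific TAS; this is not our actual
s_lock code:

/*
 * Hypothetical adaptive spinlock acquire -- not the real s_lock() code.
 * Detect the CPU count once at startup, spin briefly on multi-CPU boxes
 * on the bet that the holder is running on another CPU, and yield the
 * processor instead of taking a 10ms timed sleep otherwise.
 */
#include <sched.h>
#include <unistd.h>

typedef volatile int slock_t;

#define SPINS_BEFORE_YIELD 100

static long n_cpus = 1;

void
detect_cpus(void)
{
#ifdef _SC_NPROCESSORS_ONLN
    n_cpus = sysconf(_SC_NPROCESSORS_ONLN);
    if (n_cpus < 1)
        n_cpus = 1;
#endif
}

/* Test-and-set: returns the previous value, so 0 means we got the lock. */
static int
tas(slock_t *lock)
{
    return __sync_lock_test_and_set(lock, 1);
}

void
s_lock_sketch(slock_t *lock)
{
    while (tas(lock) != 0)
    {
        if (n_cpus > 1)
        {
            /*
             * Watch the lock word for a while before retrying the TAS;
             * the holder is probably running on another CPU right now.
             */
            int spins = 0;

            while (*lock != 0 && spins++ < SPINS_BEFORE_YIELD)
                ;               /* busy-wait */
            if (*lock == 0)
                continue;       /* looks free; retry the TAS */
        }

        /*
         * Single CPU, or we spun long enough: give up the processor
         * without sleeping a whole clock tick.
         */
        sched_yield();
    }
}

void
s_unlock_sketch(slock_t *lock)
{
    __sync_lock_release(lock);  /* reset the lock word to 0 */
}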

TODO updated:

* Improve spinlock code, perhaps with OS semaphores, sleeper queue, or
  spinning to obtain lock on multi-cpu systems



RE: Performance TODO items

From: "Mikheev, Vadim"
> > > We could use POSIX spinlocks/semaphores now but we
> > > don't because of performance, right?
> > 
> > No. As long as no one proved with test that mutexes are bad for
> > performance...
> > Funny, such test would require ~ 1 day of work.
> 
> Good question. I know the number of function calls to spinlock stuff
> is huge. Seems real semaphores may be a big win on multi-cpu boxes.

Ok, being tired of endless discussions I'll try to use mutexes instead
of spinlocks and run pgbench on my Solaris WS 10 and E4500 (4 CPU) boxes.

Vadim


Re: Performance TODO items

From: Tom Lane
Bruce Momjian <pgman@candle.pha.pa.us> writes:
> New TODO entries are:
>     * Add queue of backends waiting for spinlock

I already see:

* Create spinlock sleepers queue so everyone doesn't wake up at once


BTW, I agree with Vadim's opinion that we should add a new type of lock
(intermediate between spinlocks and lockmanager locks) rather than try
to add new semantics onto spinlocks.  For example, it'd be very nice to
distinguish read-only and read-write access in this new kind of lock,
but we can't expect spinlocks to be able to do that.
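
To make the shape of such a lock concrete, here's a rough sketch
(hypothetical names; a real implementation would queue waiters and put
them to sleep rather than just yielding):

/*
 * Sketch of a lightweight lock sitting between spinlocks and the lock
 * manager: a spinlock protects a tiny amount of shared state, and the
 * lock itself can be held in shared (read) or exclusive (write) mode.
 */
#include <sched.h>
#include <stdbool.h>

typedef struct LightLock
{
    volatile int mutex;         /* spinlock protecting the fields below */
    int          shared_count;  /* number of shared (reader) holders */
    bool         exclusive;     /* true if held in exclusive (writer) mode */
} LightLock;

static void
spin_acquire(volatile int *m)
{
    while (__sync_lock_test_and_set(m, 1))
        sched_yield();
}

static void
spin_release(volatile int *m)
{
    __sync_lock_release(m);
}

void
light_lock_acquire(LightLock *lock, bool exclusive)
{
    for (;;)
    {
        bool got_it;

        spin_acquire(&lock->mutex);
        if (exclusive)
            got_it = !lock->exclusive && lock->shared_count == 0;
        else
            got_it = !lock->exclusive;

        if (got_it)
        {
            if (exclusive)
                lock->exclusive = true;
            else
                lock->shared_count++;
        }
        spin_release(&lock->mutex);

        if (got_it)
            return;
        sched_yield();          /* real code would queue and sleep here */
    }
}

void
light_lock_release(LightLock *lock, bool exclusive)
{
    spin_acquire(&lock->mutex);
    if (exclusive)
        lock->exclusive = false;
    else
        lock->shared_count--;
    spin_release(&lock->mutex);
}
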
        regards, tom lane


Re: Performance TODO items

From: Tom Lane
Bruce Momjian <pgman@candle.pha.pa.us> writes:
> Should we be spinning waiting for spinlock on multi-cpu machines?  Is
> that the answer?

A multi-CPU machine is actually the only case where a true spinlock
*does* make sense.  On a single CPU you might as well yield the CPU
immediately, because you have no chance of getting the lock until the
current holder is allowed to run again.  On a multi CPU it's a
reasonable guess that the current holder is running on one of the other
CPUs and may release the lock soon ("soon" == less than a process
dispatch cycle, hence busy-wait is better than release CPU).

We are currently using spinlocks for a lot of situations where the mean
time spent holding the lock is probably larger than "soon" as defined
above.  We should have a different lock implementation for those cases.
True spinlocks should be reserved for protecting code where the maximum
time spent holding the lock is guaranteed to be very short.
        regards, tom lane


RE: Performance TODO items

From: "Darren King"
> 3)  I am reading the Solaris Internals book and there is mention of a
> "free behind" capability with large sequential scans.  When a large
> sequential scan happens that would wipe out all the old cache entries,
> the kernel detects this and places its previous pages first
> on the free list.  For our code, if we do a sequential scan of a table
> that is larger than our buffer cache size, I think we should detect
> this and do the same.  See http://techdocs.postgresql.org for my
> performance paper for an example.
>
> New TODO entries are:
>
>     * Order duplicate index entries by tid
>     * Add queue of backends waiting for spinlock
>     * Add free-behind capability for large sequential scans

So why do we cache sequentially-read pages?  Or at least not have an
option to control it?

Oracle (to the best of my knowledge) does NOT cache pages read by a
sequential index scan for at least two reasons/assumptions (two being
all that I can recall):

1. Caching pages for sequential scans over sufficiently large tables
will just cycle the cache.  The pages that will be cached at the end of
the query will be the last N pages of the table, so when the same
sequential query is run again, the scan from the beginning of the table
will start flushing the oldest cached pages which are more than likely
going to be the ones that will be needed at the end of the scan, etc,
etc.  In a multi-user environment, the effect is worse.

2. Concurrent or consecutive queries in a dynamic database will not
generate plans that use the same sequential scans, so they will tend to
thrash the cache.

Now there are some databases where the same general queries are run time
after time and caching the pages from sequential scans does make sense,
but in larger, enterprise-type systems, indices are created to help
speed up the most used queries and the sequential cache entries only
serve to clutter the cache and flush the useful pages.

Is there any way that caching pages read in by a sequential scan could
be made a configurable-option?

Any chance someone could run pgbench on a test system set up to not
cache sequential reads?

Darren



Re: Performance TODO items

From: Bruce Momjian
> Bruce Momjian <pgman@candle.pha.pa.us> writes:
> > New TODO entries are:
> >     * Add queue of backends waiting for spinlock
> 
> I already see:
> 
> * Create spinlock sleepers queue so everyone doesn't wake up at once

That is an old copy of the TODO.  I reworded it.  You will only see this
now:
* Improve spinlock code, perhaps with OS semaphores, sleeper queue, or
  spinning to obtain lock on multi-cpu systems

> BTW, I agree with Vadim's opinion that we should add a new type of lock
> (intermediate between spinlocks and lockmanager locks) rather than try
> to add new semantics onto spinlocks.  For example, it'd be very nice to
> distinguish read-only and read-write access in this new kind of lock,
> but we can't expect spinlocks to be able to do that.

Yes, I agree too.  On a uniprocessor machine, if I can't get the
spinlock, I want to yield the cpu, ideally for less than 10ms.  On a
multi-cpu machine, if the lock is held by another CPU and that process
is running, we want to spin waiting for the lock to free.   If not, we
can sleep.  We basically need some more sophisticated semantics around
these locks, or move to OS semaphores.

In fact, can't we sleep on an interruptible system call, and send
signals to processes when we release the lock?  That way we don't have
the 10ms sleep limitation.  One idea is to have a byte for each backend
in shared memory and have each backend set its byte if it is waiting for
the semaphore.  There would be no contention with multiple backends
registering their sleep at the same time.
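
Roughly what I am picturing, with made-up names (the real thing would
have to retry the lock after registering, or a wakeup sent in that
window could be lost):

/*
 * Sketch of the byte-per-backend idea: each backend owns one slot in a
 * shared-memory table, so registering as a waiter needs no shared lock,
 * and the releaser signals every registered waiter.
 */
#include <signal.h>
#include <sys/types.h>

#define MAX_BACKENDS 256

typedef struct WaitTable
{
    volatile sig_atomic_t waiting[MAX_BACKENDS]; /* 1 = backend is asleep */
    pid_t                 pid[MAX_BACKENDS];     /* pid of each backend slot */
} WaitTable;

static void
wakeup_handler(int sig)
{
    (void) sig;                 /* nothing to do; sigsuspend() just returns */
}

void
setup_wakeup_signal(void)
{
    struct sigaction sa;

    sa.sa_handler = wakeup_handler;
    sigemptyset(&sa.sa_mask);
    sa.sa_flags = 0;
    sigaction(SIGUSR1, &sa, NULL);
}

/* Called by a backend that failed to get the lock. */
void
wait_for_lock(WaitTable *wt, int my_slot)
{
    sigset_t block_usr1, wait_mask;

    /* Hold off SIGUSR1 while we register, so a wakeup can't slip past. */
    sigemptyset(&block_usr1);
    sigaddset(&block_usr1, SIGUSR1);
    sigprocmask(SIG_BLOCK, &block_usr1, NULL);

    wt->waiting[my_slot] = 1;

    /* ...a real version would try the lock once more right here... */

    /* Sleep until a signal arrives; SIGUSR1 is unblocked atomically. */
    sigemptyset(&wait_mask);
    sigsuspend(&wait_mask);

    wt->waiting[my_slot] = 0;
    sigprocmask(SIG_UNBLOCK, &block_usr1, NULL);
}

/* Called by the backend releasing the lock. */
void
wake_waiters(const WaitTable *wt)
{
    for (int i = 0; i < MAX_BACKENDS; i++)
        if (wt->waiting[i])
            kill(wt->pid[i], SIGUSR1);
}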

We have seen reports of 4-cpu systems having poor performance while the
system is only 12% busy, perhaps because the processes are all sleeping
waiting for the next tick.

I think multi-cpu machines are going to give us new challenges.



Re: Performance TODO items

From: Bruce Momjian
> So why do we cache sequentially-read pages?  Or at least not have an
> option to control it?
> 
> Oracle (to the best of my knowledge) does NOT cache pages read by a
> sequential index scan for at least two reasons/assumptions (two being
> all that I can recall):
> 
> 1. Caching pages for sequential scans over sufficiently large tables
> will just cycle the cache.  The pages that will be cached at the end of
> the query will be the last N pages of the table, so when the same
> sequential query is run again, the scan from the beginning of the table
> will start flushing the oldest cached pages which are more than likely
> going to be the ones that will be needed at the end of the scan, etc,
> etc.  In a multi-user environment, the effect is worse.
> 
> 2. Concurrent or consecutive queries in a dynamic database will not
> generate plans that use the same sequential scans, so they will tend to
> thrash the cache.
> 
> Now there are some databases where the same general queries are run time
> after time and caching the pages from sequential scans does make sense,
> but in larger, enterprise-type systems, indices are created to help
> speed up the most used queries and the sequential cache entries only
> serve to clutter the cache and flush the useful pages.
> 
> Is there any way that caching pages read in by a sequential scan could
> be made a configurable-option?
> 
> Any chance someone could run pgbench on a test system set up to not
> cache sequential reads?

Yep, that is the issue.  If the whole table fits in the cache, it is
great.  If not, it is useless or worse because it forces out other
pages.  Right now the cache is oldest-out and doesn't know anything
about access patterns.  We would need to get that info passed into the
cache, probably as a function parameter.
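
Something like this, sketched with a hypothetical signature and assumed
helper routines standing in for the real buffer code:

/*
 * Sketch of passing the access pattern into the cache as a parameter
 * (hypothetical signature and helpers, not the real ReadBuffer): a big
 * sequential scan recycles a small private ring of buffers instead of
 * pushing everything else out of the cache.
 */
typedef enum AccessPattern
{
    ACCESS_RANDOM,                /* index probes, ordinary page fetches */
    ACCESS_SEQUENTIAL             /* sequential scan larger than the cache */
} AccessPattern;

#define SEQ_RING_SIZE 16          /* buffers a large seqscan may recycle */

/* Assumed helpers, standing in for the real buffer replacement code. */
extern int get_buffer_lru(int relation, long blocknum);
extern int get_buffer_from_ring(int relation, long blocknum, int ring_size);

int
read_buffer_hinted(int relation, long blocknum, AccessPattern pattern)
{
    if (pattern == ACCESS_SEQUENTIAL)
        return get_buffer_from_ring(relation, blocknum, SEQ_RING_SIZE);

    return get_buffer_lru(relation, blocknum);
}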



Re: Performance TODO items

From: Tom Lane
Bruce Momjian <pgman@candle.pha.pa.us> writes:
> I have thought of a few new TODO performance items:
> 1)  Someone at O'Reilly suggested that we order our duplicate index
> entries by tid so if we are hitting the heap for lots of duplicates, the
> hits will be on sequential pages.  Seems like a nice idea.

A more general solution is for indexscan to collect up a bunch of TIDs
from the index, sort them in-memory by TID order, and then probe into
the heap with those TIDs.  This is better than the above because you get
nice ordering of the heap accesses across multiple key values, not just
among the tuples with the same key.  (In a unique or near-unique index,
the above idea is nearly worthless.)

In the best case there are few enough TIDs retrieved from the index that
you can just do this once, but even if there are lots of TIDs, it should
be a win to do this in batches of a few thousand TIDs.  Essentially we
decouple indexscans into separate index-access and heap-access phases.
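
Roughly like this, with assumed index and heap access routines standing
in for the real code:

/*
 * Sketch of the batched indexscan: pull a few thousand TIDs from the
 * index, sort them into heap order, then visit the heap.  The access
 * routines here (index_next_tid, heap_fetch_tid) are hypothetical.
 */
#include <stdlib.h>
#include <stdbool.h>

typedef struct ItemPointer
{
    unsigned int   block;       /* heap block number */
    unsigned short offset;      /* line pointer within the block */
} ItemPointer;

#define TID_BATCH_SIZE 4096

/* Assumed helpers, standing in for the real index and heap access code. */
extern bool index_next_tid(void *scan, ItemPointer *tid);
extern void heap_fetch_tid(const ItemPointer *tid);

static int
tid_compare(const void *a, const void *b)
{
    const ItemPointer *x = a, *y = b;

    if (x->block != y->block)
        return (x->block < y->block) ? -1 : 1;
    if (x->offset != y->offset)
        return (x->offset < y->offset) ? -1 : 1;
    return 0;
}

void
batched_index_scan(void *scan)
{
    ItemPointer batch[TID_BATCH_SIZE];

    for (;;)
    {
        int n = 0;

        /* Index-access phase: collect up to a batch of matching TIDs. */
        while (n < TID_BATCH_SIZE && index_next_tid(scan, &batch[n]))
            n++;
        if (n == 0)
            break;

        /* Sort so the heap is hit in sequential page order. */
        qsort(batch, n, sizeof(ItemPointer), tid_compare);

        /* Heap-access phase. */
        for (int i = 0; i < n; i++)
            heap_fetch_tid(&batch[i]);

        if (n < TID_BATCH_SIZE)
            break;              /* index exhausted */
    }
}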

One big problem is that this doesn't interact well with concurrent VACUUM:
our present solution for concurrent VACUUM assumes that indexscans hold
a pin on an index page until they've finished fetching the pointed-to
heap tuples.  Another objection is that we'd have a harder time
implementing the TODO item of marking an indextuple dead when its
associated heaptuple is dead.  Anyone see a way around these problems?
        regards, tom lane


Re: Performance TODO items

From: Bruce Momjian
> A more general solution is for indexscan to collect up a bunch of TIDs
> from the index, sort them in-memory by TID order, and then probe into
> the heap with those TIDs.  This is better than the above because you get
> nice ordering of the heap accesses across multiple key values, not just
> among the tuples with the same key.  (In a unique or near-unique index,
> the above idea is nearly worthless.)
> 
> In the best case there are few enough TIDs retrieved from the index that
> you can just do this once, but even if there are lots of TIDs, it should
> be a win to do this in batches of a few thousand TIDs.  Essentially we
> decouple indexscans into separate index-access and heap-access phases.
> 
> One big problem is that this doesn't interact well with concurrent VACUUM:
> our present solution for concurrent VACUUM assumes that indexscans hold
> a pin on an index page until they've finished fetching the pointed-to
> heap tuples.  Another objection is that we'd have a harder time
> implementing the TODO item of marking an indextuple dead when its
> associated heaptuple is dead.  Anyone see a way around these problems?

Interesting.  I figured the cache could keep most pages in such a case. 
I was thinking more of helping file system readahead by requesting the
earlier block first in a multi-block request.  Not sure how valuable that
would be.



Re: Performance TODO items

From: Matthew Kirkwood
On Mon, 30 Jul 2001, Bruce Momjian wrote:

> * Improve spinlock code, perhaps with OS semaphores, sleeper queue, or
>   spinning to obtain lock on multi-cpu systems

You may be interested in a discussion which happened over on
linux-kernel a few months ago.

Quite a lot of people want a lightweight userspace semaphore,
and for pretty much the same reasons.

Linus proposed a pretty interesting solution which has the
same minimal overhead as the current spinlocks in the non-
contention case, but avoids the spin where there's contention:

http://www.mail-archive.com/linux-kernel%40vger.kernel.org/msg39615.html
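
The rough shape of the idea (just a sketch, not Linus's actual code; a
POSIX semaphore stands in here for the kernel-side wait, and both the
counter and the semaphore would have to live in shared memory):

/*
 * Sketch of a lightweight userspace semaphore: an atomic counter keeps
 * the uncontended case out of the kernel entirely; only under contention
 * do we fall back to a kernel sleep.
 */
#include <semaphore.h>

typedef struct ULock
{
    volatile int count;         /* 1 = free, 0 = held, <0 = held + waiters */
    sem_t        wait;          /* kernel object used only under contention */
} ULock;

void
ulock_init(ULock *lk)
{
    lk->count = 1;
    sem_init(&lk->wait, 1, 0);  /* process-shared */
}

void
ulock_acquire(ULock *lk)
{
    /* Fast path: take the count from 1 to 0 with no system call. */
    if (__sync_fetch_and_sub(&lk->count, 1) > 0)
        return;

    /* Contended: sleep in the kernel until the holder wakes us. */
    sem_wait(&lk->wait);
}

void
ulock_release(ULock *lk)
{
    /* Fast path: no one is waiting, so no system call is needed. */
    if (__sync_fetch_and_add(&lk->count, 1) >= 0)
        return;

    /* Someone queued up while we held the lock: wake one waiter. */
    sem_post(&lk->wait);
}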

Matthew.



Re: Performance TODO items

From: Bruce Momjian
> On Mon, 30 Jul 2001, Bruce Momjian wrote:
> 
> > * Improve spinlock code, perhaps with OS semaphores, sleeper queue, or
> >   spinning to obtain lock on multi-cpu systems
> 
> You may be interested in a discussion which happened over on
> linux-kernel a few months ago.
> 
> Quite a lot of people want a lightweight userspace semaphore,
> and for pretty much the same reasons.
> 
> Linus proposed a pretty interesting solution which has the
> same minimal overhead as the current spinlocks in the non-
> contention case, but avoids the spin where there's contention:
> 
> http://www.mail-archive.com/linux-kernel%40vger.kernel.org/msg39615.html

Yes, many OS's have user-space spinlocks, for the same performance
reasons (no kernel call).



Re: Performance TODO items

From: Bruce Momjian
> > > > We could use POSIX spinlocks/semaphores now but we
> > > > don't because of performance, right?
> > > 
> > > No. As long as no one proved with test that mutexes are bad for
> > > performance...
> > > Funny, such test would require ~ 1 day of work.
> > 
> > Good question. I know the number of function calls to spinlock stuff
> > is huge. Seems real semaphores may be a big win on multi-cpu boxes.
> 
> Ok, being tired of endless discussions I'll try to use mutexes instead
> of spinlocks and run pgbench on my Solaris WS 10 and E4500 (4 CPU) boxes.

I have updated the TODO list with:
   * Improve spinlock code
        o use SysV semaphores or queue of backends waiting on the lock
        o wakeup sleeper or sleep for less than one clock tick
        o spin for lock on multi-cpu machines, yield on single cpu machines
        o read/write locks
