Thread: Weird context-switching issue on Xeon

Weird context-switching issue on Xeon

From
Josh Berkus
Date:
Folks,

We're seeing some odd issues with hyperthreading-capable Xeons, whether or not
hyperthreading is enabled.  Basically, when a small number of really
heavy-duty queries hit the system and push all of the CPUs to more than 70%
used (about 1/2 user & 1/2 kernel), the system goes to 100,000+ context
switches per second and performance degrades.

I know that there are other Xeon users on this list ... has anyone else seen
anything like this?  The machines are Dells running Red Hat 7.3.

--
-Josh Berkus
 Aglio Database Solutions
 San Francisco


Re: Weird context-switching issue on Xeon

From
Tom Lane
Date:
Josh Berkus <josh@agliodbs.com> writes:
> We're seeing some odd issues with hyperthreading-capable Xeons, whether or not
> hyperthreading is enabled.  Basically, when a small number of really
> heavy-duty queries hit the system and push all of the CPUs to more than 70%
> used (about 1/2 user & 1/2 kernel), the system goes to 100,000+ context
> switches per second and performance degrades.

Strictly a WAG ... but what this sounds like to me is disastrously bad
behavior of the spinlock code under heavy contention.  We thought we'd
fixed the spinlock code for SMP machines awhile ago, but maybe
hyperthreading opens some new vistas for misbehavior ...

> I know that there are other Xeon users on this list ... has anyone else seen
> anything like this?  The machines are Dells running Red Hat 7.3.

What Postgres version?  Is it easy for you to try 7.4?  If we were
really lucky, the random-backoff algorithm added late in 7.4 development
would cure this.

If you can't try 7.4, or want to gather more data first, it would be
good to try to confirm or disprove the theory that the context switches
are coming from spinlock delays.  If they are, they'd be coming from the
select() calls in s_lock() in s_lock.c.  Can you strace or something to
see what kernel calls the context switches occur on?

Another line of thought is that RH 7.3 is a long way back, and it
wasn't so very long ago that Linux still had lots of SMP bugs.  Maybe
what you really need is a kernel update?

            regards, tom lane

Re: Weird context-switching issue on Xeon

From
Josh Berkus
Date:
Tom,

> Strictly a WAG ... but what this sounds like to me is disastrously bad
> behavior of the spinlock code under heavy contention.  We thought we'd
> fixed the spinlock code for SMP machines awhile ago, but maybe
> hyperthreading opens some new vistas for misbehavior ...

Yeah, I thought of that based on the discussion on -Hackers.  But we tried
turning off hyperthreading, with no change in behavior.

> If you can't try 7.4, or want to gather more data first, it would be
> good to try to confirm or disprove the theory that the context switches
> are coming from spinlock delays.  If they are, they'd be coming from the
> select() calls in s_lock() in s_lock.c.  Can you strace or something to
> see what kernel calls the context switches occur on?

Might be worth it ... will suggest that.  Will also try 7.4.

--
-Josh Berkus
 Aglio Database Solutions
 San Francisco


RESOLVED: Re: Weird context-switching issue on Xeon

From
Dirk Lutzebäck
Date:
Tom, Josh,

I think we have the problem resolved after I found the following note
from Tom:

 > A large number of semops may mean that you have excessive contention
 > on some lockable resource, but I don't have enough info to guess what
 > resource.

This was the key to look at: we were missing all indices on a table which
is used heavily and does lots of locking.  After recreating the missing
indices the production system performed normally: no more excessive
semop() calls, load well below 1.0, context switch rates over 20,000 very
rare, mostly in the low thousands or less.

This is quite a relief, but I am sorry the problem was something so simple
and that you wasted time on it, although Tom said he had also seen excessive
semop() calls on another dual Xeon system.

Hyperthreading has been turned off so far, but will be turned on again in
the next few days.  I don't expect any problems then.

I'm not sure if this semop() problem is still an issue, but the database
behaves a bit out of bounds in this situation, i.e. it spends 95% of its
system resources in semop() calls while tables are locked frequently and
for long periods.

Thanks for your help,

Dirk

Finally, here is the current vmstat 1 excerpt, with the problem
resolved:



procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in    cs us sy id wa
 1  0   2308 232508 201924 6976532    0    0   136   464  628   812  5  1  94  0
 0  0   2308 232500 201928 6976628    0    0    96   296  495   484  4  0  95  0
 0  1   2308 232492 201928 6976628    0    0     0   176  347   278  1  0  99  0
 0  0   2308 233484 201928 6976596    0    0    40   580  443   351  8  2  90  0
 1  0   2308 233484 201928 6976696    0    0    76   692  792   651  9  2  88  0
 0  0   2308 233484 201928 6976696    0    0     0    20  132    34  0  0 100  0
 0  0   2308 233484 201928 6976696    0    0     0    76  177    90  0  0 100  0
 0  1   2308 233484 201928 6976696    0    0     0   216  321   250  4  0  96  0
 0  0   2308 233484 201928 6976696    0    0     0   116  417   240  8  0  92  0
 0  0   2308 233484 201928 6976784    0    0    48   600  403   270  8  0  92  0
 0  0   2308 233464 201928 6976860    0    0    76   452 1064  2611 14  1  84  0
 0  0   2308 233460 201932 6976900    0    0    32   256  587   587 12  1  87  0
 0  0   2308 233460 201932 6976932    0    0    32   188  379   287  5  0  94  0
 0  0   2308 233460 201932 6976932    0    0     0     0  103     8  0  0 100  0
 0  0   2308 233460 201932 6976932    0    0     0     0  102    14  0  0 100  0
 0  1   2308 233444 201948 6976932    0    0     0   348  300   180  1  0  99  0
 1  0   2308 233424 201948 6976948    0    0    16   380  739   906  4  2  93  0
 0  0   2308 233424 201948 6977032    0    0    68   260  724   987  7  0  92  0
 0  0   2308 231924 201948 6977128    0    0    96   344 1130   753 11  1  88  0
 1  0   2308 231924 201948 6977248    0    0   112   324  687   628  3  0  97  0
 0  0   2308 231924 201948 6977248    0    0     0   192  575   430  5  0  95  0
 1  0   2308 231924 201948 6977248    0    0     0   264  208   124  0  0 100  0
 0  0   2308 231924 201948 6977264    0    0    16   272  380   230  3  2  95  0
 0  0   2308 231924 201948 6977264    0    0     0     0  104     8  0  0 100  0
 0  0   2308 231924 201948 6977264    0    0     0    48  258    92  1  0  99  0
 0  0   2308 231816 201948 6977484    0    0   212   268  456   384  2  0  98  0
 0  0   2308 231816 201948 6977484    0    0     0    88  453   770  0  0  99  0
 0  0   2308 231452 201948 6977680    0    0   196   476  615   676  5  0  94  0
 0  0   2308 231452 201948 6977680    0    0     0   228  431   400  2  0  98  0
 0  0   2308 231452 201948 6977680    0    0     0     0  237    58  3  0  97  0
 0  0   2308 231448 201952 6977680    0    0     0     0  365    84  2  0  97  0
 0  0   2308 231448 201952 6977680    0    0     0    40  246   108  1  0  99  0
 0  0   2308 231448 201952 6977776    0    0    96   352  606  1026  4  2  94  0
 0  0   2308 231448 201952 6977776    0    0     0   240  295   266  5  0  95  0



Re: RESOLVED: Re: Weird context-switching issue on Xeon

From
Tom Lane
Date:
Dirk Lutzebäck <lutzeb@aeccom.com> writes:
> This was the key to look at: we were missing all indices on a table which
> is used heavily and does lots of locking.  After recreating the missing
> indices the production system performed normally: no more excessive
> semop() calls, load well below 1.0, context switch rates over 20,000 very
> rare, mostly in the low thousands or less.

Hmm ... that's darn interesting.  AFAICT the test case I am looking at
for Josh's client has no such SQL-level problem ... but I will go back
and double check ...

            regards, tom lane

Re: RESOLVED: Re: Weird context-switching issue on Xeon

From
Josh Berkus
Date:
Dirk,

> I'm not sure if this semop() problem is still an issue but the database
> behaves a bit out of bounds in this situation, i.e. consuming system
> resources with semop() calls 95% while tables are locked very often and
> longer.

It would be helpful to us if you could test this with the indexes disabled on
the non-Bigmem system.   I'd like to eliminate Bigmem as a factor, if
possible.

--
-Josh Berkus

______AGLIO DATABASE SOLUTIONS___________________________
                                        Josh Berkus
    Enterprise vertical business        josh@agliodbs.com
     and data analysis solutions        (415) 752-2387
      and database optimization           fax 651-9224
  utilizing Open Source technology      San Francisco


Re: Weird context-switching issue on Xeon

From
Tom Lane
Date:
After some further digging I think I'm starting to understand what's up
here, and the really fundamental answer is that a multi-CPU Xeon MP box
sucks for running Postgres.

I did a bunch of oprofile measurements on a machine belonging to one of
Josh's clients, using a test case that involved heavy concurrent access
to a relatively small amount of data (little enough to fit into Postgres
shared buffers, so that no I/O or kernel calls were really needed once
the test got going).  I found that by nearly any measure --- elapsed
time, bus transactions, or machine-clear events --- the spinlock
acquisitions associated with grabbing and releasing the BufMgrLock took
an unreasonable fraction of the time.  I saw about 15% of elapsed time,
40% of bus transactions, and nearly 100% of pipeline-clear cycles going
into what is essentially two instructions out of the entire backend.
(Pipeline clears occur when the cache coherency logic detects a memory
write ordering problem.)

I am not completely clear on why this machine-level bottleneck manifests
as a lot of context swaps at the OS level.  I think what is happening is
that because SpinLockAcquire is so slow, a process is much more likely
than you'd normally expect to arrive at SpinLockAcquire while another
process is also acquiring the spinlock.  This puts the two processes
into a "lockstep" condition where the second process is nearly certain
to observe the BufMgrLock as locked, and be forced to suspend itself,
even though the time the first process holds the BufMgrLock is not
really very long at all.

If you google for Xeon and "cache coherency" you'll find quite a bit of
suggestive information about why this might be more true on the Xeon
setup than others.  A couple of interesting hits:

http://www.theinquirer.net/?article=10797
says that Xeon MP uses a *slower* FSB than Xeon DP.  This would
translate directly to more time needed to transfer a dirty cache line
from one processor to the other, which is the basic operation that we're
talking about here.

http://www.aceshardware.com/Spades/read.php?article_id=30000187
says that Opterons use a different cache coherency protocol that is
fundamentally superior to the Xeon's, because dirty cache data can be
transferred directly between two processor caches without waiting for
main memory.

So in the short term I think we have to tell people that Xeon MP is not
the most desirable SMP platform to run Postgres on.  (Josh thinks that
the specific motherboard chipset being used in these machines might
share some of the blame too.  I don't have any evidence for or against
that idea, but it's certainly possible.)

In the long run, however, CPUs continue to get faster than main memory
and the price of cache contention will continue to rise.  So it seems
that we need to give up the assumption that SpinLockAcquire is a cheap
operation.  In the presence of heavy contention it won't be.

One thing we probably have got to do soon is break up the BufMgrLock
into multiple finer-grain locks so that there will be less contention.
However I am wary of doing this incautiously, because if we do it in a
way that makes for a significant rise in the number of locks that have
to be acquired to access a buffer, we might end up with a net loss.

I think Neil Conway was looking into how the bufmgr might be
restructured to reduce lock contention, but if he had come up with
anything he didn't mention exactly what.  Neil?

            regards, tom lane

Re: Weird context-switching issue on Xeon

From
Dave Cramer
Date:
So the kernel/OS is irrelevant here?  This happens on any dual Xeon?

What about hyperthreading?  Does it still happen if HTT is turned off?

Dave
On Sun, 2004-04-18 at 17:47, Tom Lane wrote:
> After some further digging I think I'm starting to understand what's up
> here, and the really fundamental answer is that a multi-CPU Xeon MP box
> sucks for running Postgres.
> [...]

--
Dave Cramer
519 939 0336
ICQ # 14675561


Re: Weird context-switching issue on Xeon

From
Greg Stark
Date:
Tom Lane <tgl@sss.pgh.pa.us> writes:

> So in the short term I think we have to tell people that Xeon MP is not
> the most desirable SMP platform to run Postgres on.  (Josh thinks that
> the specific motherboard chipset being used in these machines might
> share some of the blame too.  I don't have any evidence for or against
> that idea, but it's certainly possible.)
>
> In the long run, however, CPUs continue to get faster than main memory
> and the price of cache contention will continue to rise.  So it seems
> that we need to give up the assumption that SpinLockAcquire is a cheap
> operation.  In the presence of heavy contention it won't be.

There's nothing about the way Postgres spinlocks are coded that affects this?

Is it something the kernel could help with? I've been wondering whether
there's any benefits postgres is missing out on by using its own hand-rolled
locking instead of using the pthreads infrastructure that the kernel is often
involved in.

--
greg

Re: Weird context-switching issue on Xeon

From
Tom Lane
Date:
Dave Cramer <pg@fastcrypt.com> writes:
> So the the kernel/OS is irrelevant here ? this happens on any dual xeon?

I believe so.  The context-switch behavior might possibly be a little
more pleasant on other kernels, but the underlying spinlock problem is
not dependent on the kernel.

> What about hypterthreading does it still happen if HTT is turned off ?

The problem comes from keeping the caches synchronized between multiple
physical CPUs.  AFAICS enabling HTT wouldn't make it worse, because a
hyperthreaded processor still only has one cache.

            regards, tom lane

Re: Weird context-switching issue on Xeon

From
Tom Lane
Date:
Greg Stark <gsstark@mit.edu> writes:
> There's nothing about the way Postgres spinlocks are coded that affects this?

No.  AFAICS our spinlock sequences are pretty much equivalent to the way
the Linux kernel codes its spinlocks, so there's no deep dark knowledge
to be mined there.

We could possibly use some more-efficient blocking mechanism than semop()
once we've decided we have to block (it's a shame Linux still doesn't
have cross-process POSIX semaphores).  But the striking thing I learned
from looking at the oprofile results is that most of the inefficiency
comes at the very first TAS() operation, before we've even "spun" let
alone decided we have to block.  The s_lock() subroutine does not
account for more than a few percent of the runtime in these tests,
compared to 15% at the inline TAS() operations in LWLockAcquire and
LWLockRelease.  I interpret this to mean that once it's acquired
ownership of the cache line, a Xeon can get through the "spinning"
loop in s_lock() mighty quickly.

            regards, tom lane

Re: Weird context-switching issue on Xeon

From
Tom Lane
Date:
>> What about hyperthreading?  Does it still happen if HTT is turned off?

> The problem comes from keeping the caches synchronized between multiple
> physical CPUs.  AFAICS enabling HTT wouldn't make it worse, because a
> hyperthreaded processor still only has one cache.

Also, I forgot to say that the numbers I'm quoting *are* with HTT off.

            regards, tom lane

Re: RESOLVED: Re: Weird context-switching issue on Xeon

From
Dirk Lutzebäck
Date:
Josh, I cannot reproduce the excessive semop() calls on a dual Xeon DP with
a non-bigmem kernel, HT on.  It would be interesting to know whether the
problem is related to Xeon MP (as Tom wrote) or to bigmem.

Josh Berkus wrote:

>Dirk,
>
>>I'm not sure if this semop() problem is still an issue but the database
>>behaves a bit out of bounds in this situation, i.e. consuming system
>>resources with semop() calls 95% while tables are locked very often and
>>longer.
>
>It would be helpful to us if you could test this with the indexes disabled on
>the non-Bigmem system.  I'd like to eliminate Bigmem as a factor, if
>possible.


Re: Weird context-switching issue on Xeon

From
Dave Cramer
Date:
Here's an interesting link that suggests that hyperthreading would make
things much worse.


http://groups.google.com/groups?q=hyperthreading+dual+xeon+idle&start=10&hl=en&lr=&ie=UTF-8&c2coff=1&selm=aukkonen-FE5275.21093624062003%40shawnews.gv.shawcable.net&rnum=16

another which has some hints as to how it should be handled:


http://groups.google.com/groups?q=hyperthreading+dual+xeon+idle&start=10&hl=en&lr=&ie=UTF-8&c2coff=1&selm=u5tl1XD3BHA.2760%40tkmsftngp04&rnum=19

FWIW, I have anecdotal evidence that supports this: one of my clients was
seeing very high context-switch rates with HTT turned on, and things were
much better without it.

Dave
On Sun, 2004-04-18 at 23:19, Tom Lane wrote:
> >> What about hyperthreading?  Does it still happen if HTT is turned off?
>
> > The problem comes from keeping the caches synchronized between multiple
> > physical CPUs.  AFAICS enabling HTT wouldn't make it worse, because a
> > hyperthreaded processor still only has one cache.
>
> Also, I forgot to say that the numbers I'm quoting *are* with HTT off.
>
>             regards, tom lane
--
Dave Cramer
519 939 0336
ICQ # 14675561


Re: Weird context-switching issue on Xeon

From
"Anjan Dave"
Date:
What about quad-Xeon setups?  Could that be worse?  (We have both dual and
quad setups.)  Should we reconsider Xeon MP machines with high cache (4MB+)?

Very generally, what number would be considered high, especially if it
coincides with expected heavy load?

Not sure a specific chipset was mentioned...
 
Thanks,
Anjan

    -----Original Message----- 
    From: Greg Stark [mailto:gsstark@mit.edu] 
    Sent: Sun 4/18/2004 8:40 PM 
    To: Tom Lane 
    Cc: lutzeb@aeccom.com; Josh Berkus; pgsql-performance@postgresql.org; Neil Conway 
    Subject: Re: [PERFORM] Weird context-switching issue on Xeon
    
    


    Tom Lane <tgl@sss.pgh.pa.us> writes:
    
    > So in the short term I think we have to tell people that Xeon MP is not
    > the most desirable SMP platform to run Postgres on.
    > [...]
    
    There's nothing about the way Postgres spinlocks are coded that affects this?
    
    Is it something the kernel could help with? I've been wondering whether
    there's any benefits postgres is missing out on by using its own hand-rolled
    locking instead of using the pthreads infrastructure that the kernel is often
    involved in.
    
    --
    greg
    
    
    


Re: Weird context-switching issue on Xeon

From
Josh Berkus
Date:
Tom,

> So in the short term I think we have to tell people that Xeon MP is not
> the most desirable SMP platform to run Postgres on.  (Josh thinks that
> the specific motherboard chipset being used in these machines might
> share some of the blame too.  I don't have any evidence for or against
> that idea, but it's certainly possible.)

I have 3 reasons for thinking this:
1) the ServerWorks chipset is present in the fully documented cases that we
have of this problem so far.  This is notable because the SW is notorious
for poor manufacturing quality, so much so that the company that made them
is currently in receivership.  These chips were so bad that Dell was forced
to recall several hundred of its 2650's, whose motherboards caught fire!
2) the main defect of the SW is the NorthBridge, which could conceivably
adversely affect traffic between RAM and the processor cache.
3) Xeon MP is a very popular platform thanks to Dell, yet we are not seeing
more problem reports than we are; if the CPU alone were at fault, I'd
expect many more.

The other thing I'd like your comment on, Tom, is that Dirk appears to have
reported that when he installed a non-bigmem kernel, the issue went away.
Dirk, is this correct?

--
Josh Berkus
Aglio Database Solutions
San Francisco

Re: Weird context-switching issue on Xeon

From
Tom Lane
Date:
Josh Berkus <josh@agliodbs.com> writes:
> The other thing I'd like your comment on, Tom, is that Dirk appears to have
> reported that when he installed a non-bigmem kernel, the issue went away.
> Dirk, is this correct?

I'd be really surprised if that had anything to do with it.  AFAIR
Dirk's test changed more than one variable and so didn't prove a
connection.

            regards, tom lane

Re: Weird context-switching issue on Xeon

From
"J. Andrew Rogers"
Date:
I decided to check the context-switching behavior here for baseline
since we have a rather diverse set of postgres server hardware, though
nothing using Xeon MP that is also running a postgres instance, and
everything looks normal under load.  Some platforms are better than
others, but nothing is outside of what I would consider normal bounds.

Our biggest database servers are Opteron SMP systems, and these servers
are particularly well-behaved under load with Postgres 7.4.2.  If there
is a problem with the locking code and context-switching, it sure isn't
manifesting on our Opteron SMP systems.  Under rare confluences of
process interaction, we occasionally see short spikes into the 2,000-3,000
cs/sec range.  It typically peaks at a couple hundred cs/sec under load.
Obviously this is going to be a function of our load profile to a certain
extent.

The Opterons have proven to be very good database hardware in general
for us.


j. andrew rogers

Re: Weird context-switching issue on Xeon

From
Bruce Momjian
Date:
Josh Berkus wrote:
> Tom,
>
> [...]
>
> The other thing I'd like your comment on, Tom, is that Dirk appears to have
> reported that when he installed a non-bigmem kernel, the issue went away.

I have BSD on a SuperMicro dual Xeon, so if folks want another
hardware/OS combination to test, I can give out logins to my machine.

    http://candle.pha.pa.us/main/hardware.html

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073

Re: Weird context-switching issue on Xeon

From
"scott.marlowe"
Date:
On Mon, 19 Apr 2004, Bruce Momjian wrote:

> Josh Berkus wrote:
> > [...]
>
> I have BSD on a SuperMicro dual Xeon, so if folks want another
> hardware/OS combination to test, I can give out logins to my machine.

I can probably do some nighttime testing on a dual 2800MHz non-MP Xeon
machine as well.  It's a Dell 2600 series machine and very fast.  It has
the moderately fast 533MHz FSB so may not have as many problems as the MP
type CPUs seem to be having.


Re: Weird context-switching issue on Xeon

From
Joe Conway
Date:
scott.marlowe wrote:
> On Mon, 19 Apr 2004, Bruce Momjian wrote:
>>I have BSD on a SuperMicro dual Xeon, so if folks want another
>>hardware/OS combination to test, I can give out logins to my machine.
>
> I can probably do some nighttime testing on a dual 2800MHz non-MP Xeon
> machine as well.  It's a Dell 2600 series machine and very fast.  It has
> the moderately fast 533MHz FSB so may not have as many problems as the MP
> type CPUs seem to be having.

I've got a quad 2.8GHz MP Xeon (IBM x445) that I could test on.  Does
anyone have a test set that can reliably reproduce the problem?

Joe

Re: Weird context-switching issue on Xeon

From
Josh Berkus
Date:
Joe,

> I've got a quad 2.8Ghz MP Xeon (IBM x445) that I could test on. Does
> anyone have a test set that can reliably reproduce the problem?

Unfortunately we can't seem to come up with one.  So far we have 2 machines
that exhibit the issue, and their databases are highly confidential (State of
WA education data).

It does seem to require a database in the many-GB range (> 10GB), and a
situation where a small subset of the data is getting hit repeatedly by
multiple processes.  So you could try your own data warehouse, making sure
that you have at least 4 connections hitting one query after another.

--
-Josh Berkus
 Aglio Database Solutions
 San Francisco


Re: Weird context-switching issue on Xeon

From
Tom Lane
Date:
Josh Berkus <josh@agliodbs.com> writes:
>> I've got a quad 2.8Ghz MP Xeon (IBM x445) that I could test on. Does
>> anyone have a test set that can reliably reproduce the problem?

> Unfortunately we can't seem to come up with one.

> It does seem to require a database which is in the many GB (> 10GB), and a
> situation where a small subset of the data is getting hit repeatedly by
> multiple processes.

I do not think a large database is actually necessary; the test case
Josh's client has is only hitting a relatively small amount of data.
The trick seems to be to cause lots and lots of ReadBuffer/ReleaseBuffer
activity without much else happening, and to do this from multiple
backends concurrently.

I believe the best way to make this happen is a lot of relatively simple
(but not short) indexscan queries that in aggregate touch just a bit
less than shared_buffers worth of data.  I have not tried to make a
self-contained test case, but based on what I know now I think it should
be possible.

I'll give this a shot later tonight --- it does seem that trying to
reproduce the problem on different kinds of hardware is the next useful
step we can take.

            regards, tom lane

Re: Wierd context-switching issue on Xeon

From
Tom Lane
Date:
Here is a test case.  To set up, run the "test_setup.sql" script once;
then launch two copies of the "test_run.sql" script.  (For those of
you with more than two CPUs, see whether you need one per CPU to make
trouble, or whether two test_runs are enough.)  Check that you get a
nestloops-with-index-scans plan shown by the EXPLAIN in test_run.

In isolation, test_run.sql should do essentially no syscalls at all once
it's past the initial ramp-up.  On a machine that's functioning per
expectations, multiple copies of test_run show a relatively low rate of
semop() calls --- a few per second, at most --- and maybe a delaying
select() here and there.

What I actually see on Josh's client's machine is a context swap storm:
"vmstat 1" shows CS rates around 170K/sec.  strace'ing the backends
shows a corresponding rate of semop() syscalls, with a few delaying
select()s sprinkled in.  top(1) shows system CPU percent of 25-30
and idle CPU percent of 16-20.

I haven't bothered to check how long the test_run query takes, but if it
ends while you're still examining the behavior, just start it again.

Note the test case assumes you've got shared_buffers set to at least
1000; with smaller values, you may get some I/O syscalls, which will
probably skew the results.

            regards, tom lane

-- test_setup.sql
drop table test_data;

create table test_data(f1 int);

-- seed one row, then double the table 16 times (2^16 = 65536 rows)
insert into test_data values (random() * 100);
insert into test_data select random() * 100 from test_data;
insert into test_data select random() * 100 from test_data;
insert into test_data select random() * 100 from test_data;
insert into test_data select random() * 100 from test_data;
insert into test_data select random() * 100 from test_data;
insert into test_data select random() * 100 from test_data;
insert into test_data select random() * 100 from test_data;
insert into test_data select random() * 100 from test_data;
insert into test_data select random() * 100 from test_data;
insert into test_data select random() * 100 from test_data;
insert into test_data select random() * 100 from test_data;
insert into test_data select random() * 100 from test_data;
insert into test_data select random() * 100 from test_data;
insert into test_data select random() * 100 from test_data;
insert into test_data select random() * 100 from test_data;
insert into test_data select random() * 100 from test_data;

create index test_index on test_data(f1);

vacuum verbose analyze test_data;
checkpoint;
-- force nestloop indexscan plan
set enable_seqscan to 0;
set enable_mergejoin to 0;
set enable_hashjoin to 0;

explain
select count(*) from test_data a, test_data b, test_data c
where a.f1 = b.f1 and b.f1 = c.f1;

select count(*) from test_data a, test_data b, test_data c
where a.f1 = b.f1 and b.f1 = c.f1;
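A possible driver for the scripts above (a sketch only: it assumes a scratch database named "test", the two scripts saved under the names Tom uses, and Linux userland tools; adjust names and paths to your setup):

```shell
# Build the test data once, then launch two concurrent runs.
psql -f test_setup.sql test

psql -f test_run.sql test &
psql -f test_run.sql test &

# While the queries run, watch the "cs" column for the storm,
# and optionally tally one backend's syscalls (replace <pid>):
vmstat 1 10
# strace -c -p <pid>    # -c summarizes semop()/select() counts

wait
```

On an affected machine, the second vmstat column group should show the CS rate jumping from a few hundred to 100K+ as soon as the second copy starts.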

Re: Wierd context-switching issue on Xeon

From
Tom Lane
Date:
I wrote:
> Here is a test case.

Hmmm ... I've been able to reproduce the CS storm on a dual Athlon,
which seems to pretty much let the Xeon per se off the hook.  Anybody
got a multiple Opteron to try?  Totally non-Intel CPUs?

It would be interesting to see results with non-Linux kernels, too.

            regards, tom lane

Re: Wierd context-switching issue on Xeon

From
Joe Conway
Date:
Tom Lane wrote:
> Here is a test case.  To set up, run the "test_setup.sql" script once;
> then launch two copies of the "test_run.sql" script.  (For those of
> you with more than two CPUs, see whether you need one per CPU to make
> trouble, or whether two test_runs are enough.)  Check that you get a
> nestloops-with-index-scans plan shown by the EXPLAIN in test_run.

Check.

> In isolation, test_run.sql should do essentially no syscalls at all once
> it's past the initial ramp-up.  On a machine that's functioning per
> expectations, multiple copies of test_run show a relatively low rate of
> semop() calls --- a few per second, at most --- and maybe a delaying
> select() here and there.
>
> What I actually see on Josh's client's machine is a context swap storm:
> "vmstat 1" shows CS rates around 170K/sec.  strace'ing the backends
> shows a corresponding rate of semop() syscalls, with a few delaying
> select()s sprinkled in.  top(1) shows system CPU percent of 25-30
> and idle CPU percent of 16-20.

Your test case works perfectly. I ran 4 concurrent psql sessions on a
quad Xeon (IBM x445, 2.8GHz, 4GB RAM), hyperthreaded. Here's what 'top'
looks like:

177 processes: 173 sleeping, 3 running, 1 zombie, 0 stopped
CPU states:  cpu    user    nice  system    irq  softirq  iowait    idle
            total   35.9%    0.0%    7.2%   0.0%     0.0%    0.0%   56.8%
            cpu00   19.6%    0.0%    4.9%   0.0%     0.0%    0.0%   75.4%
            cpu01   44.1%    0.0%    7.8%   0.0%     0.0%    0.0%   48.0%
            cpu02    0.0%    0.0%    0.0%   0.0%     0.0%    0.0%  100.0%
            cpu03   32.3%    0.0%   13.7%   0.0%     0.0%    0.0%   53.9%
            cpu04   21.5%    0.0%   10.7%   0.0%     0.0%    0.0%   67.6%
            cpu05   42.1%    0.0%    9.8%   0.0%     0.0%    0.0%   48.0%
            cpu06  100.0%    0.0%    0.0%   0.0%     0.0%    0.0%    0.0%
            cpu07   27.4%    0.0%   10.7%   0.0%     0.0%    0.0%   61.7%
Mem: 4123700k av, 3933896k used, 189804k free, 0k shrd, 221948k buff
                   2492124k actv,  760612k in_d,   41416k in_c
Swap: 2040244k av, 5632k used, 2034612k free 3113272k cached

Note that the 100% load on cpu06 is not a Postgres process. The output of
vmstat looks like this:

# vmstat 1
procs                      memory      swap          io     system      cpu
r  b swpd   free   buff  cache  si  so   bi   bo  in   cs us sy id wa
4  0 5632 184264 221948 3113308  0   0    0    0   0    0  0  0  0  0
3  0 5632 184264 221948 3113308  0   0    0    0  112 211894 36  9 55  0
5  0 5632 184264 221948 3113308  0   0    0    0  125 222071 39  8 53  0
4  0 5632 184264 221948 3113308  0   0    0    0  110 215097 39 10 52  0
1  0 5632 184588 221948 3113308  0   0    0   96  139 187561 35 10 55  0
3  0 5632 184588 221948 3113308  0   0    0    0  114 241731 38 10 52  0
3  0 5632 184920 221948 3113308  0   0    0    0  132 257168 40  9 51  0
1  0 5632 184912 221948 3113308  0   0    0    0  114 251802 38  9 54  0

> Note the test case assumes you've got shared_buffers set to at least
> 1000; with smaller values, you may get some I/O syscalls, which will
> probably skew the results.

  shared_buffers
----------------
  16384
(1 row)

I found that killing three of the four concurrent queries dropped
context switches to about 70,000 to 100,000. Two or more sessions bring
it up to 200K+.

Joe

Re: Wierd context-switching issue on Xeon

From
Robert Creager
Date:
When grilled further on (Mon, 19 Apr 2004 20:53:09 -0400),
Tom Lane <tgl@sss.pgh.pa.us> confessed:

> I wrote:
> > Here is a test case.
>
> Hmmm ... I've been able to reproduce the CS storm on a dual Athlon,
> which seems to pretty much let the Xeon per se off the hook.  Anybody
> got a multiple Opteron to try?  Totally non-Intel CPUs?
>
> It would be interesting to see results with non-Linux kernels, too.
>

Same problem on my dual AMD MP with a 2.6.5 kernel using two sessions of your
test, but maybe not quite as severe. The highest CS value I saw was 102k, with
some non-db number crunching going on in parallel with the test.  'Average' is
about 80k with two instances.  Using the anticipatory scheduler.

A single instance pulls in around 200-300 CS, and with no tests running it is
also around 200-300 CS (i.e. no CS difference).

A snippet:

procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
 3  0    284  90624  93452 1453740    0    0     0     0 1075 76548 83 17  0  0
 6  0    284 125312  93452 1470196    0    0     0     0 1073 87702 78 22  0  0
 3  0    284 178392  93460 1420208    0    0    76   298 1083 67721 77 24  0  0
 4  0    284 177120  93460 1421500    0    0  1104     0 1054 89593 80 21  0  0
 5  0    284 173504  93460 1425172    0    0  3584     0 1110 65536 81 19  0  0
 4  0    284 169984  93460 1428708    0    0  3456     0 1098 66937 81 20  0  0
 6  0    284 170944  93460 1428708    0    0     8     0 1045 66065 81 19  0  0
 6  0    284 167288  93460 1428776    0    0     0     8 1097 75560 81 19  0  0
 6  0    284 136296  93460 1458356    0    0     0     0 1036 80808 75 26  0  0
 5  0    284 132864  93460 1461688    0    0     0     0 1007 76071 84 17  0  0
 4  0    284 132880  93460 1461688    0    0     0     0 1079 86903 82 18  0  0
 5  0    284 132880  93460 1461688    0    0     0     0 1078 79885 83 17  0  0
 6  0    284 132648  93460 1461688    0    0     0   760 1228 66564 86 14  0  0
 6  0    284 132648  93460 1461688    0    0     0     0 1047 69741 86 15  0  0
 6  0    284 132672  93460 1461688    0    0     0     0 1057 79052 84 16  0  0
 5  0    284 132672  93460 1461688    0    0     0     0 1054 81109 82 18  0  0
 5  0    284 132736  93460 1461688    0    0     0     0 1043 91725 80 20  0  0


Cheers,
Rob

--
 21:33:03 up 3 days,  1:10,  3 users,  load average: 5.05, 4.67, 4.22
Linux 2.6.5-01 #5 SMP Tue Apr 6 21:32:39 MDT 2004


Re: Wierd context-switching issue on Xeon

From
jelle
Date:
Same problem with dual 1GHz P3's running Postgres 7.4.2, Linux 2.4.x, and
2GB RAM, under load, with long transactions (i.e. one "cannot serialize"
rollback per minute). 200K was the worst observed with vmstat.

Finally moved the DB to a single-Xeon box.


Re: Wierd context-switching issue on Xeon

From
ohp@pyrenet.fr
Date:
Hi Tom,

You still have an account on my Unixware Bi-Xeon hyperthreded machine.
Feel free to use it for your tests.
On Mon, 19 Apr 2004, Tom Lane wrote:

> Date: Mon, 19 Apr 2004 20:53:09 -0400
> From: Tom Lane <tgl@sss.pgh.pa.us>
> To: josh@agliodbs.com
> Cc: Joe Conway <mail@joeconway.com>, scott.marlowe <scott.marlowe@ihs.com>,
>      Bruce Momjian <pgman@candle.pha.pa.us>, lutzeb@aeccom.com,
>      pgsql-performance@postgresql.org, Neil Conway <neilc@samurai.com>
> Subject: Re: [PERFORM] Wierd context-switching issue on Xeon
>
> I wrote:
> > Here is a test case.
>
> Hmmm ... I've been able to reproduce the CS storm on a dual Athlon,
> which seems to pretty much let the Xeon per se off the hook.  Anybody
> got a multiple Opteron to try?  Totally non-Intel CPUs?
>
> It would be interesting to see results with non-Linux kernels, too.
>
>             regards, tom lane
>
> ---------------------------(end of broadcast)---------------------------
> TIP 4: Don't 'kill -9' the postmaster
>

--
Olivier PRENANT                    Tel: +33-5-61-50-97-00 (Work)
6, Chemin d'Harraud Turrou           +33-5-61-50-97-01 (Fax)
31190 AUTERIVE                       +33-6-07-63-80-64 (GSM)
FRANCE                          Email: ohp@pyrenet.fr
------------------------------------------------------------------------------
Make your life a dream, make your dream a reality. (St Exupery)

Re: Wierd context-switching issue on Xeon

From
Jeff
Date:
On Apr 19, 2004, at 8:01 PM, Tom Lane wrote:
[test case]

Quad P3-700MHz, ServerWorks, pg 7.4.2:
  1 process:   10-30 cs/sec
  2 processes: 100k cs/sec
  3 processes: 140k cs/sec
  8 processes: 115k cs/sec

Dual P2-450MHz, non-ServerWorks (PIIX):
  1 process:   15-20 cs/sec
  2 processes: 30k cs/sec
  3 (up to 7) processes: 15k cs/sec

(Yes, I verified that with more processes the CS rate drops.)

And finally,

6 cpu sun e4500, solaris 2.6, pg 7.4.2: 1 - 10 processes: hovered
between 2-3k cs/second (there was other stuff running on the machine as
well)


Verrry interesting.
I've got a dual G4 at home, but, conveniently, Apple doesn't ship a
vmstat that reports context switches.

--
Jeff Trout <jeff@jefftrout.com>
http://www.jefftrout.com/
http://www.stuarthamm.net/


Re: Wierd context-switching issue on Xeon

From
Dave Cramer
Date:
Dual Athlon

With one process running 30 cs/second
with two process running 15000 cs/second

Dave
On Tue, 2004-04-20 at 08:46, Jeff wrote:
> On Apr 19, 2004, at 8:01 PM, Tom Lane wrote:
> [test case]
>
> Quad P3-700Mhz, ServerWorks, pg 7.4.2 - 1 process: 10-30 cs / second
>                                                  2 process: 100k cs / sec
>                                    3 process: 140k cs / sec
>                                    8 process: 115k cs / sec
>
> Dual P2-450Mhz, non-serverworks (piix)  - 1 process 15-20 / sec
>                                                  2 process 30k / sec
>                                          3 (up to 7) process: 15k /sec
>
> (Yes, I verified with more processes the cs's drop)
>
> And finally,
>
> 6 cpu sun e4500, solaris 2.6, pg 7.4.2: 1 - 10 processes: hovered
> between 2-3k cs/second (there was other stuff running on the machine as
> well)
>
>
> Verrry interesting.
> I've got a dual G4 at home, but for convenience Apple doesn't ship a
> vmstat that tells context switches
>
> --
> Jeff Trout <jeff@jefftrout.com>
> http://www.jefftrout.com/
> http://www.stuarthamm.net/
>
>
--
Dave Cramer
519 939 0336
ICQ # 14675561


Re: Wierd context-switching issue on Xeon

From
"Matt Clark"
Date:
As a cross-ref to all the 7.4.x tests people have sent in, here's 7.2.3 (Red Hat 7.3), Quad Xeon 700MHz/1MB L2 cache,
3GB RAM.

Idle-ish (it's a production server) cs/sec ~5000

3 test queries running:
   procs                      memory    swap          io     system         cpu
 r  b  w   swpd   free   buff  cache   si  so    bi    bo   in    cs   us  sy  id
 3  0  0  23380 577680 105912 2145140   0   0     0     0  107 116890  50  14  35
 2  0  0  23380 577680 105912 2145140   0   0     0     0  114 118583  50  15  34
 2  0  0  23380 577680 105912 2145140   0   0     0     0  107 115842  54  14  32
 2  1  0  23380 577680 105920 2145140   0   0     0    32  156 117549  50  16  35

HTH

Matt

> -----Original Message-----
> From: pgsql-performance-owner@postgresql.org
> [mailto:pgsql-performance-owner@postgresql.org]On Behalf Of Tom Lane
> Sent: 20 April 2004 01:02
> To: josh@agliodbs.com
> Cc: Joe Conway; scott.marlowe; Bruce Momjian; lutzeb@aeccom.com;
> pgsql-performance@postgresql.org; Neil Conway
> Subject: Re: [PERFORM] Wierd context-switching issue on Xeon
>
>
> Here is a test case.  To set up, run the "test_setup.sql" script once;
> then launch two copies of the "test_run.sql" script.  (For those of
> you with more than two CPUs, see whether you need one per CPU to make
> trouble, or whether two test_runs are enough.)  Check that you get a
> nestloops-with-index-scans plan shown by the EXPLAIN in test_run.
>
> In isolation, test_run.sql should do essentially no syscalls at all once
> it's past the initial ramp-up.  On a machine that's functioning per
> expectations, multiple copies of test_run show a relatively low rate of
> semop() calls --- a few per second, at most --- and maybe a delaying
> select() here and there.
>
> What I actually see on Josh's client's machine is a context swap storm:
> "vmstat 1" shows CS rates around 170K/sec.  strace'ing the backends
> shows a corresponding rate of semop() syscalls, with a few delaying
> select()s sprinkled in.  top(1) shows system CPU percent of 25-30
> and idle CPU percent of 16-20.
>
> I haven't bothered to check how long the test_run query takes, but if it
> ends while you're still examining the behavior, just start it again.
>
> Note the test case assumes you've got shared_buffers set to at least
> 1000; with smaller values, you may get some I/O syscalls, which will
> probably skew the results.
>
>             regards, tom lane
>
>


Re: Wierd context-switching issue on Xeon

From
"Sven Geisler"
Date:
Hi Tom,

Just to explain our hardware situation related to the FSB of the Xeons.
We have older XEON DP machines in operation with a 400 MHz FSB at 2.4 GHz.
The XEON MP box runs at 2.5 GHz.
The XEON MP box is a Fujitsu-Siemens Primergy RX600 with ServerWorks GC LE
as chipset.

The box which Dirk used to compare the behavior is our newest XEON DP
system.
This XEON DP box runs at 2.8 GHz with a 533 MHz FSB, using the Intel 7501
chipset (Supermicro).

I would agree with Josh. If PostgreSQL has an issue with the Intel XEON MP
hardware, it is more likely related to the chipset.

Back to the SQL level: we use SELECT FOR UPDATE as a "semaphore".
Should we try another implementation for this semaphore on the client side to
prevent this issue?

Regards
Sven.

----- Original Message -----
From: "Tom Lane" <tgl@sss.pgh.pa.us>
To: <lutzeb@aeccom.com>
Cc: "Josh Berkus" <josh@agliodbs.com>; <pgsql-performance@postgreSQL.org>;
"Neil Conway" <neilc@samurai.com>
Sent: Sunday, April 18, 2004 11:47 PM
Subject: Re: [PERFORM] Wierd context-switching issue on Xeon


> After some further digging I think I'm starting to understand what's up
> here, and the really fundamental answer is that a multi-CPU Xeon MP box
> sucks for running Postgres.
>
> I did a bunch of oprofile measurements on a machine belonging to one of
> Josh's clients, using a test case that involved heavy concurrent access
> to a relatively small amount of data (little enough to fit into Postgres
> shared buffers, so that no I/O or kernel calls were really needed once
> the test got going).  I found that by nearly any measure --- elapsed
> time, bus transactions, or machine-clear events --- the spinlock
> acquisitions associated with grabbing and releasing the BufMgrLock took
> an unreasonable fraction of the time.  I saw about 15% of elapsed time,
> 40% of bus transactions, and nearly 100% of pipeline-clear cycles going
> into what is essentially two instructions out of the entire backend.
> (Pipeline clears occur when the cache coherency logic detects a memory
> write ordering problem.)
>
> I am not completely clear on why this machine-level bottleneck manifests
> as a lot of context swaps at the OS level.  I think what is happening is
> that because SpinLockAcquire is so slow, a process is much more likely
> than you'd normally expect to arrive at SpinLockAcquire while another
> process is also acquiring the spinlock.  This puts the two processes
> into a "lockstep" condition where the second process is nearly certain
> to observe the BufMgrLock as locked, and be forced to suspend itself,
> even though the time the first process holds the BufMgrLock is not
> really very long at all.
>
> If you google for Xeon and "cache coherency" you'll find quite a bit of
> suggestive information about why this might be more true on the Xeon
> setup than others.  A couple of interesting hits:
>
> http://www.theinquirer.net/?article=10797
> says that Xeon MP uses a *slower* FSB than Xeon DP.  This would
> translate directly to more time needed to transfer a dirty cache line
> from one processor to the other, which is the basic operation that we're
> talking about here.
>
> http://www.aceshardware.com/Spades/read.php?article_id=30000187
> says that Opterons use a different cache coherency protocol that is
> fundamentally superior to the Xeon's, because dirty cache data can be
> transferred directly between two processor caches without waiting for
> main memory.
>
> So in the short term I think we have to tell people that Xeon MP is not
> the most desirable SMP platform to run Postgres on.  (Josh thinks that
> the specific motherboard chipset being used in these machines might
> share some of the blame too.  I don't have any evidence for or against
> that idea, but it's certainly possible.)
>
> In the long run, however, CPUs continue to get faster than main memory
> and the price of cache contention will continue to rise.  So it seems
> that we need to give up the assumption that SpinLockAcquire is a cheap
> operation.  In the presence of heavy contention it won't be.
>
> One thing we probably have got to do soon is break up the BufMgrLock
> into multiple finer-grain locks so that there will be less contention.
> However I am wary of doing this incautiously, because if we do it in a
> way that makes for a significant rise in the number of locks that have
> to be acquired to access a buffer, we might end up with a net loss.
>
> I think Neil Conway was looking into how the bufmgr might be
> restructured to reduce lock contention, but if he had come up with
> anything he didn't mention exactly what.  Neil?
>
> regards, tom lane
>
>


Re: possible improvement between G4 and G5

From
"Aaron Werman"
Date:
There are a few things that you can do to help force yourself to be I/O
bound. These include:

- RAID 5 for write intensive applications, since multiple writes per synch
write is good. (There is a special case for logging or other streaming
sequential writes on RAID 5)

- Data journaling file systems are helpful in stress testing your
checkpoints

- Using midsized battery backed up write through buffering controllers. In
general, if you have a small cache, you see the problem directly, and a huge
cache will balance out load and defer writes to quieter times. That is why a
midsized cache is so useful in showing stress in your system only when it is
being stressed.

Only partly in jest,
/Aaron

BTW - I am truly curious about what happens to your system if you use
separate RAID 0+1 for your logs, disk sorts, and at least the most active
tables. This should reduce I/O load by an order of magnitude.

"Vivek Khera" <khera@kcilink.com> wrote in message
news:x7smez7tqj.fsf@yertle.int.kciLink.com...
> >>>>> "JB" == Josh Berkus <josh@agliodbs.com> writes:
>
> JB> Aaron,
> >> I do consulting, so they're all over the place and tend to be complex.
> >> Very few fit in RAM, but still are very buffered. These are almost all
> >> backed with very high end I/O subsystems, with dozens of spindles with
> >> battery backed up writethrough cache and gigs of buffers, which may be
> >> why I worry so much about CPU. I have had this issue with multiple
> >> servers.
>
> JB> Aha, I think this is the difference.  I never seem to be able to
> JB> get my clients to fork out for adequate disk support.  They are
> JB> always running off single or double SCSI RAID in the host server;
> JB> not the sort of setup you have.
>
> Even when I upgraded my system to a 14-spindle RAID5 with 128M cache
> and 4GB RAM on a dual Xeon system, I still wind up being I/O bound
> quite often.
>
> I think it depends on what your "working set" turns out to be.  My
> workload really spans a lot more of the DB than I can end up caching.
>
> --
> =-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
> Vivek Khera, Ph.D.                Khera Communications, Inc.
> Internet: khera@kciLink.com       Rockville, MD  +1-301-869-4449 x806
> AIM: vivekkhera Y!: vivek_khera   http://www.khera.org/~vivek/
>

Re: Wierd context-switching issue on Xeon

From
Dirk.Lutzebaeck@t-online.de (Dirk Lutzebaeck)
Date:
I would agree with Tom that too many parameters are involved to blame
bigmem. I have access to the following machines where the same
application operates:

a)  Dual (4way) XEON MP, bigmem, HT off, ServerWorks chipset (a
Fujitsu-Siemens Primergy)

performs OK now because missing indexes were added, but this is no proof
that this behaviour won't occur again under high load; context switches are
moderate but have peaks up to 40,000

b) Dual XEON DP, non-bigmem, HT on, ServerWorks chipset (a Dell machine
I think)

performs moderately, because I see too many context switches here although
the mentioned indexes are created; context switches often go up to 30,000,
and I can see 50% semop calls

c) Dual XEON DP, non-bigmem, HT on, E7500 Intel chipset (Supermicro)

performs well and I could not observe context switch peaks here (one
user active), almost no extra semop calls

d) Dual XEON DP, bigmem, HT off, ServerWorks chipset (a Fujitsu-Siemens
Primergy)

performance unknown at the moment (it is offline), but it looked like a) in the past

I can offer to do tests on those machines if somebody would provide me
some test instructions to nail this problem down.

Dirk



Tom Lane wrote:

>Josh Berkus <josh@agliodbs.com> writes:
>
>
>>The other thing I'd like your comment on, Tom, is that Dirk appears to have
>>reported that when he installed a non-bigmem kernel, the issue went away.
>>Dirk, is this correct?
>>
>>
>
>I'd be really surprised if that had anything to do with it.  AFAIR
>Dirk's test changed more than one variable and so didn't prove a
>connection.
>
>            regards, tom lane
>
>
>



Re: Wierd context-switching issue on Xeon

From
Dirk Lutzebäck
Date:
Dirk Lutzebaeck wrote:

> c) Dual XEON DP, non-bigmem, HT on, E7500 Intel chipset (Supermicro)
>
> performs well and I could not observe context switch peaks here (one
> user active), almost no extra semop calls

Did Tom's test here: with 2 processes I'll reach 200k+ CS with peaks to
300k CS. Bummer.. Josh, I don't think you can bash the ServerWorks
chipset here nor bigmem.

Dirk



Re: Wierd context-switching issue on Xeon

From
Paul Tuckfield
Date:
I tried to test how this is related to cache coherency, by forcing
affinity of the two test_run.sql processes to the two cores (pipelines?
threads) of a single hyperthreaded xeon processor in an smp xeon box.

When the processes are allowed to run on distinct chips in the smp box,
the CS storm happens.  When they are "bound" to the two cores of a
single hyperthreaded Xeon in the smp box, the CS storm *does* happen.



I used the taskset command:
taskset 01 -p <pid for backend of test_run.sql 1>
taskset 01 -p <pid for backend of test_run.sql 1>

I guess that 0 and 1 are the two cores (pipelines? hyper-threads?) on
the first Xeon processor in the box.

I did this on RedHat Fedora core1 on an intel motherboard (I'll get the
part no if it matters)

during storms:  300k CS/sec, 75% idle on a dual Xeon (four-core) machine
(suggesting serializing/sleeping processes)
no storm:   50k CS/sec,  50% idle (suggesting 2 CPU-bound processes)


Maybe there's a "hot block" that is bouncing back and forth between
caches? or maybe the page holding semaphores?

On Apr 19, 2004, at 5:53 PM, Tom Lane wrote:

> I wrote:
>> Here is a test case.
>
> Hmmm ... I've been able to reproduce the CS storm on a dual Athlon,
> which seems to pretty much let the Xeon per se off the hook.  Anybody
> got a multiple Opteron to try?  Totally non-Intel CPUs?
>
> It would be interesting to see results with non-Linux kernels, too.
>
>             regards, tom lane
>


Re: Wierd context-switching issue on Xeon

From
Rod Taylor
Date:
> It would be interesting to see results with non-Linux kernels, too.

Dual Celeron 500MHz (Abit BP6 mobo) - client & server on same machine

2 processes FreeBSD (5.2.1): 1800cs
3 processes FreeBSD: 14000cs
4 processes FreeBSD: 14500cs

2 processes Linux (2.4.18 kernel): 52000cs
3 processes Linux: 10000cs
4 processes Linux: 20000cs



Re: Wierd context-switching issue on Xeon

From
Paul Tuckfield
Date:
Oops: what I meant to say was that 2 threads bound to one
(hyperthreaded) CPU do *NOT* cause the storm, even on an SMP Xeon.

Therefore, the context switches may be a result of cache-coherency-related
delays. (Two threads on one hyperthreaded CPU presumably share tightly
coupled L1/L2 caches.)

On Apr 20, 2004, at 1:02 PM, Paul Tuckfield wrote:

> I tried to test how this is related to cache coherency, by forcing
> affinity of the two test_run.sql processes to the two cores
> (pipelines? threads) of a single hyperthreaded xeon processor in an
> smp xeon box.
>
> When the processes are allowed to run on distinct chips in the smp
> box, the CS storm happens.  When they are "bound" to the two cores of
> a single hyperthreaded Xeon in the smp box, the CS storm *does*
> happen.
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ er, meant *NOT HAPPEN*
>
>
>
> I used the taskset command:
> taskset 01 -p <pid for backend of test_run.sql 1>
> taskset 01 -p <pid for backend of test_run.sql 1>
>
> I guess that 0 and 1 are the two cores (pipelines? hyper-threads?) on
> the first Xeon processor in the box.
>
> I did this on RedHat Fedora core1 on an intel motherboard (I'll get
> the part no if it matters)
>
> during storms :  300k CS/sec, 75% idle (on a dual xeon (four core))
> machine (suggesting serializing/sleeping processes)
> no storm:   50k CS/sec,  50% idle (suggesting 2 cpu bound processes)
>
>
> Maybe there's a "hot block" that is bouncing back and forth between
> caches? or maybe the page holding semaphores?
>
> On Apr 19, 2004, at 5:53 PM, Tom Lane wrote:
>
>> I wrote:
>>> Here is a test case.
>>
>> Hmmm ... I've been able to reproduce the CS storm on a dual Athlon,
>> which seems to pretty much let the Xeon per se off the hook.  Anybody
>> got a multiple Opteron to try?  Totally non-Intel CPUs?
>>
>> It would be interesting to see results with non-Linux kernels, too.
>>
>>             regards, tom lane
>>
>
>


Re: Wierd context-switching issue on Xeon

From
Josh Berkus
Date:
Dirk, Tom,

OK, off IRC, I have the following reports:

Linux 2.4.21 or 2.4.20 on dual Pentium III : problem verified
Linux 2.4.21 or 2.4.20 on dual Pentium II : problem cannot be reproduced
Solaris 2.6 on 6 cpu e4500 (using 8 processes) : problem not reproduced

--
-Josh Berkus
 Aglio Database Solutions
 San Francisco


Re: Wierd context-switching issue on Xeon

From
"J. Andrew Rogers"
Date:
I verified the problem on a dual Opteron server.  I temporarily killed the
normal load, so the server was largely idle when the test was run.

Hardware:
2x Opteron 242
Rioworks HDAMA server board
4Gb RAM

OS Kernel:
RedHat9 + XFS


1 proc: 10-15 cs/sec
2 proc: 400,000-420,000 cs/sec



j. andrew rogers




Re: Wierd context-switching issue on Xeon

From
Josh Berkus
Date:
Anjan,

> Quad 2.0GHz XEON with highest load we have seen on the applications, DB
> performing great -

Can you run Tom's test?   It takes a particular pattern of data access to
reproduce the issue.

--
Josh Berkus
Aglio Database Solutions
San Francisco

Re: Wierd context-switching issue on Xeon

From
Bruce Momjian
Date:
Dirk Lutzebäck wrote:
> Dirk Lutzebaeck wrote:
>
> > c) Dual XEON DP, non-bigmem, HT on, E7500 Intel chipset (Supermicro)
> >
> > performs well and I could not observe context switch peaks here (one
> > user active), almost no extra semop calls
>
> Did Tom's test here: with 2 processes I'll reach 200k+ CS with peaks to
> 300k CS. Bummer.. Josh, I don't think you can bash the ServerWorks
> chipset here nor bigmem.

Dave Cramer reproduced the problem on my SuperMicro dual Xeon on BSD/OS.

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073

Re: Wierd context-switching issue on Xeon

From
"Anjan Dave"
Date:
If this helps -

Quad 2.0GHz XEON with highest load we have seen on the applications, DB performing great -

   procs                      memory      swap          io     system      cpu
 r  b  w   swpd   free   buff  cache   si   so    bi    bo   in    cs us sy id
 1  0  0   1616 351820  66144 10813704    0    0     2     0    1     1  0  2  7
 3  0  0   1616 349712  66144 10813736    0    0     8  1634 1362  4650  4  2 95
 0  0  0   1616 347768  66144 10814120    0    0   188  1218 1158  4203  5  1 93
 0  0  1   1616 346596  66164 10814184    0    0     8  1972 1394  4773  4  1 94
 2  0  1   1616 345424  66164 10814272    0    0    20  1392 1184  4197  4  2 94

Around 4k CS/sec
Chipset is Intel ServerWorks GC-HE.
Linux Kernel 2.4.20-28.9bigmem #1 SMP

Thanks,
Anjan


-----Original Message-----
From: Dirk Lutzebäck [mailto:lutzeb@aeccom.com]
Sent: Tuesday, April 20, 2004 10:29 AM
To: Tom Lane; Josh Berkus
Cc: pgsql-performance@postgreSQL.org; Neil Conway
Subject: Re: [PERFORM] Wierd context-switching issue on Xeon

Dirk Lutzebaeck wrote:

> c) Dual XEON DP, non-bigmem, HT on, E7500 Intel chipset (Supermicro)
>
> performs well and I could not observe context switch peaks here (one
> user active), almost no extra semop calls

Did Tom's test here: with 2 processes I'll reach 200k+ CS with peaks to
300k CS. Bummer.. Josh, I don't think you can bash the ServerWorks
chipset here nor bigmem.

Dirk



---------------------------(end of broadcast)---------------------------
TIP 6: Have you searched our list archives?

               http://archives.postgresql.org



Re: Wierd context-switching issue on Xeon

From
Dave Cramer
Date:
I modified the code in s_lock.c to remove the spins

#define SPINS_PER_DELAY         1

and it doesn't exhibit the behaviour

This effectively changes the code to


while(TAS(lock))
    select(10000); // 10ms

Can anyone explain why executing TAS 100 times would increase context
switches ?

Dave
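For reference, the loop Dave modified has roughly the following shape (a simplified sketch of the 7.4-era s_lock() logic, not the verbatim source; the platform-specific TAS macro is stood in for by a compiler builtin):

```c
#include <stdlib.h>
#include <sys/select.h>
#include <sys/time.h>

#define SPINS_PER_DELAY 100     /* TAS attempts before each sleep */
#define NUM_DELAYS      1000    /* give up entirely after this many sleeps */
#define DELAY_USEC      10000   /* 10 ms, a common OS timer resolution */

/* Test-and-set via a compiler builtin (stand-in for the platform TAS
 * macro): returns the previous value, i.e. nonzero if already locked. */
static int
TAS(volatile int *lock)
{
    return __sync_lock_test_and_set(lock, 1);
}

void
s_lock_sketch(volatile int *lock)
{
    int spins = 0;
    int delays = 0;

    while (TAS(lock))
    {
        if (++spins >= SPINS_PER_DELAY)
        {
            struct timeval tv = {0, DELAY_USEC};

            if (++delays > NUM_DELAYS)
                abort();        /* "stuck spinlock" */
            select(0, NULL, NULL, NULL, &tv);   /* sleep; yields the CPU */
            spins = 0;
        }
    }
}
```

With SPINS_PER_DELAY defined as 1, every failed TAS falls straight through to the select() sleep, so lock contention converts directly into voluntary context switches.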


On Tue, 2004-04-20 at 12:59, Josh Berkus wrote:
> Anjan,
>
> > Quad 2.0GHz XEON with highest load we have seen on the applications, DB
> > performing great -
>
> Can you run Tom's test?   It takes a particular pattern of data access to
> reproduce the issue.
--
Dave Cramer
519 939 0336
ICQ # 14675561


Re: Wierd context-switching issue on Xeon

From
Joe Conway
Date:
Joe Conway wrote:
>> In isolation, test_run.sql should do essentially no syscalls at all once
>> it's past the initial ramp-up.  On a machine that's functioning per
>> expectations, multiple copies of test_run show a relatively low rate of
>> semop() calls --- a few per second, at most --- and maybe a delaying
>> select() here and there.

Here's results for 7.4 on a dual Athlon server running fedora core:

CPU states:  cpu    user    nice  system    irq  softirq  iowait    idle
            total   86.0%    0.0%   52.4%   0.0%     0.0%    0.0%   61.2%
            cpu00   37.6%    0.0%   29.7%   0.0%     0.0%    0.0%   32.6%
            cpu01   48.5%    0.0%   22.7%   0.0%     0.0%    0.0%   28.7%

procs                      memory      swap          io     system
  r  b   swpd   free   buff  cache   si   so    bi    bo   in    cs
  1  0 120448  25764  48300 1094576    0    0     0   124  170   187
  1  0 120448  25780  48300 1094576    0    0     0     0  152    89
  2  0 120448  25744  48300 1094580    0    0     0    60  141 78290
  2  0 120448  25752  48300 1094580    0    0     0     0  131 140326
  2  0 120448  25756  48300 1094576    0    0     0    40  122 140100
  2  0 120448  25764  48300 1094584    0    0     0    60  133 136595
  2  0 120448  24284  48300 1094584    0    0     0   200  138 135151

The jump in cs corresponds to starting the query in the second session.

Joe


Re: Wierd context-switching issue on Xeon

From
pginfo
Date:
Hi,

Dual Xeon P4 2.8
linux RedHat AS 3
kernel 2.4.21-4-EL-smp
2 GB ram

I can see the same problem:

procs                      memory      swap          io     system         cpu
 r  b   swpd   free   buff  cache   si   so    bi    bo   in     cs us sy id wa
 1  0      0  96212  61056 1720240    0    0     0     0  101     11 25  0 75  0
 1  0      0  96212  61056 1720240    0    0     0     0  108    139 25  0 75  0
 1  0      0  96212  61056 1720240    0    0     0     0  104    173 25  0 75  0
 1  0      0  96212  61056 1720240    0    0     0     0  102     11 25  0 75  0
 1  0      0  96212  61056 1720240    0    0     0     0  101     11 25  0 75  0
 2  0      0  96204  61056 1720240    0    0     0     0  110  53866 31  4 65  0
 2  0      0  96204  61056 1720240    0    0     0     0  101  83176 41  5 54  0
 2  0      0  96204  61056 1720240    0    0     0     0  102  86050 39  6 55  0
 2  0      0  96204  61056 1720240    0    0     0    49  113  73642 41  5 54  0
 2  0      0  96204  61056 1720240    0    0     0     0  102  84211 40  5 55  0
 2  0      0  96204  61056 1720240    0    0     0     0  101 105165 39  7 54  0
 2  0      0  96204  61056 1720240    0    0     0     0  103  97754 38  6 56  0
 2  0      0  96204  61056 1720240    0    0     0     0  103 113668 36  7 57  0
 2  0      0  96204  61056 1720240    0    0     0     0  103 112003 37  7 56  0

regards,
ivan.


Re: Wierd context-switching issue on Xeon

From
ohp@pyrenet.fr
Date:
How long is this test supposed to run?

I've launched just one for testing; the plan seems horrible. The test is
CPU bound and hasn't finished yet after 17:02 min of CPU time on a dual
XEON 2.6G running UnixWare 7.1.3.

The machine is a Fujitsu-Siemens TX 200 server
 On Mon, 19 Apr 2004, Tom Lane wrote:

> Date: Mon, 19 Apr 2004 20:01:56 -0400
> From: Tom Lane <tgl@sss.pgh.pa.us>
> To: josh@agliodbs.com
> Cc: Joe Conway <mail@joeconway.com>, scott.marlowe <scott.marlowe@ihs.com>,
>      Bruce Momjian <pgman@candle.pha.pa.us>, lutzeb@aeccom.com,
>      pgsql-performance@postgresql.org, Neil Conway <neilc@samurai.com>
> Subject: Re: [PERFORM] Wierd context-switching issue on Xeon
>
> Here is a test case.  To set up, run the "test_setup.sql" script once;
> then launch two copies of the "test_run.sql" script.  (For those of
> you with more than two CPUs, see whether you need one per CPU to make
> trouble, or whether two test_runs are enough.)  Check that you get a
> nestloops-with-index-scans plan shown by the EXPLAIN in test_run.
>
> In isolation, test_run.sql should do essentially no syscalls at all once
> it's past the initial ramp-up.  On a machine that's functioning per
> expectations, multiple copies of test_run show a relatively low rate of
> semop() calls --- a few per second, at most --- and maybe a delaying
> select() here and there.
>
> What I actually see on Josh's client's machine is a context swap storm:
> "vmstat 1" shows CS rates around 170K/sec.  strace'ing the backends
> shows a corresponding rate of semop() syscalls, with a few delaying
> select()s sprinkled in.  top(1) shows system CPU percent of 25-30
> and idle CPU percent of 16-20.
>
> I haven't bothered to check how long the test_run query takes, but if it
> ends while you're still examining the behavior, just start it again.
>
> Note the test case assumes you've got shared_buffers set to at least
> 1000; with smaller values, you may get some I/O syscalls, which will
> probably skew the results.
>
>             regards, tom lane
>
>

--
Olivier PRENANT                    Tel: +33-5-61-50-97-00 (Work)
6, Chemin d'Harraud Turrou           +33-5-61-50-97-01 (Fax)
31190 AUTERIVE                       +33-6-07-63-80-64 (GSM)
FRANCE                          Email: ohp@pyrenet.fr
------------------------------------------------------------------------------
Make your life a dream, make your dream a reality. (St Exupery)

Re: Wierd context-switching issue on Xeon

From
Dirk Lutzebäck
Date:
It is intended to run indefinitely.

Dirk

ohp@pyrenet.fr wrote:

>How long is this test supposed to run?
>
>I've launched just 1 for testing, the plan seems horrible; the test is cpu
>bound and hasn't finished yet after 17:02 min of CPU time, dual XEON 2.6G
>Unixware 713
>
>The machine is a Fujitsu-Siemens TX 200 server
>
>



Re: Wierd context-switching issue on Xeon

From
Dave Cramer
Date:
After some testing if you use the current head code for s_lock.c which
has some mods in it to alleviate this situation, and change
SPINS_PER_DELAY to 10 you can drastically reduce the cs with tom's test.
I am seeing a slight degradation in throughput using pgbench -c 10 -t
1000 but it might be liveable, considering the alternative is unbearable
in some situations.

Can anyone else replicate my results?

Dave
On Wed, 2004-04-21 at 08:10, Dirk Lutzebäck wrote:
> It is intended to run indefinately.
>
> Dirk
>
> ohp@pyrenet.fr wrote:
>
> >How long is this test supposed to run?
> >
> >I've launched just 1 for testing, the plan seems horrible; the test is cpu
> >bound and hasn't finished yet after 17:02 min of CPU time, dual XEON 2.6G
> >Unixware 713
> >
> >The machine is a Fujitsu-Siemens TX 200 server
> >
> >
>
>
>
--
Dave Cramer
519 939 0336
ICQ # 14675561


Re: Wierd context-switching issue on Xeon

From
Josh Berkus
Date:
Dave,

> After some testing if you use the current head code for s_lock.c which
> has some mods in it to alleviate this situation, and change
> SPINS_PER_DELAY to 10 you can drastically reduce the cs with tom's test.
> I am seeing a slight degradation in throughput using pgbench -c 10 -t
> 1000 but it might be liveable, considering the alternative is unbearable
> in some situations.
>
> Can anyone else replicate my results?

Can you produce a patch against 7.4.1?   I'd like to test your fix against a
real-world database.


--
Josh Berkus
Aglio Database Solutions
San Francisco

Re: Wierd context-switching issue on Xeon

From
Paul Tuckfield
Date:
Dave:

Why would test and set increase context switches:
Note that it *does not increase* context switches when the two threads
are on the two cores of a single Xeon processor. (use taskset to force
affinity on linux)

Scenario:
If the two test and set processes are testing and setting the same bit
as each other, then they'll see worst case cache coherency misses.
They'll ping a cache line back and forth between CPUs.  Another case,
might be that they're testing and setting different bits or words, but
those bits or words are always in the same cache line, again causing
worst case cache coherency and misses.  The fact that this doesn't
happen when the threads are bound to the 2 cores of a single Xeon
suggests it's because they're now sharing L1 cache. No pings/bounces.


I wonder do the threads stall so badly when pinging cache lines back
and forth,  that the kernel sees it as an opportunity to put the
process to sleep? or do these worst case misses cause an interrupt?

My question is:  What is it that the two threads are waiting for when
they spin?  Is it exactly the same resource, or two resources that
happen to have test-and-set flags in the same cache line?
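The cache-line ping-pong scenario above can be demonstrated directly (a standalone sketch, not PostgreSQL code): two threads bump two *different* counters that happen to share one 64-byte line, so every write invalidates the line in the other CPU's cache, while a padded layout puts them on separate lines. The final counts are identical either way since each thread owns its counter; only the run time differs, which you can see by timing the two layouts.

```c
#include <pthread.h>
#include <stdint.h>

#define ITERS 1000000L

/* Two independent counters that share a cache line: logically there is
 * no contention, but the hardware bounces the line between CPUs. */
struct unpadded { volatile uint64_t a, b; };

/* The same counters forced onto separate 64-byte lines. */
struct padded { volatile uint64_t a; char pad[56]; volatile uint64_t b; };

static void *
bump(void *arg)
{
    volatile uint64_t *c = arg;

    for (long i = 0; i < ITERS; i++)
        (*c)++;                 /* each thread writes only its own word */
    return NULL;
}

/* Run one thread per counter; time this with clock_gettime() (or just
 * time(1)) for each layout to see the coherency cost. */
static void
run_pair(volatile uint64_t *a, volatile uint64_t *b)
{
    pthread_t t1, t2;

    pthread_create(&t1, NULL, bump, (void *) a);
    pthread_create(&t2, NULL, bump, (void *) b);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
}
```

On a typical SMP box the unpadded layout runs several times slower than the padded one, even though no lock is shared at all.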

On Apr 20, 2004, at 7:41 PM, Dave Cramer wrote:

> I modified the code in s_lock.c to remove the spins
>
> #define SPINS_PER_DELAY         1
>
> and it doesn't exhibit the behaviour
>
> This effectively changes the code to
>
>
> while(TAS(lock))
>     select(10000); // 10ms
>
> Can anyone explain why executing TAS 100 times would increase context
> switches ?
>
> Dave
>
>
> On Tue, 2004-04-20 at 12:59, Josh Berkus wrote:
>> Anjan,
>>
>>> Quad 2.0GHz XEON with highest load we have seen on the applications,
>>> DB
>>> performing great -
>>
>> Can you run Tom's test?   It takes a particular pattern of data
>> access to
>> reproduce the issue.
> --
> Dave Cramer
> 519 939 0336
> ICQ # 14675561
>
>


Re: Wierd context-switching issue on Xeon

From
Tom Lane
Date:
Paul Tuckfield <paul@tuckfield.com> writes:
> I wonder do the threads stall so badly when pinging cache lines back
> and forth,  that the kernel sees it as an opportunity to put the
> process to sleep? or do these worst case misses cause an interrupt?

No; AFAICS the kernel could not even be aware of that behavior.

The context swap storm is happening because of contention at the next
level up (LWLocks rather than spinlocks).  It could be an independent
issue that just happens to be triggered by the same sort of access
pattern.  I put forward a hypothesis that the cache miss storm caused by
the test-and-set ops induces the context swap storm by making the code
more likely to be executing in certain places at certain times ... but
it's only a hypothesis.

Yesterday evening I had pretty well convinced myself that they were
indeed independent issues: profiling on a single-CPU machine was telling
me that the test case I proposed spends over 10% of its time inside
ReadBuffer, which certainly seems like enough to explain a high rate of
contention on the BufMgrLock, without any assumptions about funny
behavior at the hardware level.  However, your report and Dave's suggest
that there really is some linkage.  So I'm still confused.

            regards, tom lane

Re: Wierd context-switching issue on Xeon

From
Dave Cramer
Date:
FYI,

I am doing my testing on non hyperthreading dual athlons.

Also, the test and set is attempting to set the same resource, and not
simply a bit. It's really a lock;xchg in assembler.

Also we are using the PAUSE mnemonic, so we should not be seeing any
cache coherency issues, as the cache is being taken out of the picture
AFAICS ?

Dave
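For reference, the lock;xchg being discussed amounts to something like the following on x86 (a sketch only; the real macros live in s_lock.h, and the non-x86 fallback here is just to keep the sketch compilable elsewhere). Note one caveat to the cache-coherency point: PAUSE relaxes the waiting hyperthread's pipeline, but the locked exchange itself still has to pull the cache line over in exclusive state every time it executes.

```c
typedef unsigned char slock_t;

/* Roughly what the x86 TAS macro does: atomically exchange 1 into the
 * lock byte and return the previous value (nonzero = already held). */
static inline int
tas(volatile slock_t *lock)
{
#if defined(__i386__) || defined(__x86_64__)
    slock_t res = 1;

    __asm__ __volatile__("lock; xchgb %0,%1"
                         : "+q" (res), "+m" (*lock)
                         : : "memory");
    return (int) res;
#else
    return (int) __sync_lock_test_and_set(lock, 1);
#endif
}

/* The PAUSE mnemonic ("rep; nop") hints that this is a spin-wait loop,
 * reducing pipeline flushes on hyperthreaded Xeons; it does not stop
 * the locked exchange above from acquiring the line exclusively. */
static inline void
cpu_pause(void)
{
#if defined(__i386__) || defined(__x86_64__)
    __asm__ __volatile__("rep; nop");
#endif
}
```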

On Wed, 2004-04-21 at 14:19, Paul Tuckfield wrote:
> Dave:
>
> Why would test and set increase context swtches:
> Note that it *does not increase* context swtiches when the two threads
> are on the two cores of a single Xeon processor. (use taskset to force
> affinity on linux)
>
> Scenario:
> If the two test and set processes are testing and setting the same bit
> as each other, then they'll see worst case cache coherency misses.
> They'll ping a cache line back and forth between CPUs.  Another case,
> might be that they're tesing and setting different bits or words, but
> those bits or words are always in the same cache line, again causing
> worst case cache coherency and misses.  The fact that tis doesn't
> happen when the threads are bound to the 2 cores of a single Xeon
> suggests it's because they're now sharing L1 cache. No pings/bounces.
>
>
> I wonder do the threads stall so badly when pinging cache lines back
> and forth,  that the kernel sees it as an opportunity to put the
> process to sleep? or do these worst case misses cause an interrupt?
>
> My question is:  What is it that the two threads waiting for when they
> spin? Is it exactly the same resource, or two resources that happen to
> have test-and-set flags in the same cache line?
>
> On Apr 20, 2004, at 7:41 PM, Dave Cramer wrote:
>
> > I modified the code in s_lock.c to remove the spins
> >
> > #define SPINS_PER_DELAY         1
> >
> > and it doesn't exhibit the behaviour
> >
> > This effectively changes the code to
> >
> >
> > while(TAS(lock))
> >     select(10000); // 10ms
> >
> > Can anyone explain why executing TAS 100 times would increase context
> > switches ?
> >
> > Dave
> >
> >
> > On Tue, 2004-04-20 at 12:59, Josh Berkus wrote:
> >> Anjan,
> >>
> >>> Quad 2.0GHz XEON with highest load we have seen on the applications,
> >>> DB
> >>> performing great -
> >>
> >> Can you run Tom's test?   It takes a particular pattern of data
> >> access to
> >> reproduce the issue.
> > --
> > Dave Cramer
> > 519 939 0336
> > ICQ # 14675561
> >
> >
--
Dave Cramer
519 939 0336
ICQ # 14675561


Re: Wierd context-switching issue on Xeon patch for 7.4.1

From
Dave Cramer
Date:
attached.
--
Dave Cramer
519 939 0336
ICQ # 14675561

Attachment

Re: Wierd context-switching issue on Xeon

From
Tom Lane
Date:
Kenneth Marshall <ktm@is.rice.edu> writes:
> If the context swap storm derives from LWLock contention, maybe using
> a random order to assign buffer locks in buf_init.c would prevent
> simple adjacency of buffer allocation to cause the storm.

Good try, but no cigar ;-).  The test cases I've been looking at take
only shared locks on the per-buffer locks, so that's not where the
context swaps are coming from.  The swaps have to be caused by the
BufMgrLock, because that's the only exclusive lock being taken.

I did try increasing the allocated size of the spinlocks to 128 bytes
to see if it would do anything.  It didn't ...

            regards, tom lane

Re: Wierd context-switching issue on Xeon patch for 7.4.1

From
Tom Lane
Date:
Dave Cramer <pg@fastcrypt.com> writes:
> diff -c -r1.16 s_lock.c
> *** backend/storage/lmgr/s_lock.c    8 Aug 2003 21:42:00 -0000    1.16
> --- backend/storage/lmgr/s_lock.c    21 Apr 2004 20:27:34 -0000
> ***************
> *** 76,82 ****
>        * The select() delays are measured in centiseconds (0.01 sec) because 10
>        * msec is a common resolution limit at the OS level.
>        */
> ! #define SPINS_PER_DELAY        100
>   #define NUM_DELAYS            1000
>   #define MIN_DELAY_CSEC        1
>   #define MAX_DELAY_CSEC        100
> --- 76,82 ----
>        * The select() delays are measured in centiseconds (0.01 sec) because 10
>        * msec is a common resolution limit at the OS level.
>        */
> ! #define SPINS_PER_DELAY        10
>   #define NUM_DELAYS            1000
>   #define MIN_DELAY_CSEC        1
>   #define MAX_DELAY_CSEC        100


As far as I can tell, this does reduce the rate of semop's
significantly, but it does so by bringing the overall processing rate
to a crawl :-(.  I see 97% CPU idle time when using this patch.
I believe what is happening is that the select() delay in s_lock.c is
being hit frequently because the spin loop isn't allowed to run long
enough to let the other processor get out of the spinlock.

            regards, tom lane

Re: Wierd context-switching issue on Xeon patch for 7.4.1

From
Josh Berkus
Date:
Tom,

> As far as I can tell, this does reduce the rate of semop's
> significantly, but it does so by bringing the overall processing rate
> to a crawl :-(.  I see 97% CPU idle time when using this patch.
> I believe what is happening is that the select() delay in s_lock.c is
> being hit frequently because the spin loop isn't allowed to run long
> enough to let the other processor get out of the spinlock.

Also, I tested it on production data, and it reduces the CSes by about 40%.
An improvement, but not a magic bullet.

--
Josh Berkus
Aglio Database Solutions
San Francisco

Re: Wierd context-switching issue on Xeon patch for 7.4.1

From
Tom Lane
Date:
Dave Cramer <pg@fastcrypt.com> writes:
> I tried increasing the NUM_SPINS to 1000 and it works better.

Doesn't surprise me.  The value of 100 is about right on the assumption
that the spinlock instruction per se is not too much more expensive than
any other instruction.  What I was seeing from oprofile suggested that
the spinlock instruction cost about 100x more than an ordinary
instruction :-( ... so maybe 200 or so would be good on a Xeon.

> This is certainly heading in the right direction ? Although it looks
> like it is highly dependent on the system you are running on.

Yeah.  I don't know a reasonable way to tune this number automatically
for particular systems ... but at the very least we'd need to find a way
to distinguish uniprocessor from multiprocessor, because on a
uniprocessor the optimal value is surely 1.

            regards, tom lane

Re: Wierd context-switching issue on Xeon patch for 7.4.1

From
Christopher Kings-Lynne
Date:
> Yeah.  I don't know a reasonable way to tune this number automatically
> for particular systems ... but at the very least we'd need to find a way
> to distinguish uniprocessor from multiprocessor, because on a
> uniprocessor the optimal value is surely 1.

 From TODO:

* Add code to detect an SMP machine and handle spinlocks accordingly
from distributted.net, http://www1.distributed.net/source, in
client/common/cpucheck.cpp

Chris


Re: Wierd context-switching issue on Xeon patch for 7.4.1

From
Bruce Momjian
Date:
Tom Lane wrote:
> Dave Cramer <pg@fastcrypt.com> writes:
> > I tried increasing the NUM_SPINS to 1000 and it works better.
>
> Doesn't surprise me.  The value of 100 is about right on the assumption
> that the spinlock instruction per se is not too much more expensive than
> any other instruction.  What I was seeing from oprofile suggested that
> the spinlock instruction cost about 100x more than an ordinary
> instruction :-( ... so maybe 200 or so would be good on a Xeon.
>
> > This is certainly heading in the right direction ? Although it looks
> > like it is highly dependent on the system you are running on.
>
> Yeah.  I don't know a reasonable way to tune this number automatically
> for particular systems ... but at the very least we'd need to find a way
> to distinguish uniprocessor from multiprocessor, because on a
> uniprocessor the optimal value is surely 1.

Have you looked at the code pointed to by our TODO item:

    * Add code to detect an SMP machine and handle spinlocks accordingly
      from distributted.net, http://www1.distributed.net/source,
      in client/common/cpucheck.cpp

For BSDOS it has:

    #if (CLIENT_OS == OS_FREEBSD) || (CLIENT_OS == OS_BSDOS) || \
        (CLIENT_OS == OS_OPENBSD) || (CLIENT_OS == OS_NETBSD)
    { /* comment out if inappropriate for your *bsd - cyp (25/may/1999) */
      int ncpus; size_t len = sizeof(ncpus);
      int mib[2]; mib[0] = CTL_HW; mib[1] = HW_NCPU;
      if (sysctl( &mib[0], 2, &ncpus, &len, NULL, 0 ) == 0)
      //if (sysctlbyname("hw.ncpu", &ncpus, &len, NULL, 0 ) == 0)
        cpucount = ncpus;
    }

and I can confirm that on my computer it works:

    hw.ncpu = 2
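On Linux and most modern Unixes, the same information is available without per-platform sysctl() incantations (a sketch; _SC_NPROCESSORS_ONLN is a widely supported extension rather than base POSIX, hence the guard):

```c
#include <unistd.h>

/* Count online CPUs via sysconf(); fall back to assuming a
 * uniprocessor on platforms where the query is unavailable. */
long
count_cpus(void)
{
#ifdef _SC_NPROCESSORS_ONLN
    long n = sysconf(_SC_NPROCESSORS_ONLN);

    return (n > 0) ? n : 1;
#else
    return 1;                   /* unknown platform: assume uniprocessor */
#endif
}
```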

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073

Re: Wierd context-switching issue on Xeon patch for 7.4.1

From
Tom Lane
Date:
Bruce Momjian <pgman@candle.pha.pa.us> writes:
> For BSDOS it has:

>     #if (CLIENT_OS == OS_FREEBSD) || (CLIENT_OS == OS_BSDOS) || \
>         (CLIENT_OS == OS_OPENBSD) || (CLIENT_OS == OS_NETBSD)
>     { /* comment out if inappropriate for your *bsd - cyp (25/may/1999) */
>       int ncpus; size_t len = sizeof(ncpus);
>       int mib[2]; mib[0] = CTL_HW; mib[1] = HW_NCPU;
>       if (sysctl( &mib[0], 2, &ncpus, &len, NULL, 0 ) == 0)
>       //if (sysctlbyname("hw.ncpu", &ncpus, &len, NULL, 0 ) == 0)
>         cpucount = ncpus;
>     }

Multiplied by how many platforms?  Ewww...

I was wondering about some sort of dynamic adaptation, roughly along the
lines of "whenever a spin loop successfully gets the lock after
spinning, decrease the allowed loop count by one; whenever we fail to
get the lock after spinning, increase by 100; if the loop count reaches,
say, 10000, decide we are on a uniprocessor and irreversibly set it to
1."  As written this would tend to incur a select() delay once per
hundred spinlock acquisitions, which is way too much, but I think we
could make it work with a sufficiently slow adaptation rate.  The tricky
part is that a slow adaptation rate means we can't have every backend
figuring this out for itself --- the right value would have to be
maintained globally, and I'm not sure how to do that without adding a
lot of overhead.

            regards, tom lane
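The adaptation rule sketched above, written as straight-line code (a sketch of the policy only; the hard part identified in the message, maintaining the value globally across backends without extra locking, is deliberately not addressed here):

```c
#define DEFAULT_SPINS   100
#define SPINS_CAP       10000

static int spins_per_delay = DEFAULT_SPINS; /* would live in shared memory */
static int pinned_to_one = 0;   /* set once we decide "uniprocessor" */

static void
adapt_spins(int acquired_while_spinning)
{
    if (pinned_to_one)
        return;                 /* the uniprocessor verdict is irreversible */

    if (acquired_while_spinning)
    {
        if (spins_per_delay > 1)
            spins_per_delay--;  /* spinning paid off: shorten the loop */
    }
    else
    {
        spins_per_delay += 100; /* we had to select(): lengthen it */
        if (spins_per_delay >= SPINS_CAP)
        {
            spins_per_delay = 1;    /* conclude: uniprocessor */
            pinned_to_one = 1;
        }
    }
}
```

As the message notes, naively applying this per-spinlock-acquisition would incur a select() delay roughly every hundred acquisitions; a usable version would need a much slower, globally maintained adaptation rate.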

Re: Wierd context-switching issue on Xeon

From
Tom Lane
Date:
Paul Tuckfield <paul@tuckfield.com> writes:
>> I used the taskset command:
>> taskset 01 -p <pid for backend of test_run.sql 1>
>> taskset 01 -p <pid for backend of test_run.sql 1>
>>
>> I guess that 0 and 1 are the two cores (pipelines? hyper-threads?) on
>> the first Xeon processor in the box.

AFAICT, what you've actually done here is to bind both backends to the
first logical processor of the first Xeon.  If you'd used 01 and 02
as the affinity masks then you'd have bound them to the two cores of
that Xeon, but what you actually did simply reduces the system to a
uniprocessor.  In that situation the context swap rate will be normally
one swap per scheduler timeslice, and at worst two swaps per timeslice
(if a process is swapped away from while it holds a lock the other one
wants).  It doesn't prove a lot about our SMP problem though.

I don't have access to a Xeon with both taskset and hyperthreading
enabled, so I can't check what happens when you do the taskset correctly
... could you retry?

            regards, tom lane
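For anyone repeating the experiment from a program rather than the taskset command line: the affinity mask is a bitmask with bit N selecting logical CPU N (so 01 and 02 are different CPUs, 01 used twice is the same one). A Linux-specific sketch of the equivalent call, using sched_setaffinity() (pid 0 means the calling process):

```c
#define _GNU_SOURCE
#include <sched.h>
#include <sys/types.h>

/* Bind the given process to a single logical CPU; returns 0 on
 * success, -1 on failure, like sched_setaffinity() itself. */
static int
pin_to_cpu(pid_t pid, int cpu)
{
    cpu_set_t set;

    CPU_ZERO(&set);
    CPU_SET(cpu, &set);         /* one bit set: one logical CPU */
    return sched_setaffinity(pid, sizeof(set), &set);
}
```

Binding the two backends to cpu 0 and cpu 1 respectively reproduces the "01 and 02" case Tom describes; binding both to cpu 0 reproduces the accidental uniprocessor case.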

Re: Wierd context-switching issue on Xeon patch for 7.4.1

From
Dave Cramer
Date:
Yeah, I did some more testing myself, and actually get better numbers
with increasing spins per delay to 1000, but my suspicion is that it is
highly dependent on finding the right delay for the processor you are
on.

My hypothesis is that if you spin for approximately as long as, or
longer than, the average time a holder spends in the shared resource,
then this should reduce cs.

Certainly more ideas are required here.

Dave
On Wed, 2004-04-21 at 22:35, Tom Lane wrote:
> Dave Cramer <pg@fastcrypt.com> writes:
> > diff -c -r1.16 s_lock.c
> > *** backend/storage/lmgr/s_lock.c    8 Aug 2003 21:42:00 -0000    1.16
> > --- backend/storage/lmgr/s_lock.c    21 Apr 2004 20:27:34 -0000
> > ***************
> > *** 76,82 ****
> >        * The select() delays are measured in centiseconds (0.01 sec) because 10
> >        * msec is a common resolution limit at the OS level.
> >        */
> > ! #define SPINS_PER_DELAY        100
> >   #define NUM_DELAYS            1000
> >   #define MIN_DELAY_CSEC        1
> >   #define MAX_DELAY_CSEC        100
> > --- 76,82 ----
> >        * The select() delays are measured in centiseconds (0.01 sec) because 10
> >        * msec is a common resolution limit at the OS level.
> >        */
> > ! #define SPINS_PER_DELAY        10
> >   #define NUM_DELAYS            1000
> >   #define MIN_DELAY_CSEC        1
> >   #define MAX_DELAY_CSEC        100
>
>
> As far as I can tell, this does reduce the rate of semop's
> significantly, but it does so by bringing the overall processing rate
> to a crawl :-(.  I see 97% CPU idle time when using this patch.
> I believe what is happening is that the select() delay in s_lock.c is
> being hit frequently because the spin loop isn't allowed to run long
> enough to let the other processor get out of the spinlock.
>
>             regards, tom lane
>
>
>
--
Dave Cramer
519 939 0336
ICQ # 14675561


Re: Wierd context-switching issue on Xeon patch for 7.4.1

From
Dave Cramer
Date:
More data....

On a dual xeon with HTT enabled:

I tried increasing the NUM_SPINS to 1000 and it works better.

SPINS_PER_DELAY    CS      ID (idle)    pgbench

100                250K    59%          230 TPS
1000               125K    55%          228 TPS

This is certainly heading in the right direction ? Although it looks
like it is highly dependent on the system you are running on.

--dc--



On Wed, 2004-04-21 at 22:53, Josh Berkus wrote:
> Tom,
>
> > As far as I can tell, this does reduce the rate of semop's
> > significantly, but it does so by bringing the overall processing rate
> > to a crawl :-(.  I see 97% CPU idle time when using this patch.
> > I believe what is happening is that the select() delay in s_lock.c is
> > being hit frequently because the spin loop isn't allowed to run long
> > enough to let the other processor get out of the spinlock.
>
> Also, I tested it on production data, and it reduces the CSes by about 40%.
> An improvement, but not a magic bullet.
--
Dave Cramer
519 939 0336
ICQ # 14675561


Re: Wierd context-switching issue on Xeon patch for 7.4.1

From
Tom Lane
Date:
Dave Cramer <pg@fastcrypt.com> writes:
> My hypothesis is that if you spin approximately the same or more time
> than the average time it takes to get finished with the shared resource
> then this should reduce cs.

The only thing we use spinlocks for nowadays is to protect LWLocks, so
the "average time" involved is fairly small and stable --- or at least
that was the design intention.  What we seem to be seeing is that on SMP
machines, cache coherency issues cause the TAS step itself to be
expensive and variable.  However, in the experiments I did, strace'ing
showed that actual spin timeouts (manifested by the execution of a
delaying select()) weren't actually that common; the big source of
context switches is semop(), which indicates contention at the LWLock
level rather than the spinlock level.  So while tuning the spinlock
limit count might be a useful thing to do in general, I think it will
have only negligible impact on the particular problems we're discussing
in this thread.

            regards, tom lane

Re: Wierd context-switching issue on Xeon patch for 7.4.1

From
Josh Berkus
Date:
Tom,

> The tricky
> part is that a slow adaptation rate means we can't have every backend
> figuring this out for itself --- the right value would have to be
> maintained globally, and I'm not sure how to do that without adding a
> lot of overhead.

This may be a moot point, since you've stated that changing the loop timing
won't solve the problem, but what about making the test part of make?   I
don't think too many systems are going to change processor architectures once
in production, and those that do can be told to re-compile.

--
Josh Berkus
Aglio Database Solutions
San Francisco

Re: Wierd context-switching issue on Xeon patch for 7.4.1

From
Josh Berkus
Date:
Tom,

> Having to recompile to run on single- vs dual-processor machines doesn't
> seem like it would fly.

Oh, I don't know.  Many applications require compiling for a target
architecture; SQL Server, for example, won't use a 2nd processor without
re-installation.   I'm not sure about Oracle.

It certainly wasn't too long ago that Linux gurus were espousing
re-compiling the kernel for the machine.

And it's not like they would *have* to re-compile to use PostgreSQL after
adding an additional processor.  Just if they wanted to maximize the
performance benefit.

Also, this is a fairly rare circumstance, I think; to judge by my clients,
once a database server is in production nobody touches the hardware.

--
-Josh Berkus
 Aglio Database Solutions
 San Francisco


Re: Wierd context-switching issue on Xeon patch for 7.4.1

From
Tom Lane
Date:
Josh Berkus <josh@agliodbs.com> writes:
> This may be a moot point, since you've stated that changing the loop timing
> won't solve the problem, but what about making the test part of make?   I
> don't think too many systems are going to change processor architectures once
> in production, and those that do can be told to re-compile.

Having to recompile to run on single- vs dual-processor machines doesn't
seem like it would fly.

            regards, tom lane

Re: Wierd context-switching issue on Xeon patch for 7.4.1

From
Rod Taylor
Date:
On Thu, 2004-04-22 at 13:55, Tom Lane wrote:
> Josh Berkus <josh@agliodbs.com> writes:
> > This may be a moot point, since you've stated that changing the loop timing
> > won't solve the problem, but what about making the test part of make?   I
> > don't think too many systems are going to change processor architectures once
> > in production, and those that do can be told to re-compile.
>
> Having to recompile to run on single- vs dual-processor machines doesn't
> seem like it would fly.

Is it something the postmaster could quickly determine and set a global
during the startup cycle?



Re: Wierd context-switching issue on Xeon

From
"Anjan Dave"
Date:
Tested the SQL on a quad 2.0GHz Xeon with 8GB RAM:

During the first run, the CS shot up to more than 100k, and was randomly high/low.
A second process made it consistently high, 100k+.
A third brought it down to an average of 80-90k.
A fourth brought it down to an average of 50-60k/s.

By cancelling the queries one by one, the CS started going up again.

8 logical CPUs in 'top', none of them too busy; the load average stood around 2 the whole time.
 
Thanks.
Anjan
 
-----Original Message----- 
From: Josh Berkus [mailto:josh@agliodbs.com] 
Sent: Tue 4/20/2004 12:59 PM 
To: Anjan Dave; Dirk Lutzebäck; Tom Lane 
Cc: pgsql-performance@postgreSQL.org; Neil Conway 
Subject: Re: [PERFORM] Wierd context-switching issue on Xeon



    Anjan,
    
    > Quad 2.0GHz XEON with highest load we have seen on the applications, DB
    > performing great -
    
    Can you run Tom's test?   It takes a particular pattern of data access to
    reproduce the issue.
    
    --
    Josh Berkus
    Aglio Database Solutions
    San Francisco
    
    ---------------------------(end of broadcast)---------------------------
    TIP 9: the planner will ignore your desire to choose an index scan if your
          joining column's datatypes do not match
    


Attachment

Re: Wierd context-switching issue on Xeon patch for 7.4.1

From
Andrew McMillan
Date:
On Thu, 2004-04-22 at 10:37 -0700, Josh Berkus wrote:
> Tom,
>
> > The tricky
> > part is that a slow adaptation rate means we can't have every backend
> > figuring this out for itself --- the right value would have to be
> > maintained globally, and I'm not sure how to do that without adding a
> > lot of overhead.
>
> This may be a moot point, since you've stated that changing the loop timing
> won't solve the problem, but what about making the test part of make?   I
> don't think too many systems are going to change processor architectures once
> in production, and those that do can be told to re-compile.

Sure they do - PostgreSQL is regularly provided as a pre-compiled
distribution.  I haven't compiled PostgreSQL for years, and we have it
running on dozens of machines, some SMP, some not, but most running
Debian Linux.

Even having a compiler _installed_ on one of our client's database
servers would usually be considered against security procedures, and
would get a black mark when the auditors came through.

Regards,
                    Andrew McMillan
-------------------------------------------------------------------------
Andrew @ Catalyst .Net .NZ  Ltd,  PO Box 11-053,  Manners St,  Wellington
WEB: http://catalyst.net.nz/             PHYS: Level 2, 150-154 Willis St
DDI: +64(4)916-7201       MOB: +64(21)635-694      OFFICE: +64(4)499-2267
                     Planning an election?  Call us!
-------------------------------------------------------------------------


Re: Wierd context-switching issue on Xeon

From
"Magnus Naeslund(t)"
Date:
Tom Lane wrote:
>
> Hmmm ... I've been able to reproduce the CS storm on a dual Athlon,
> which seems to pretty much let the Xeon per se off the hook.  Anybody
> got a multiple Opteron to try?  Totally non-Intel CPUs?
>
> It would be interesting to see results with non-Linux kernels, too.
>
>             regards, tom lane

I also tested on a dual Athlon MP Tyan Thunder motherboard (2x MP 2800+,
2.5GB memory), and got the same high numbers.
I then ran with kernel 2.6.5; it lowered them a little, but there's still
some ping-pong effect here. I wonder if this is some effect of the
scheduler, maybe the scheduling frequency alone (100HZ vs 1000HZ).

It would be interesting to see what a FUTEX-style locking implementation
would give on a 2.6 kernel; as I understood it, that could be made to work
cross-process with some work.

The first file attached is kernel 2.4 running one process, then starting
up the other one.
Same with the second file, but with kernel 2.6...

Regards
Magnus
procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in    cs us sy id wa
 1  0      0 1828408  27852 528852    0    0     0     0  317   557 50  0 50  0
 1  0      0 1828408  27852 528852    0    0     0     0  293   491 50  0 49  0
 1  0      0 1828400  27860 528852    0    0     0    16  399   709 50  0 50  0
 1  0      0 1828400  27860 528852    0    0     0     0  350   593 50  0 49  0
 2  0      0 1828400  27860 528852    0    0     0     0  349   608 50  0 50  0
 1  0      0 1828400  27860 528852    0    0     0     0  109   412 50  0 50  0
 1  0      0 1828400  27860 528852    0    0     0     0  101    92 50  0 50  0
 1  0      0 1828392  27868 528852    0    0     0    16  104    96 50  0 50  0
 1  0      0 1828392  27868 528852    0    0     0     0  101   103 50  0 50  0
 2  0      0 1827408  27892 528852    0    0     8    48  113 61197 45  9 46  0
 2  0      0 1827408  27892 528852    0    0     0     0  101 167237 41 27 32  0
procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in    cs us sy id wa
 4  0      0 1827408  27892 528852    0    0     0     0  101 166145 39 25 36  0
 2  0      0 1827400  27900 528852    0    0     0    48  105 149406 42 19 40  0
 3  0      0 1827400  27900 528852    0    0     0     0  101 157559 43 26 32  0
 2  0      0 1827400  27900 528852    0    0     0     0  101 163813 46 24 30  0
 2  0      0 1827400  27900 528852    0    0     0     0  101 156872 44 26 30  0
 2  0      0 1827400  27900 528852    0    0     0     0  103 160722 45 28 28  0
 2  0      0 1827392  27908 528852    0    0     0    16  104 158644 41 23 37  0
 3  0      0 1827392  27908 528852    0    0     0     0  101 157534 42 25 33  0
 2  0      0 1827392  27908 528852    0    0     0     0  101 160007 37 28 35  0
 3  0      0 1827392  27908 528852    0    0     0     0  101 161852 45 24 31  0
 3  0      0 1827392  27908 528852    0    0     0     0  101 161616 42 25 33  0
 2  0      0 1827392  27916 528852    0    0     0    68  114 152144 44 25 31  0
 2  0      0 1827384  27916 528852    0    0     0     0  101 156485 35 28 37  0
procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in    cs us sy id wa
 1  0      0 2436044   8844  90028    0    0     0    16 1010   235 50  0 50  0
 1  0      0 2436108   8844  90028    0    0     0     0 1024   404 50  0 50  0
 1  0      0 2436108   8844  90028    0    0     0     0 1008   199 50  0 50  0
 1  0      0 2436108   8844  90028    0    0     0     0 1017   272 50  0 50  0
 1  0      0 2436108   8844  90028    0    0     0     0 1013   253 50  0 50  0
 1  1      0 2436108   8852  90020    0    0     0    16 1019   282 51  0 49  1
 2  0      0 2435068   8852  90020    0    0     0     0 1005 23929 45  4 50  0
 2  0      0 2435068   8852  90020    0    0     0    20 1008 95501 33 14 53  0
procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in    cs us sy id wa
 3  0      0 2435068   8852  90020    0    0     0     0 1002 103940 35 15 50  0
 0  0      0 2435068   8852  90020    0    0     0     0 1003 104343 32 16 51  0
 2  0      0 2435068   8860  90080    0    0     0    52 1006 102477 34 16 51  1
 2  0      0 2435068   8860  90080    0    0     0     0 1002 92809 31 14 54  0
 2  0      0 2435068   8860  90080    0    0     0     0 1002 100498 37 14 49  0
 1  0      0 2435068   8860  90080    0    0     0     0 1002 108130 35 16 49  0
 0  0      0 2435068   8860  90080    0    0     0     0 1002 94045 33 14 54  0
 0  0      0 2435004   8868  90072    0    0     0    16 1005 104380 34 15 52  0
 2  0      0 2435004   8868  90072    0    0     0     0 1002 100696 36 14 50  0
 2  0      0 2435068   8868  90072    0    0     0     0 1002 98289 31 14 54  0
 0  0      0 2435068   8868  90072    0    0     0     0 1002 97287 31 14 55  0
 0  0      0 2435068   8868  90072    0    0     0     0 1002 92787 34 14 53  0
 0  0      0 2435068   8876  90064    0    0     0    16 1005 98568 32 16 52  1
 2  0      0 2435068   8876  90064    0    0     0     0 1003 107104 37 16 47  0

Re: Wierd context-switching issue on Xeon

From
Kenneth Marshall
Date:
On Wed, Apr 21, 2004 at 02:51:31PM -0400, Tom Lane wrote:
> The context swap storm is happening because of contention at the next
> level up (LWLocks rather than spinlocks).  It could be an independent
> issue that just happens to be triggered by the same sort of access
> pattern.  I put forward a hypothesis that the cache miss storm caused by
> the test-and-set ops induces the context swap storm by making the code
> more likely to be executing in certain places at certain times ... but
> it's only a hypothesis.
>
If the context swap storm derives from LWLock contention, maybe using
a random order to assign buffer locks in buf_init.c would prevent
simple adjacency of buffer allocation to cause the storm. Just offsetting
the assignment by the cacheline size should work. I notice that when
initializing the buffers in shared memory, both the buf->meta_data_lock
and the buf->cntx_lock are immediately adjacent in memory. I am not
familiar enough with the flow through postgres to see if there could
be "fighting" for those two locks. If so, offsetting those by the cache
line size would also stop the context swap storm.

--Ken

Re: Wierd context-switching issue on Xeon

From
Josh Berkus
Date:
Magnus,

> It would be interesting to see what a locking implementation ala FUTEX
> style would give on an 2.6 kernel, as i understood it that would work
> cross process with some work.

I'm working on testing a FUTEX patch, but am having some trouble with it.
Will let you know the results ....

--
-Josh Berkus
 Aglio Database Solutions
 San Francisco


Re: Wierd context-switching issue on Xeon patch for 7.4.1

From
Josh Berkus
Date:
Dave,

> Yeah, I did some more testing myself, and actually get better numbers
> with increasing spins per delay to 1000, but my suspicion is that it is
> highly dependent on finding the right delay for the processor you are
> on.

Well, it certainly didn't help here:

procs                      memory      swap          io     system         cpu
 r  b   swpd   free   buff  cache   si   so    bi    bo   in    cs us sy id wa
 2  0      0 14870744 123872 1129912    0    0     0     0 1027 187341 48 27 26  0
 2  0      0 14869912 123872 1129912    0    0     0    48 1030 126490 65 18 16  0
 2  0      0 14867032 123872 1129912    0    0     0     0 1021 106046 72 16 12  0
 2  0      0 14869912 123872 1129912    0    0     0     0 1025 90256 76 14 10  0
 2  0      0 14870424 123872 1129912    0    0     0     0 1022 135249 63 22 16  0
 2  0      0 14872664 123872 1129912    0    0     0     0 1023 131111 63 20 17  0
 1  0      0 14871128 123872 1129912    0    0     0    48 1024 155728 57 22 20  0
 2  0      0 14871128 123872 1129912    0    0     0     0 1028 189655 49 29 22  0
 2  0      0 14871064 123872 1129912    0    0     0     0 1018 190744 48 29 23  0
 2  0      0 14871064 123872 1129912    0    0     0     0 1027 186812 51 26 23  0


--
-Josh Berkus
 Aglio Database Solutions
 San Francisco


Re: Wierd context-switching issue on Xeon patch for 7.4.1

From
Dave Cramer
Date:
Are you testing this with Tom's code?  You need to do a baseline
measurement with 10 and then increase it; you will still get lots of cs,
but it will be less.

Dave
On Mon, 2004-04-26 at 20:03, Josh Berkus wrote:
> Dave,
>
> > Yeah, I did some more testing myself, and actually get better numbers
> > with increasing spins per delay to 1000, but my suspicion is that it is
> > highly dependent on finding the right delay for the processor you are
> > on.
>
> Well, it certainly didn't help here:
>
> procs                      memory      swap          io     system         cpu
>  r  b   swpd   free   buff  cache   si   so    bi    bo   in    cs us sy id wa
>  2  0      0 14870744 123872 1129912    0    0     0     0 1027 187341 48 27 26  0
>  2  0      0 14869912 123872 1129912    0    0     0    48 1030 126490 65 18 16  0
>  2  0      0 14867032 123872 1129912    0    0     0     0 1021 106046 72 16 12  0
>  2  0      0 14869912 123872 1129912    0    0     0     0 1025 90256 76 14 10  0
>  2  0      0 14870424 123872 1129912    0    0     0     0 1022 135249 63 22 16  0
>  2  0      0 14872664 123872 1129912    0    0     0     0 1023 131111 63 20 17  0
>  1  0      0 14871128 123872 1129912    0    0     0    48 1024 155728 57 22 20  0
>  2  0      0 14871128 123872 1129912    0    0     0     0 1028 189655 49 29 22  0
>  2  0      0 14871064 123872 1129912    0    0     0     0 1018 190744 48 29 23  0
>  2  0      0 14871064 123872 1129912    0    0     0     0 1027 186812 51 26 23  0
--
Dave Cramer
519 939 0336
ICQ # 14675561


Re: Wierd context-switching issue on Xeon patch for 7.4.1

From
Josh Berkus
Date:
Dave,

> Are you testing this with Tom's code, you need to do a baseline
> measurement with 10 and then increase it, you will still get lots of cs,
> but it will be less.

No, that was just a test of 1000 straight up.    Tom outlined a method, but I
didn't see any code that would help me find a better level, other than just
trying each +100 increase one at a time.   This would take days of testing
...
--
Josh Berkus
Aglio Database Solutions
San Francisco

Re: Wierd context-switching issue on Xeon patch for 7.4.1

From
Dave Cramer
Date:
Josh,

I think you can safely increase by orders of magnitude here, instead of
by +100.  My wild-ass guess is that the sweet spot is where the spin time
approximately matches the time it takes to consume the resource; so if
you have a really fast machine, the spin count should be higher.

Also you have to take into consideration your memory bus speed, with the
pause instruction inserted in the loop the timing is now dependent on
memory speed.

But... you need a baseline first.

Dave
On Tue, 2004-04-27 at 14:05, Josh Berkus wrote:
> Dave,
>
> > Are you testing this with Tom's code, you need to do a baseline
> > measurement with 10 and then increase it, you will still get lots of cs,
> > but it will be less.
>
> No, that was just a test of 1000 straight up.    Tom outlined a method, but I
> didn't see any code that would help me find a better level, other than just
> trying each +100 increase one at a time.   This would take days of testing
> ...
--
Dave Cramer
519 939 0336
ICQ # 14675561


Re: Wierd context-switching issue on Xeon patch for 7.4.1

From
Josh Berkus
Date:
Dave,

> But... you need a baseline first.

A baseline on CS?   I have that ....

--
-Josh Berkus
 Aglio Database Solutions
 San Francisco


Re: Wierd context-switching issue on Xeon

From
Robert Creager
Date:
When grilled further on (Wed, 21 Apr 2004 10:29:43 -0700),
Josh Berkus <josh@agliodbs.com> confessed:

> Dave,
>
> > After some testing if you use the current head code for s_lock.c which
> > has some mods in it to alleviate this situation, and change
> > SPINS_PER_DELAY to 10 you can drastically reduce the cs with tom's test.
> > I am seeing a slight degradation in throughput using pgbench -c 10 -t
> > 1000 but it might be liveable, considering the alternative is unbearable
> > in some situations.
> >
> > Can anyone else replicate my results?
>
> Can you produce a patch against 7.4.1?   I'd like to test your fix against a
> real-world database.

I would like to see the same, as I have a system that exhibits the same behavior
on a production db that's running 7.4.1.

Cheers,
Rob


--
 18:55:22 up  1:40,  4 users,  load average: 2.00, 2.04, 2.00
Linux 2.6.5-01 #7 SMP Fri Apr 16 22:45:31 MDT 2004

Attachment

Re: Wierd context-switching issue on Xeon

From
ohp@pyrenet.fr
Date:
Hi

I'd LOVE to contribute on this, but I don't have vmstat and I'm not running
Linux.

How can I help?
Regards
On Wed, 28 Apr 2004, Robert Creager wrote:

> Date: Wed, 28 Apr 2004 18:57:53 -0600
> From: Robert Creager <Robert_Creager@LogicalChaos.org>
> To: Josh Berkus <josh@agliodbs.com>
> Cc: pg@fastcrypt.com, Dirk_Lutzebäck <lutzeb@aeccom.com>, ohp@pyrenet.fr,
>      Tom Lane <tgl@sss.pgh.pa.us>, Joe Conway <mail@joeconway.com>,
>      scott.marlowe <scott.marlowe@ihs.com>,
>      Bruce Momjian <pgman@candle.pha.pa.us>, pgsql-performance@postgresql.org,
>      Neil Conway <neilc@samurai.com>
> Subject: Re: [PERFORM] Wierd context-switching issue on Xeon
>
> When grilled further on (Wed, 21 Apr 2004 10:29:43 -0700),
> Josh Berkus <josh@agliodbs.com> confessed:
>
> > Dave,
> >
> > > After some testing if you use the current head code for s_lock.c which
> > > has some mods in it to alleviate this situation, and change
> > > SPINS_PER_DELAY to 10 you can drastically reduce the cs with tom's test.
> > > I am seeing a slight degradation in throughput using pgbench -c 10 -t
> > > 1000 but it might be liveable, considering the alternative is unbearable
> > > in some situations.
> > >
> > > Can anyone else replicate my results?
> >
> > Can you produce a patch against 7.4.1?   I'd like to test your fix against a
> > real-world database.
>
> I would like to see the same, as I have a system that exhibits the same behavior
> on a production db that's running 7.4.1.
>
> Cheers,
> Rob
>
>
>

--
Olivier PRENANT                    Tel: +33-5-61-50-97-00 (Work)
6, Chemin d'Harraud Turrou           +33-5-61-50-97-01 (Fax)
31190 AUTERIVE                       +33-6-07-63-80-64 (GSM)
FRANCE                          Email: ohp@pyrenet.fr
------------------------------------------------------------------------------
Make your life a dream, make your dream a reality. (St Exupery)

Re: Wierd context-switching issue on Xeon

From
Josh Berkus
Date:
Rob,

> I would like to see the same, as I have a system that exhibits the same
behavior
> on a production db that's running 7.4.1.

If you checked the thread follow-ups,  you'd see that *decreasing*
spins_per_delay was not beneficial.   Instead, try increasing them, one step
at a time:

(take baseline measurement at 100)
250
500
1000
1500
2000
3000
5000

... until you find an optimal level.   Then report the results to us!

--
-Josh Berkus
 Aglio Database Solutions
 San Francisco


Re: Wierd context-switching issue on Xeon

From
Robert Creager
Date:
When grilled further on (Thu, 29 Apr 2004 11:21:51 -0700),
Josh Berkus <josh@agliodbs.com> confessed:

> spins_per_delay was not beneficial.   Instead, try increasing them, one step
> at a time:
>
> (take baseline measurement at 100)
> 250
> 500
> 1000
> 1500
> 2000
> 3000
> 5000
>
> ... until you find an optimal level.   Then report the results to us!
>

Some results.  The patch mentioned is what Dave Cramer posted to the Performance
list on 4/21.

A Perl script monitored <vmstat 1> for 120 seconds and generated max and average
values.  Unfortunately, I am not present on site, so I cannot physically change
the device under test to increase the db load to where it hit about 10 days ago.
 That will have to wait till the 'real' work week on Monday.

Context switches -          avg    max

Default 7.4.1 code :       10665  69470
Default patch - 10 :       17297  21929
patch at 100       :       26825  87073
patch at 1000      :       37580 110849

Now granted, the db isn't showing the CS swap problem in a bad way (at all), but
should the numbers be trending the way they are with the patched code?  Or will
these numbers potentially change dramatically when I can load up the db?

And, presuming I can re-produce what I was seeing previously (200K CS/s), do you
folks want me to carry on with more testing of the patch and report the results?
 Or just go away and be quiet...

The information is provided from an HP ProLiant DL380 G3 with 2x 2.4 GHz Xeons
(with HT enabled), 2 GB RAM, running the 2.4.22-26mdkenterprise kernel, a RAID
controller w/128 MB battery-backed cache, RAID 1 on 2x 15K RPM drives for the WAL
drive, and RAID 0+1 on 4x 10K RPM drives for data.  The only job this box has is
running this db.

Cheers,
Rob

--
 21:54:48 up 2 days,  4:39,  4 users,  load average: 2.00, 2.03, 2.00
Linux 2.6.5-01 #7 SMP Fri Apr 16 22:45:31 MDT 2004

Attachment

Re: Wierd context-switching issue on Xeon

From
Dave Cramer
Date:
No, don't go away and be quiet.  Keep testing; it may be that under
normal operation the context switching goes up, but under the conditions
where you were seeing the high CS it may not be as bad.

As others have mentioned the real solution to this is to rewrite the
buffer management so that the lock isn't quite as coarse grained.

Dave
On Sat, 2004-05-01 at 00:03, Robert Creager wrote:
> When grilled further on (Thu, 29 Apr 2004 11:21:51 -0700),
> Josh Berkus <josh@agliodbs.com> confessed:
>
> > spins_per_delay was not beneficial.   Instead, try increasing them, one step
> > at a time:
> >
> > (take baseline measurement at 100)
> > 250
> > 500
> > 1000
> > 1500
> > 2000
> > 3000
> > 5000
> >
> > ... until you find an optimal level.   Then report the results to us!
> >
>
> Some results.  The patch mentioned is what Dave Cramer posted to the Performance
> list on 4/21.
>
> A Perl script monitored <vmstat 1> for 120 seconds and generated max and average
> values.  Unfortunately, I am not present on site, so I cannot physically change
> the device under test to increase the db load to where it hit about 10 days ago.
>  That will have to wait till the 'real' work week on Monday.
>
> Context switches -          avg    max
>
> Default 7.4.1 code :       10665  69470
> Default patch - 10 :       17297  21929
> patch at 100       :       26825  87073
> patch at 1000      :       37580 110849
>
> Now granted, the db isn't showing the CS swap problem in a bad way (at all), but
> should the numbers be trending the way they are with the patched code?  Or will
> these numbers potentially change dramatically when I can load up the db?
>
> And, presuming I can re-produce what I was seeing previously (200K CS/s), you
> folks want me to carry on with more testing of the patch and report the results?
>  Or just go away and be quiet...
>
> The information is provided from a HP Proliant DL380 G3 with 2x 2.4 GHZ Xenon's
> (with HT enabled) 2 GB ram, running 2.4.22-26mdkenterprise kernel, RAID
> controller w/128 Mb battery backed cache RAID 1 on 2x 15K RPM drives for WAL
> drive, RAID 0+1 on 4x 10K RPM drives for data.  The only job this box has is
> running this db.
>
> Cheers,
> Rob
--
Dave Cramer
519 939 0336
ICQ # 14675561


Re: Wierd context-switching issue on Xeon

From
Robert Creager
Date:
Found some co-workers at work yesterday to load up my library...

The sample period is 5 minutes long (vs 2 minutes previously):

Context switches -          avg    max

Default 7.4.1 code :       48784 107354
Default patch - 10 :       20400  28160
patch at 100       :       38574  85372
patch at 1000      :       41188 106569

The reading at 1000 was not produced under the same circumstances as the prior
readings as I had to replace my device under test with a simulated one.  The
real one died.

The previous run with smaller database and 120 second averages:

Context switches -          avg    max

Default 7.4.1 code :       10665  69470
Default patch - 10 :       17297  21929
patch at 100       :       26825  87073
patch at 1000      :       37580 110849

--
 20:13:50 up 3 days,  2:58,  4 users,  load average: 2.12, 2.14, 2.10
Linux 2.6.5-01 #7 SMP Fri Apr 16 22:45:31 MDT 2004

Attachment

Re: Wierd context-switching issue on Xeon

From
Dave Cramer
Date:
Robert,

The real question is does it help under real life circumstances ?

Did you do the tests with Tom's sql code that is designed to create high
context switchs ?

Dave
On Sun, 2004-05-02 at 11:20, Robert Creager wrote:
> Found some co-workers at work yesterday to load up my library...
>
> The sample period is 5 minutes long (vs 2 minutes previously):
>
> Context switches -          avg    max
>
> Default 7.4.1 code :       48784 107354
> Default patch - 10 :       20400  28160
> patch at 100       :       38574  85372
> patch at 1000      :       41188 106569
>
> The reading at 1000 was not produced under the same circumstances as the prior
> readings as I had to replace my device under test with a simulated one.  The
> real one died.
>
> The previous run with smaller database and 120 second averages:
>
> Context switches -          avg    max
>
> Default 7.4.1 code :       10665  69470
> Default patch - 10 :       17297  21929
> patch at 100       :       26825  87073
> patch at 1000      :       37580 110849
--
Dave Cramer
519 939 0336
ICQ # 14675561


Re: Wierd context-switching issue on Xeon

From
Robert Creager
Date:
When grilled further on (Sun, 02 May 2004 11:39:22 -0400),
Dave Cramer <pg@fastcrypt.com> confessed:

> Robert,
>
> The real question is does it help under real life circumstances ?

I'm not yet at the point where the CS's are causing appreciable delays.  I
should get there early this week and will be able to measure the relief your
patch may provide.

>
> Did you do the tests with Tom's sql code that is designed to create high
> context switchs ?

No, I'm using my queries/data.

Cheers,
Rob

--
 10:44:58 up 3 days, 17:30,  4 users,  load average: 2.00, 2.04, 2.01
Linux 2.6.5-01 #7 SMP Fri Apr 16 22:45:31 MDT 2004

Attachment

Re: Wierd context-switching issue on Xeon

From
Bruce Momjian
Date:
Did we ever come to a conclusion about excessive SMP context switching
under load?

---------------------------------------------------------------------------

Dave Cramer wrote:
> Robert,
>
> The real question is does it help under real life circumstances ?
>
> Did you do the tests with Tom's sql code that is designed to create high
> context switchs ?
>
> Dave
> On Sun, 2004-05-02 at 11:20, Robert Creager wrote:
> > Found some co-workers at work yesterday to load up my library...
> >
> > The sample period is 5 minutes long (vs 2 minutes previously):
> >
> > Context switches -          avg    max
> >
> > Default 7.4.1 code :       48784 107354
> > Default patch - 10 :       20400  28160
> > patch at 100       :       38574  85372
> > patch at 1000      :       41188 106569
> >
> > The reading at 1000 was not produced under the same circumstances as the prior
> > readings as I had to replace my device under test with a simulated one.  The
> > real one died.
> >
> > The previous run with smaller database and 120 second averages:
> >
> > Context switches -          avg    max
> >
> > Default 7.4.1 code :       10665  69470
> > Default patch - 10 :       17297  21929
> > patch at 100       :       26825  87073
> > patch at 1000      :       37580 110849
> --
> Dave Cramer
> 519 939 0336
> ICQ # 14675561
>
>
> ---------------------------(end of broadcast)---------------------------
> TIP 7: don't forget to increase your free space map settings
>

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073

Re: Wierd context-switching issue on Xeon

From
Robert Creager
Date:
When grilled further on (Wed, 19 May 2004 21:20:20 -0400 (EDT)),
Bruce Momjian <pgman@candle.pha.pa.us> confessed:

>
> Did we ever come to a conclusion about excessive SMP context switching
> under load?
>

I just figured out what was causing the problem on my system Monday.  I'm using
the pg_autovacuum daemon, and it was not vacuuming my db.  I've no idea why and
didn't get a chance to investigate.

This lack of vacuuming was causing a huge number of context switches and query
delays: queries that normally take 0.1 seconds were taking 11 seconds, and
the context switches were averaging 160k/s, peaking at 190k/s.

Unfortunately, I was under pressure to fix the db at the time so I didn't get a
chance to play with the patch.

I restarted the vacuum daemon, and will keep an eye on it to see if it behaves.

If the problem re-occurs, is it worthwhile to attempt the different patch
delay settings?

Cheers,
Rob

--
 19:45:40 up 21 days,  2:30,  4 users,  load average: 2.03, 2.09, 2.06
Linux 2.6.5-01 #7 SMP Fri Apr 16 22:45:31 MDT 2004

Attachment

Re: Wierd context-switching issue on Xeon

From
Tom Lane
Date:
Bruce Momjian <pgman@candle.pha.pa.us> writes:
> Did we ever come to a conclusion about excessive SMP context switching
> under load?

Yeah: it's bad.

Oh, you wanted a fix?  That seems harder :-(.  AFAICS we need a redesign
that causes less load on the BufMgrLock.  However, the traditional
solution to too-much-contention-for-a-lock is to break up the locked
data structure into finer-grained units, which means *more* lock
operations in total.  Normally you expect that the finer-grained lock
units will mean less contention.  But given that the issue here seems to
be trading physical ownership of the lock's cache line back and forth,
I'm afraid that the traditional approach would actually make things
worse.  The SMP issue seems to be not with whether there is
instantaneous contention for the locked datastructure, but with the cost
of making it possible for processor B to acquire a lock recently held by
processor A.

            regards, tom lane

Re: Wierd context-switching issue on Xeon

From
Tom Lane
Date:
Robert Creager <Robert_Creager@LogicalChaos.org> writes:
> I just figured out what was causing the problem on my system Monday.
> I'm using the pg_autovacuum daemon, and it was not vacuuming my db.

Do you have the post-7.4.2 datatype fixes for pg_autovacuum?

            regards, tom lane

Re: Wierd context-switching issue on Xeon

From
Bruce Momjian
Date:
Tom Lane wrote:
> Bruce Momjian <pgman@candle.pha.pa.us> writes:
> > Did we ever come to a conclusion about excessive SMP context switching
> > under load?
>
> Yeah: it's bad.
>
> Oh, you wanted a fix?  That seems harder :-(.  AFAICS we need a redesign
> that causes less load on the BufMgrLock.  However, the traditional
> solution to too-much-contention-for-a-lock is to break up the locked
> data structure into finer-grained units, which means *more* lock
> operations in total.  Normally you expect that the finer-grained lock
> units will mean less contention.  But given that the issue here seems to
> be trading physical ownership of the lock's cache line back and forth,
> I'm afraid that the traditional approach would actually make things
> worse.  The SMP issue seems to be not with whether there is
> instantaneous contention for the locked datastructure, but with the cost
> of making it possible for processor B to acquire a lock recently held by
> processor A.

I see.  I don't even see a TODO in there.  :-(

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073

Re: Wierd context-switching issue on Xeon

From
Robert Creager
Date:
When grilled further on (Wed, 19 May 2004 22:42:26 -0400),
Tom Lane <tgl@sss.pgh.pa.us> confessed:

> Robert Creager <Robert_Creager@LogicalChaos.org> writes:
> > I just figured out what was causing the problem on my system Monday.
> > I'm using the pg_autovacuum daemon, and it was not vacuuming my db.
>
> Do you have the post-7.4.2 datatype fixes for pg_autovacuum?

No.  I'm still running 7.4.1 w/associated contrib.  I guess an upgrade is in
order then.  I'm currently downloading 7.4.2 to see what the change is that I
need.  Is it just the 7.4.2 pg_autovacuum that is needed here?

I've caught a whiff that 7.4.3 is nearing release?  Any idea when?

Thanks,
Rob

--
 20:45:52 up 21 days,  3:30,  4 users,  load average: 2.02, 2.05, 2.05
Linux 2.6.5-01 #7 SMP Fri Apr 16 22:45:31 MDT 2004


Re: Wierd context-switching issue on Xeon

From
Tom Lane
Date:
Bruce Momjian <pgman@candle.pha.pa.us> writes:
> Tom Lane wrote:
>> ...  The SMP issue seems to be not with whether there is
>> instantaneous contention for the locked datastructure, but with the cost
>> of making it possible for processor B to acquire a lock recently held by
>> processor A.

> I see.  I don't even see a TODO in there.  :-(

Nothing more specific than "investigate SMP context switching issues",
anyway.  We are definitely in a research mode here, rather than an
engineering mode.

ObQuote: "Research is what I am doing when I don't know what I am
doing." - attributed to Wernher von Braun, but has anyone got a
definitive reference?

            regards, tom lane

Re: Wierd context-switching issue on Xeon

From
Tom Lane
Date:
Robert Creager <Robert_Creager@LogicalChaos.org> writes:
> Tom Lane <tgl@sss.pgh.pa.us> confessed:
>> Do you have the post-7.4.2 datatype fixes for pg_autovacuum?

> No.  I'm still running 7.4.1 w/associated contrib.  I guess an upgrade is in
> order then.  I'm currently downloading 7.4.2 to see what the change is that I
> need.  Is it just the 7.4.2 pg_autovacuum that is needed here?

Nope, the fixes I was thinking about just missed the 7.4.2 release.
I think you can only get them from CVS.  (Maybe we should offer a
nightly build of the latest stable release branch, not only development
tip...)

> I've caught a whiff that 7.4.3 is nearing release?  Any idea when?

Not scheduled yet, but there was talk of pushing one out before 7.5 goes
into feature freeze.

            regards, tom lane

Re: Wierd context-switching issue on Xeon

From
Bruce Momjian
Date:
OK, added to TODO:

    * Investigate SMP context switching issues


---------------------------------------------------------------------------

Tom Lane wrote:
> Bruce Momjian <pgman@candle.pha.pa.us> writes:
> > Tom Lane wrote:
> >> ...  The SMP issue seems to be not with whether there is
> >> instantaneous contention for the locked datastructure, but with the cost
> >> of making it possible for processor B to acquire a lock recently held by
> >> processor A.
>
> > I see.  I don't even see a TODO in there.  :-(
>
> Nothing more specific than "investigate SMP context switching issues",
> anyway.  We are definitely in a research mode here, rather than an
> engineering mode.
>
> ObQuote: "Research is what I am doing when I don't know what I am
> doing." - attributed to Wernher von Braun, but has anyone got a
> definitive reference?
>
>             regards, tom lane
>

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073

Re: Wierd context-switching issue on Xeon

From
Bruce Momjian
Date:
Tom Lane wrote:
> Robert Creager <Robert_Creager@LogicalChaos.org> writes:
> > Tom Lane <tgl@sss.pgh.pa.us> confessed:
> >> Do you have the post-7.4.2 datatype fixes for pg_autovacuum?
>
> > No.  I'm still running 7.4.1 w/associated contrib.  I guess an upgrade is in
> > order then.  I'm currently downloading 7.4.2 to see what the change is that I
> > need.  Is it just the 7.4.2 pg_autovacuum that is needed here?
>
> Nope, the fixes I was thinking about just missed the 7.4.2 release.
> I think you can only get them from CVS.  (Maybe we should offer a
> nightly build of the latest stable release branch, not only development
> tip...)
>
> > I've caught a whiff that 7.4.3 is nearing release?  Any idea when?
>
> Not scheduled yet, but there was talk of pushing one out before 7.5 goes
> into feature freeze.

We need the temp table autovacuum fix before we do 7.4.3.

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073

Re: Wierd context-switching issue on Xeon

From
"Matthew T. O'Connor"
Date:
On Wed, 2004-05-19 at 21:59, Robert Creager wrote:
> When grilled further on (Wed, 19 May 2004 21:20:20 -0400 (EDT)),
> Bruce Momjian <pgman@candle.pha.pa.us> confessed:
>
> >
> > Did we ever come to a conclusion about excessive SMP context switching
> > under load?
> >
>
> I just figured out what was causing the problem on my system Monday.  I'm using
> the pg_autovacuum daemon, and it was not vacuuming my db.  I've no idea why and
> didn't get a chance to investigate.

Strange.  There is a known bug in the 7.4.2 version of pg_autovacuum
related to data type mismatches, which is fixed in CVS.  But that bug
doesn't cause pg_autovacuum to stop vacuuming; rather it causes it to
vacuum too often.  So perhaps this is a different issue?  Please let me
know what you find.

Thanks,

Matthew O'Connor



Re: Wierd context-switching issue on Xeon

From
Christopher Browne
Date:
In an attempt to throw the authorities off his trail, tgl@sss.pgh.pa.us (Tom Lane) transmitted:
> ObQuote: "Research is what I am doing when I don't know what I am
> doing." - attributed to Wernher von Braun, but has anyone got a
> definitive reference?

<http://www.quotationspage.com/search.php3?Author=Wernher+von+Braun&file=other>

That points to a bunch of seemingly authoritative sources...
--
(reverse (concatenate 'string "moc.enworbbc" "@" "enworbbc"))
http://www.ntlug.org/~cbbrowne/lsf.html
"Terrrrrific." -- Ford Prefect

Re: Wierd context-switching issue on Xeon

From
Josh Berkus
Date:
Guys,

> Oh, you wanted a fix?  That seems harder :-(.  AFAICS we need a redesign
> that causes less load on the BufMgrLock.

FWIW, we've been pursuing two routes of quick patch fixes.

1) Dave Cramer and I have been testing setting varying rates of spin_delay in
an effort to find a "sweet spot" that the individual system seems to like.
This has been somewhat delayed by my illness.

2) The OSDL folks have been trying various patches to use Linux 2.6 futexes in
place of semops (if I have that right) which, if successful, would produce a
Linux-specific fix.   However, they haven't yet come up with a stable version
of the patch.

I'm really curious, BTW, about how all of Jan's changes to buffer usage in 7.5
affect this issue.   Has anyone tested it on a recent snapshot?

--
-Josh Berkus
 Aglio Database Solutions
 San Francisco


Re: Wierd context-switching issue on Xeon

From
Tom Lane
Date:
Josh Berkus <josh@agliodbs.com> writes:
> I'm really curious, BTW, about how all of Jan's changes to buffer
> usage in 7.5 affect this issue.  Has anyone tested it on a recent
> snapshot?

Won't help.

(1) Theoretical argument: the problem case is select-only and touches
few enough buffers that it need never visit the kernel.  The buffer
management algorithm is thus irrelevant since there are never any
decisions for it to make.  If anything CVS tip will have a worse problem
because its more complicated management algorithm needs to spend longer
holding the BufMgrLock.

(2) Experimental argument: I believe that I did check the self-contained
test case we eventually developed against CVS tip on one of Red Hat's
SMP machines, and indeed it was unhappy.

            regards, tom lane