Excessive context switching on SMP Xeons

From: Bill Montgomery

All,

I realize the excessive-context-switching-on-xeon issue has been
discussed at length in the past, but I wanted to follow up and verify my
conclusion from those discussions:

On a 2-way or 4-way Xeon box, there is no way to avoid excessive
(30,000-60,000 per second) context switches when using PostgreSQL 7.4.5
to query a data set small enough to fit into main memory under a
significant load.

I am experiencing said symptom on two different dual-Xeon boxes, both
Dells with ServerWorks chipsets, running the latest RH9 and RHEL3
kernels, respectively. The databases are 90% read, 10% write, and are
small enough to fit entirely into main memory, between pg shared buffers
and kernel buffers.
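
(For reference, those rates are what the "cs" column of vmstat shows while
the load is applied; sar reports the same figure if sysstat is installed:

$ vmstat 1       # "cs" column = context switches per second
$ sar -w 1 10    # cswch/s
)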

We recently invested in a solid-state storage device
(http://www.superssd.com/products/ramsan-320/) to improve write
performance. Our entire pg data directory is stored on it. Regrettably
(and in retrospect, unsurprisingly) we found that opening up the I/O
bottleneck does little for write performance when the server is under
load, due to the bottleneck created by excessive context switching. Is
the only solution then to move to a different SMP architecture such as
Itanium 2 or Opteron? If so, should we expect to see an additional
benefit from running PostgreSQL on a 64-bit architecture, versus 32-bit,
context switching aside? Alternatively, are there good 32-bit SMP
architectures to consider other than Xeon, given the high cost of
Itanium 2 and Opteron systems?

More generally, how have others scaled "up" their PostgreSQL
environments? We will eventually have to invent some "outward"
scalability within the logic of our application (e.g. do read-only
transactions against a pool of Slony-I subscribers), but in the short
term we still have an urgent need to scale upward. Thoughts? General wisdom?

Best Regards,

Bill Montgomery

Re: Excessive context switching on SMP Xeons

From: Josh Berkus

Bill,

> I realize the excessive-context-switching-on-xeon issue has been
> discussed at length in the past, but I wanted to follow up and verify my
> conclusion from those discussions:

First off, the good news: Gavin Sherry and OSDL may have made some progress
on this.   We'll be testing as soon as OSDL gets the Scalable Test Platform
running again.   If you have the CS problem (which I don't think you do, see
below) and a test box, I'd be thrilled to have you test it.

> On a 2-way or 4-way Xeon box, there is no way to avoid excessive
> (30,000-60,000 per second) context switches when using PostgreSQL 7.4.5
> to query a data set small enough to fit into main memory under a
> significant load.

Hmmm ... some clarification:
1) I don't really consider a CS of 30,000 to 60,000 on Xeon to be excessive.
People demonstrating the problem on dual or quad Xeon reported CS levels of
150,000 or more.    So you probably don't have this issue at all -- depending
on the load, your level could be considered "normal".

2) The problem is not limited to Xeon, Linux, or x86 architecture.    It has
been demonstrated, for example, on 8-way Solaris machines.    It's just worse
(and thus more noticeable) on Xeon.

> I am experiencing said symptom on two different dual-Xeon boxes, both
> Dells with ServerWorks chipsets, running the latest RH9 and RHEL3
> kernels, respectively. The databases are 90% read, 10% write, and are
> small enough to fit entirely into main memory, between pg shared buffers
> and kernel buffers.

Ah.  Well, you do have the worst possible architecture for PostgreSQL-SMP
performance.   The ServerWorks chipset is badly flawed (the company is now, I
believe, bankrupt from recalled products) and, based on published tests,
Xeons have several performance issues with databases.

> We recently invested in a solid-state storage device
> (http://www.superssd.com/products/ramsan-320/) to improve write
> performance. Our entire pg data directory is stored on it. Regrettably
> (and in retrospect, unsurprisingly) we found that opening up the I/O
> bottleneck does little for write performance when the server is under
> load, due to the bottleneck created by excessive context switching.

Well, if you're CPU-bound, improved I/O won't help you, no.

> Is
> the only solution then to move to a different SMP architecture such as
> Itanium 2 or Opteron? If so, should we expect to see an additional
> benefit from running PostgreSQL on a 64-bit architecture, versus 32-bit,
> context switching aside?

Your performance will almost certainly be better for a variety of reasons on
Opteron/Itanium.    However, I'm still not convinced that you have the CS
bug.

> Alternatively, are there good 32-bit SMP
> architectures to consider other than Xeon, given the high cost of
> Itanium 2 and Opteron systems?

AthlonMP appears to be less susceptible to the CS bug than Xeon, and the
effect of the bug is not as severe.   However, a quad-Opteron box can be
built for less than $6000; what's your standard for "expensive"?   If you
don't have that much money, then you may be stuck for options.

> More generally, how have others scaled "up" their PostgreSQL
> environments? We will eventually have to invent some "outward"
> scalability within the logic of our application (e.g. do read-only
> transactions against a pool of Slony-I subscribers), but in the short
> term we still have an urgent need to scale upward. Thoughts? General
> wisdom?

As long as you're on x86, scaling outward is the way to go.   If you want to
continue to scale upwards, ask Andrew Sullivan about his experiences running
PostgreSQL on big IBM boxes.   But if you consider a quad-Opteron server
expensive, I don't think that's an option for you.

Overall, though, I'm not convinced that you have the CS bug and I think it's
more likely that you have a few "bad queries" which are dragging down the
whole system.    Troubleshoot those and your CPU-bound problems may go away.
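
One low-impact way to hunt for them on 7.4 -- the threshold below is
illustrative, not a recommendation -- is to log slow statements and then
EXPLAIN the offenders:

# postgresql.conf: log any statement slower than 250 ms
log_min_duration_statement = 250    # in milliseconds; -1 disables
# then, in psql, examine each query that shows up in the log:
#   EXPLAIN ANALYZE <the logged query>;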

--
Josh Berkus
Aglio Database Solutions
San Francisco

Re: Excessive context switching on SMP Xeons

From: Bill Montgomery

Thanks for the helpful response.

Josh Berkus wrote:

>First off, the good news: Gavin Sherry and OSDL may have made some progress
>on this.   We'll be testing as soon as OSDL gets the Scalable Test Platform
>running again.   If you have the CS problem (which I don't think you do, see
>below) and a test box, I'd be thrilled to have you test it.

I'd be thrilled to test it too, if for no other reason than to determine
whether what I'm experiencing really is the "CS problem".

>1) I don't really consider a CS of 30,000 to 60,000 on Xeon to be excessive.
>People demonstrating the problem on dual or quad Xeon reported CS levels of
>150,000 or more.    So you probably don't have this issue at all -- depending
>on the load, your level could be considered "normal".
>

Fair enough. I never see nearly this much context switching on my dual
Xeon boxes running dozens (sometimes hundreds) of concurrent apache
processes, but I'll concede this could just be due to the more parallel
nature of a bunch of independent apache workers.

>>I am experiencing said symptom on two different dual-Xeon boxes, both
>>Dells with ServerWorks chipsets, running the latest RH9 and RHEL3
>>kernels, respectively. The databases are 90% read, 10% write, and are
>>small enough to fit entirely into main memory, between pg shared buffers
>>and kernel buffers.
>>
>
>Ah.  Well, you do have the worst possible architecture for PostgreSQL-SMP
>performance.   The ServerWorks chipset is badly flawed (the company is now, I
>believe, bankrupt from recalled products) and, based on published tests,
>Xeons have several performance issues with databases.
>

Hence my desire for recommendations on alternate architectures ;-)

>AthlonMP appears to be less susceptible to the CS bug than Xeon, and the
>effect of the bug is not as severe.   However, a quad-Opteron box can be
>built for less than $6000; what's your standard for "expensive"?   If you
>don't have that much money, then you may be stuck for options.
>

Being a 24x7x365 shop, and these servers being mission critical, I
require vendors that can offer 24x7 4-hour part replacement, like Dell
or IBM. I haven't seen 4-way 64-bit boxes meeting that requirement for
less than $20,000, and that's for a very minimally configured box. A
suitably configured pair will likely end up costing $50,000 or more. I
would like to avoid an unexpected expense of that size, unless there's
no other good alternative. That said, I'm all ears for a cheaper
alternative that meets my support and performance requirements.

>Overall, though, I'm not convinced that you have the CS bug and I think it's
>more likely that you have a few "bad queries" which are dragging down the
>whole system.    Troubleshoot those and your CPU-bound problems may go away.
>

You may be right, but to compare apples to apples, here's some vmstat
output from a pgbench run:

[billm@xxx billm]$ pgbench -i -s 20 pgbench
<snip>
[billm@xxx billm]$ pgbench -s 20 -t 500 -c 100 pgbench
starting vacuum...end.
transaction type: TPC-B (sort of)
scaling factor: 20
number of clients: 100
number of transactions per client: 500
number of transactions actually processed: 50000/50000
tps = 369.717832 (including connections establishing)
tps = 370.852058 (excluding connections establishing)

and some of the vmstat output...

[billm@poe billm]$ vmstat 1
procs                      memory      swap          io     system         cpu
 r  b   swpd   free   buff  cache   si   so    bi    bo   in    cs us sy wa id
 0  1      0 863108 220620 1571924    0    0     4    64   34    50  1  0  0 98
 0  1      0 863092 220620 1571932    0    0     0  3144  171  2037  3  3 47 47
 0  1      0 863084 220620 1571956    0    0     0  5840  202  3702  6  3 46 45
 1  1      0 862656 220620 1572420    0    0     0 12948  631 42093 69 22  5  5
11  0      0 862188 220620 1572828    0    0     0 12644  531 41330 70 23  2  5
 9  0      0 862020 220620 1573076    0    0     0  8396  457 28445 43 17 17 22
 9  0      0 861620 220620 1573556    0    0     0 13564  726 44330 72 22  2  5
 8  1      0 861248 220620 1573980    0    0     0 12564  660 43667 65 26  2  7
 3  1      0 860704 220624 1574236    0    0     0 14588  646 41176 62 25  5  8
 0  1      0 860440 220624 1574476    0    0     0 42184  865 31704 44 23 15 18
 8  0      0 860320 220624 1574628    0    0     0 10796  403 19971 31 10 29 29
 0  1      0 860040 220624 1574884    0    0     0 23588  654 36442 49 20 13 17
 0  1      0 859984 220624 1574932    0    0     0  4940  229  3884  5  3 45 46
 0  1      0 859940 220624 1575004    0    0     0 12140  355 13454 20 10 35 35
 0  1      0 859904 220624 1575044    0    0     0  5044  218  6922 11  5 41 43
 1  1      0 859868 220624 1575052    0    0     0  4808  199  2029  3  3 47 48
 0  1      0 859720 220624 1575180    0    0     0 21596  485 18075 28 13 29 30
11  1      0 859372 220624 1575532    0    0     0 24520  609 41409 62 33  2  3

While pgbench does not generate quite as high a number of CS as our app,
it is an apples-to-apples comparison, and rules out the possibility of
poorly written queries in our app. Still, 40k CS/sec seems high to me.
While pgbench is just a synthetic benchmark, and not necessarily the
best benchmark, yada yada, 370 tps seems like pretty poor performance.
I've benchmarked the IO subsystem at 70MB/s of random 8k writes, yet
pgbench typically doesn't use more than 10MB/s of that bandwidth (a
little more at checkpoints).
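
(The write rate is visible in the "bo" column of the vmstat output above, in
KB/s; iostat from sysstat gives the same picture per device:

$ iostat -x 1
)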

So I guess the question is this: now that I've opened up the IO
bottleneck that exists on most database servers, am I really truly CPU
bound now, and not just suffering from poorly handled spinlocks on my
Xeon/ServerWorks platform? If so, is the expense of a 64-bit system
worth it, or is the price/performance for PostgreSQL still better on an
alternative 32-bit platform, like AthlonMP?

Best Regards,

Bill Montgomery

Re: Excessive context switching on SMP Xeons

From: Gaetano Mendola

Bill Montgomery wrote:
> All,
>
> I realize the excessive-context-switching-on-xeon issue has been
> discussed at length in the past, but I wanted to follow up and verify my
> conclusion from those discussions:
>
> On a 2-way or 4-way Xeon box, there is no way to avoid excessive
> (30,000-60,000 per second) context switches when using PostgreSQL 7.4.5
> to query a data set small enough to fit into main memory under a
> significant load.
>
> I am experiencing said symptom on two different dual-Xeon boxes, both
> Dells with ServerWorks chipsets, running the latest RH9 and RHEL3
> kernels, respectively. The databases are 90% read, 10% write, and are
> small enough to fit entirely into main memory, between pg shared buffers
> and kernel buffers.
>

I don't know if my box is loaded heavily enough, but I have a dual-Xeon box
from DELL with HT enabled and I'm not experiencing this kind of CS
problem; normally our CS is around 100,000 per second.

# cat /proc/version
Linux version 2.4.9-e.24smp (bhcompile@porky.devel.redhat.com) (gcc version 2.96 20000731 (Red Hat Linux 7.2
2.96-118.7.2)) #1 SMP Tue May 27 16:07:39 EDT 2003


# cat /proc/cpuinfo
processor       : 0
vendor_id       : GenuineIntel
cpu family      : 15
model           : 2
model name      : Intel(R) Xeon(TM) CPU 2.80GHz
stepping        : 7
cpu MHz         : 2787.139
cache size      : 512 KB
fdiv_bug        : no
hlt_bug         : no
f00f_bug        : no
coma_bug        : no
fpu             : yes
fpu_exception   : yes
cpuid level     : 2
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm
bogomips        : 5557.45

[snip: processors 1-3 are identical]

Regards
Gaetano Mendola

Re: Excessive context switching on SMP Xeons

From: Josh Berkus

Bill,

> I'd be thrilled to test it too, if for no other reason than to determine
> whether what I'm experiencing really is the "CS problem".

Hmmm ... Gavin's patch is built against 8.0, and any version of the patch
would require linux 2.6, probably 2.6.7 minimum.   Can you test on that linux
version?   Do you have the resources to back-port Gavin's patch?

> Fair enough. I never see nearly this much context switching on my dual
> Xeon boxes running dozens (sometimes hundreds) of concurrent apache
> processes, but I'll concede this could just be due to the more parallel
> nature of a bunch of independent apache workers.

Certainly could be.  Heavy CSes only happen when you have a number of
long-running processes with contention for RAM in my experience.  If Apache
is dispatching things quickly enough, they'd never arise.

> Hence my desire for recommendations on alternate architectures ;-)

Well, you could certainly stay on Xeon if there's better support availability.
Just get off Dell *650's.

> Being a 24x7x365 shop, and these servers being mission critical, I
> require vendors that can offer 24x7 4-hour part replacement, like Dell
> or IBM. I haven't seen 4-way 64-bit boxes meeting that requirement for
> less than $20,000, and that's for a very minimally configured box. A
> suitably configured pair will likely end up costing $50,000 or more. I
> would like to avoid an unexpected expense of that size, unless there's
> no other good alternative. That said, I'm all ears for a cheaper
> alternative that meets my support and performance requirements.

No, you're going to pay through the nose for that support level.   It's how
things work.

> tps = 369.717832 (including connections establishing)
> tps = 370.852058 (excluding connections establishing)

Doesn't seem too bad to me.   Have anything to compare it to?

What's in your postgresql.conf?

--Josh

Re: Excessive context switching on SMP Xeons

From: Alan Stange

A few quick random observations on the Xeon v. Opteron comparison:

- running a dual Xeon with hyperthreading turned on really isn't the
same as having a quad cpu system.   I haven't seen postgresql specific
benchmarks, but the general case has been that HT is a benefit in a few
particular workloads but offers no benefit in general.

- We're running postgresql 8 (in production!) on a dual Opteron 250,
Linux 2.6, 8GB memory, 1.7TB of attached fiber channel disk, etc.   This
machine is fast.    A dual 2.8 Ghz Xeon with 512K caches (with or
without HT enabled) simply won't be in the same performance league as
this dual Opteron system (assuming identical disk systems, etc).  We run
a Linux 2.6 kernel because it scales under load so much better than the
2.4 kernels.

The units we're using (and we have a lot of them) are SunFire v20z.  You
can get a dualie Opteron 250 for $7K with 4GB memory from Sun.  My
personal experience with this setup in a mission critical config is to
not depend on 4 hour spare parts, but to spend the money and install the
spare in the rack.   Naturally, one can go cheaper with slower cpus,
different vendors, etc.


I don't care to go into the whole debate of Xeon v. Opteron here.   We
also have a lot of dual Xeon systems. In every comparison I've done with
our codes, the dual Opteron clearly outperforms the dual Xeon, when
running on one and both cpus.


-- Alan

Re: Excessive context switching on SMP Xeons

From: Greg Stark

Alan Stange <stange@rentec.com> writes:

> A few quick random observations on the Xeon v. Opteron comparison:
>
> - running a dual Xeon with hyperthreading turned on really isn't the same as
> having a quad cpu system. I haven't seen postgresql specific benchmarks, but
> the general case has been that HT is a benefit in a few particular
> workloads but offers no benefit in general.

Part of the FUD about hyperthreading did have a kernel of truth that lay in
older kernels' schedulers. For example, with Linux until recently the kernel
could easily end up scheduling two processes on the two virtual processors of
a single physical processor, leaving the other physical processor totally
idle.

With modern kernels' schedulers I would expect hyperthreading to live up to
its billing of adding 10% to 20% performance. I.e., a dual Xeon machine with
hyperthreading won't be as fast as four processors, but it should be 10-20%
faster than a dual Xeon without hyperthreading.

As with all things, that will only help if you're bound by the right limited
resource to begin with. If you're I/O bound it isn't going to help. I would
expect Postgres with its heavy demand on memory bandwidth and shared memory
could potentially benefit more than usual from being able to context switch
during pipeline stalls.
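
(On a 2.6 kernel you can see how the logical processors map onto physical
packages straight from /proc/cpuinfo -- a quick sketch:

$ grep -E 'processor|physical id|siblings' /proc/cpuinfo
# with HT on, two logical processors share each "physical id" and
# "siblings" reads 2; with HT off, every physical id appears once
)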

--
greg

Re: Excessive context switching on SMP Xeons

From: Alan Stange

Greg Stark wrote:

>Alan Stange <stange@rentec.com> writes:
>
>
>>A few quick random observations on the Xeon v. Opteron comparison:
>>
>>- running a dual Xeon with hyperthreading turned on really isn't the same as
>>having a quad cpu system. I haven't seen postgresql specific benchmarks, but
>>the general case has been that HT is a benefit in a few particular
>>workloads but offers no benefit in general.
>>
>>
>Part of the FUD about hyperthreading did have a kernel of truth that lay in
>older kernels' schedulers. For example, with Linux until recently the kernel
>could easily end up scheduling two processes on the two virtual processors of
>a single physical processor, leaving the other physical processor totally
>idle.
>
>With modern kernels' schedulers I would expect hyperthreading to live up to
>its billing of adding 10% to 20% performance. I.e., a dual Xeon machine with
>hyperthreading won't be as fast as four processors, but it should be 10-20%
>faster than a dual Xeon without hyperthreading.
>
>As with all things, that will only help if you're bound by the right limited
>resource to begin with. If you're I/O bound it isn't going to help. I would
>expect Postgres with its heavy demand on memory bandwidth and shared memory
>could potentially benefit more than usual from being able to context switch
>during pipeline stalls.
>
All true.   I'd be surprised if HT on an older 2.8 Ghz Xeon with only a
512K cache would show any real benefit.   The dual Xeon is already memory
starved; further increasing the memory pressure on the caches (because
the 512K is now "shared" by two virtual processors) means you probably
won't see a gain.  It's memory stalls all around.  To be clear, the
context switch in this case isn't a kernel context switch but a "virtual
cpu" context switch.

The probable reason we see dual Opteron boxes way outperforming dual
Xeon boxes is exactly because of Postgresql's heavy demand on memory.
The Opterons have a much better memory system.

A quick search on google or digging around in the comp.arch archives
will provide lots of details.    HP's web site has (had?) some
benchmarks comparing these systems.  HP sells both Xeon and Opteron
systems, so the comparisons were quite "fair".  Their numbers showed the
Opteron handily outperforming the Xeons.

-- Alan

Re: Excessive context switching on SMP Xeons

From: Bill Montgomery

Josh Berkus wrote:

>>I'd be thrilled to test it too, if for no other reason than to determine
>>whether what I'm experiencing really is the "CS problem".
>>
>>
>
>Hmmm ... Gavin's patch is built against 8.0, and any version of the patch
>would require linux 2.6, probably 2.6.7 minimum.   Can you test on that linux
>version?   Do you have the resources to back-port Gavin's patch?
>
>

I don't currently have any SMP Xeon systems running a 2.6 kernel, but it
could be arranged. As for back-porting the patch to 7.4.5, probably so,
but I'd have to see it first.

>>tps = 369.717832 (including connections establishing)
>>tps = 370.852058 (excluding connections establishing)
>>
>>
>
>Doesn't seem too bad to me.   Have anything to compare it to?
>
>

Yes, about 280 tps on the same machine with the data directory on a
3-disk RAID 5 w/ a 128MB cache, rather than the SSD. I was expecting a
much larger increase, given that the RAID does about 3MB/s of random 8k
writes, and the SSD device does about 70MB/s of random 8k writes. Said
differently, I thought my CPU bottleneck would be much higher, so as to
allow for more than a 30% increase in pgbench TPS when I took the IO
bottleneck out of the equation. (That said, I'm not tuning for pgbench,
but it is a useful comparison that everyone on the list is familiar
with, and takes out the possibility that my app just has a bunch of
poorly written queries).

>What's in your postgresql.conf?
>
>

Some relevant parameters:
shared_buffers = 16384
sort_mem = 2048
vacuum_mem = 16384
max_fsm_pages = 200000
max_fsm_relations = 10000
fsync = true
wal_sync_method = fsync
wal_buffers = 32
checkpoint_segments = 6
effective_cache_size = 262144
random_page_cost = 0.25

Everything else is left at the default (or not relevant to this post).
Anything blatantly stupid in there for my setup?

Thanks,

Bill Montgomery

Re: Excessive context switching on SMP Xeons

From: SZUCS Gábor

Hmmm...

I may be mistaken (I think last time I read about optimization params was in
7.3 docs), but doesn't RPC < 1 mean that random read is faster than
sequential read? In your case, do you really think reading randomly is 4x
faster than reading sequentially? Doesn't seem to make sense, even with a
zillion-disk array. Theoretically.

Also not sure, but sort_mem and vacuum_mem seem to be too small to me.
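
Something closer to conventional values for a mostly-cached database might
be the following (numbers illustrative, not tuned for your box):

random_page_cost = 2      # values below 1 claim a random read beats a sequential one
sort_mem = 8192           # in kB per sort; 2048 is only 2MB
vacuum_mem = 65536        # in kB; speeds VACUUM on large tables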

G.
%----------------------- cut here -----------------------%
\end

----- Original Message -----
From: "Bill Montgomery" <billm@lulu.com>
Sent: Wednesday, October 06, 2004 5:45 PM


> Some relevant parameters:
> shared_buffers = 16384
> sort_mem = 2048
> vacuum_mem = 16384
> max_fsm_pages = 200000
> max_fsm_relations = 10000
> fsync = true
> wal_sync_method = fsync
> wal_buffers = 32
> checkpoint_segments = 6
> effective_cache_size = 262144
> random_page_cost = 0.25


Re: Excessive context switching on SMP Xeons

From: Gaetano Mendola

Alan Stange wrote:
> A few quick random observations on the Xeon v. Opteron comparison:

[SNIP]

> I don't care to go into the whole debate of Xeon v. Opteron here.   We
> also have a lot of dual Xeon systems. In every comparison I've done with
> our codes, the dual Opteron clearly outperforms the dual Xeon, when
> running on one and both cpus.

Here http://www6.tomshardware.com/cpu/20030422/ both were tested and there is
a database performance section; unfortunately they used MySQL.


Regards
Gaetano Mendola

Re: Excessive context switching on SMP Xeons

From: Alan Stange

Here's a few numbers from the Opteron 250.  If I get some time I'll post
a more comprehensive comparison including some other systems.

The system is a Sun v20z.  Dual Opteron 250, 2.4Ghz, Linux 2.6, 8 GB
memory.   I did a compile and install of pg 8.0 beta 3.  I created a
database on a tmpfs file system and ran pgbench.  Everything was "out
of the box", meaning I did not tweak any config files.

I used this for pgbench:
$ pgbench -i -s 32

and this for pgbench invocations:
$ pgbench -s 32 -c 1 -t 10000 -v


clients      tps
1            1290
2            1780
4            1760
8            1680
16           1376
32            904


How are these results useful?  In some sense, this is a speed of light
number for the Opteron 250.   You'll never go faster on this system with
a real storage subsystem involved instead of a tmpfs file system.   It's
also a set of numbers that anyone else can reproduce as we don't have to
deal with any differences in file systems, disk subsystems, networking,
etc.   Finally, it's a set of results that anyone else can compute on
Xeons or other systems and make simple (and naive) comparisons.


Just to stay on topic:   vmstat reported about 30K cs / second while
this was running the 1 and 2 client cases.

-- Alan


Re: Excessive context switching on SMP Xeons

From: Bill Montgomery

Alan Stange wrote:

> Here's a few numbers from the Opteron 250.  If I get some time I'll
> post a more comprehensive comparison including some other systems.
>
> The system is a Sun v20z.  Dual Opteron 250, 2.4Ghz, Linux 2.6, 8 GB
> memory.   I did a compile and install of pg 8.0 beta 3.  I created a
> database on a tmpfs file system and ran pgbench.  Everything was "out
> of the box", meaning I did not tweak any config files.
>
> I used this for pgbench:
> $ pgbench -i -s 32
>
> and this for pgbench invocations:
> $ pgbench -s 32 -c 1 -t 10000 -v
>
>
> clients      tps
> 1            1290
> 2            1780
> 4            1760
> 8            1680
> 16           1376
> 32            904


The same test on a Dell PowerEdge 1750, Dual Xeon 3.2 GHz, 512k cache,
HT on, Linux 2.4.21-20.ELsmp (RHEL 3), 4GB memory, pg 7.4.5:

$ pgbench -i -s 32 pgbench
$ pgbench -s 32 -c 1 -t 10000 -v

clients   tps   avg CS/sec
-------  -----  ----------
      1    601      48,000
      2    889      77,000
      4   1006      80,000
      8    985      59,000
     16    966      47,000
     32    913      46,000

Far less performance than the dual Opterons with a low number of
clients, but the gap narrows as the number of clients goes up. Anyone
smarter than me care to explain?

Anyone have a 4-way Opteron to run the same benchmark on?

-Bill

> How are these results useful?  In some sense, this is a speed of light
> number for the Opteron 250.   You'll never go faster on this system
> with a real storage subsystem involved instead of a tmpfs file
> system.   It's also a set of numbers that anyone else can reproduce as
> we don't have to deal with any differences in file systems, disk
> subsystems, networking, etc.   Finally, it's a set of results that
> anyone else can compute on Xeons or other systems and make simple
> (and naive) comparisons.
>
>
> Just to stay on topic:   vmstat reported about 30K cs / second while
> this was running the 1 and 2 client cases.
>
> -- Alan



Re: Excessive context switching on SMP Xeons

From: Michael Adler

On Thu, Oct 07, 2004 at 11:48:41AM -0400, Bill Montgomery wrote:
> Alan Stange wrote:
>
> The same test on a Dell PowerEdge 1750, Dual Xeon 3.2 GHz, 512k cache,
> HT on, Linux 2.4.21-20.ELsmp (RHEL 3), 4GB memory, pg 7.4.5:
>
> Far less performance than the dual Opterons with a low number of
> clients, but the gap narrows as the number of clients goes up. Anyone
> smarter than me care to explain?

You'll have to wait for someone smarter than you, but I will posit
this: Did you use a tmpfs filesystem like Alan? You didn't mention
either way. Alan did that as an attempt to remove IO as a variable.

-Mike

Re: Excessive context switching on SMP Xeons

From: Alan Stange

Bill Montgomery wrote:

> Alan Stange wrote:
>
>> Here's a few numbers from the Opteron 250.  If I get some time I'll
>> post a more comprehensive comparison including some other systems.
>>
>> The system is a Sun v20z.  Dual Opteron 250, 2.4Ghz, Linux 2.6, 8 GB
>> memory.   I did a compile and install of pg 8.0 beta 3.  I created a
>> database on a tmpfs file system and ran pgbench.  Everything was
>> "out of the box", meaning I did not tweak any config files.
>>
>> I used this for pgbench:
>> $ pgbench -i -s 32
>>
>> and this for pgbench invocations:
>> $ pgbench -s 32 -c 1 -t 10000 -v
>>
>>
>> clients      tps
>> 1            1290
>> 2            1780
>> 4            1760
>> 8            1680
>> 16           1376
>> 32            904
>
>
>
> The same test on a Dell PowerEdge 1750, Dual Xeon 3.2 GHz, 512k cache,
> HT on, Linux 2.4.21-20.ELsmp (RHEL 3), 4GB memory, pg 7.4.5:
>
> $ pgbench -i -s 32 pgbench
> $ pgbench -s 32 -c 1 -t 10000 -v
>
> clients   tps   avg CS/sec
> -------  -----  ----------
>      1    601      48,000
>      2    889      77,000
>      4   1006      80,000
>      8    985      59,000
>     16    966      47,000
>     32    913      46,000
>
> Far less performance than the dual Opterons with a low number of
> clients, but the gap narrows as the number of clients goes up. Anyone
> smarter than me care to explain?

I thought the falloff at 32 clients was a bit steep as well.   One
thought that crossed my mind is that "pgbench -s 32 -c 32 ..." might not
be valid.   From the pgbench README:

        -s scaling_factor
                this should be used with -i (initialize) option.
                number of tuples generated will be multiple of the
                scaling factor. For example, -s 100 will imply 10M
                (10,000,000) tuples in the accounts table.
                default is 1.  NOTE: scaling factor should be at least
                as large as the largest number of clients you intend
                to test; else you'll mostly be measuring update contention.
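
In other words, a sweep that stays valid holds the scale at or above the
largest client count, e.g.:

$ pgbench -i -s 64 pgbench
$ for c in 1 2 4 8 16 32; do pgbench -s 64 -c $c -t 10000 -v pgbench; done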

Another possible cause is that the pgbench process is cpu starved and
isn't able to keep driving the postgresql processes.   So I ran pgbench
from another system with all else the same.    The numbers were a bit
smaller but otherwise similar.


I then reran everything using -s 64:

clients   tps
1         1254
2         1645
4         1713
8         1548
16        1396
32        1060

Still starting to head down a bit.  In the 32 client case, the system
was ~60% user time, ~25% system and ~15% idle. Anyway, the machine is
clearly hitting some contention somewhere.   It could be in the tmpfs
code, VM system, etc.

-- Alan

Re: Excessive context switching on SMP Xeons

From: Bill Montgomery

Michael Adler wrote:

>On Thu, Oct 07, 2004 at 11:48:41AM -0400, Bill Montgomery wrote:
>
>
>>Alan Stange wrote:
>>
>>The same test on a Dell PowerEdge 1750, Dual Xeon 3.2 GHz, 512k cache,
>>HT on, Linux 2.4.21-20.ELsmp (RHEL 3), 4GB memory, pg 7.4.5:
>>
>>Far less performance than the dual Opterons with a low number of
>>clients, but the gap narrows as the number of clients goes up. Anyone
>>smarter than me care to explain?
>>
>>
>
>You'll have to wait for someone smarter than you, but I will posit
>this: Did you use a tmpfs filesystem like Alan? You didn't mention
>either way. Alan did that as an attempt to remove IO as a variable.
>
>-Mike
>
>

Yes, I should have been more explicit. My goal was to replicate his
experiment as closely as possible in my environment, so I did run my
postgres data directory on a tmpfs.

-Bill Montgomery

IBM P-series machines (was: Excessive context switching on SMP Xeons)

From: Andrew Sullivan

On Tue, Oct 05, 2004 at 09:47:36AM -0700, Josh Berkus wrote:
> As long as you're on x86, scaling outward is the way to go.   If you want to
> continue to scale upwards, ask Andrew Sullivan about his experiences running
> PostgreSQL on big IBM boxes.   But if you consider a quad-Opteron server
> expensive, I don't think that's an option for you.

Well, they're not that big, and both Chris Browne and Andrew Hammond
are at least as qualified to talk about this as I.  But since Josh
mentioned it, I'll put some anecdotal ramblings here just in case
anyone is interested.

We used to run our systems on Solaris 7, then 8, on Sun E4500s.  We
found the performance on those boxes surprisingly bad under certain
pathological loads.  I ultimately traced this to some remarkably poor
shared memory handling by Solaris: during relatively heavy load
(in particular, thousands of selects per second on the same set of
tuples) we'd see an incredible number of semaphore operations, and
concluded that the buffer handling was killing us.

I think we could have tuned this away, but for independent reasons we
decided to dump Sun gear (the hardware had become unreliable, and we
were not satisfied with the service we were getting).  We ended up
choosing IBM P650s as a replacement.

The 650s are not cheap, but boy are they fast.  I don't have any
numbers I can share, but I can tell you that we recently had a few
days in which our write load was as large as the entire write load
for last year, and you couldn't tell.  It is too early for us to say
whether the P series lives up to its billing in terms of reliability:
the real reason we use these machines is reliability, so if
approaching 100% uptime isn't important to you, the speed may not be
worth it.

We're also, for the record, doing experiments with Opterons.  So far,
we're impressed, and you can buy a lot of Opteron for the cost of one
P650.

A

--
Andrew Sullivan  | ajs@crankycanuck.ca
I remember when computers were frustrating because they *did* exactly what
you told them to.  That actually seems sort of quaint now.
        --J.D. Baldwin

Re: IBM P-series machines (was: Excessive context

From: Rod Taylor

On Mon, 2004-10-11 at 13:38, Andrew Sullivan wrote:
> On Tue, Oct 05, 2004 at 09:47:36AM -0700, Josh Berkus wrote:
> > As long as you're on x86, scaling outward is the way to go.   If you want to
> > continue to scale upwards, ask Andrew Sullivan about his experiences running
> > PostgreSQL on big IBM boxes.   But if you consider a quad-Opteron server
> > expensive, I don't think that's an option for you.


> The 650s are not cheap, but boy are they fast.  I don't have any
> numbers I can share, but I can tell you that we recently had a few
> days in which our write load was as large as the entire write load
> for last year, and you couldn't tell.  It is too early for us to say
> whether the P series lives up to its billing in terms of reliability:
> the real reason we use these machines is reliability, so if
> approaching 100% uptime isn't important to you, the speed may not be
> worth it.

Agreed completely, and the 570 knocks the 650 out of the water -- nearly
double the performance for math-heavy queries. Beware vendor support for
Linux on these things though -- we ran into many of the same issues with
vendor support on the IBM machines as we did with the Opterons.


Re: IBM P-series machines (was: Excessive context

From: Christopher Browne

pg@rbt.ca (Rod Taylor) wrote:
> On Mon, 2004-10-11 at 13:38, Andrew Sullivan wrote:
>> On Tue, Oct 05, 2004 at 09:47:36AM -0700, Josh Berkus wrote:
>> > As long as you're on x86, scaling outward is the way to go.  If
>> > you want to continue to scale upwards, ask Andrew Sullivan about
>> > his experiences running PostgreSQL on big IBM boxes.  But if you
>> > consider a quad-Opteron server expensive, I don't think that's
>> > an option for you.
>
>> The 650s are not cheap, but boy are they fast.  I don't have any
>> numbers I can share, but I can tell you that we recently had a few
>> days in which our write load was as large as the entire write load
>> for last year, and you couldn't tell.  It is too early for us to
>> say whether the P series lives up to its billing in terms of
>> reliability: the real reason we use these machines is reliability,
>> so if approaching 100% uptime isn't important to you, the speed may
>> not be worth it.
>
> Agreed completely, and the 570 knocks the 650 out of the water --
> nearly double the performance for math heavy queries. Beware vendor
> support for Linux on these things though -- we ran into many of the
> same issues with vendor support on the IBM machines as we did with
> the Opterons.

The 650s are running AIX, not Linux.

Based on the "Signal 11" issue, I'm not certain what would be the
'best' answer.  It appears that the problem relates to proprietary
bits of AIX libc.  In theory, that would have been more easily
resolvable with a source-available GLIBC.

On the other hand, I'm not sure what happens to support for any of the
interesting hardware extensions.  I'm not sure, for instance, that we
could run HACMP on Linux on this hardware.

As for "vendor support" for Opteron, that sure looks like a
trainwreck...  If you're going through IBM, then they won't want to
respond to any issues if you're not running a "bog-standard" RHAS/RHES
release from Red Hat.  And that, on Opteron, is preposterous, because
there's plenty of the bits of Opteron support that only ever got put
in Linux 2.6, whilst RHAT is still back in the 2.4 days.

In a way, that's just as well, at this point.  There's plenty of stuff
surrounding this that is still pretty experimental; the longer RHAT
waits to support 2.6, the greater the likelihood that Linux support
for Opteron will have settled down to the point that the result will
actually be supportable by RHAT, and by proxy, by IBM and others.

There is some risk that if RHAT waits _too_ long for 2.6, people will
have already jumped ship to SuSE.  No benefits without risks...
--
(reverse (concatenate 'string "gro.mca" "@" "enworbbc"))
http://www.ntlug.org/~cbbrowne/rdbms.html
If at first you don't succeed, then you didn't do it right!
If at first you don't succeed, then skydiving definitely isn't for you.

Re: IBM P-series machines

From: Matt Clark

>As for "vendor support" for Opteron, that sure looks like a
>trainwreck...  If you're going through IBM, then they won't want to
>respond to any issues if you're not running a "bog-standard" RHAS/RHES
>release from Red Hat.  And that, on Opteron, is preposterous, because
>there's plenty of the bits of Opteron support that only ever got put
>in Linux 2.6, whilst RHAT is still back in the 2.4 days.
>
>
>

To be fair, they have backported a boatload of 2.6 features to their kernel:
http://www.redhat.com/software/rhel/kernel26/

And that page certainly isn't an exhaustive list...

M

Opteron vs RHAT

From: Chris Browne

matt@ymogen.net (Matt Clark) writes:
>>As for "vendor support" for Opteron, that sure looks like a
>>trainwreck...  If you're going through IBM, then they won't want to
>>respond to any issues if you're not running a "bog-standard" RHAS/RHES
>>release from Red Hat.  And that, on Opteron, is preposterous, because
>>there's plenty of the bits of Opteron support that only ever got put
>>in Linux 2.6, whilst RHAT is still back in the 2.4 days.
>
> To be fair, they have backported a boatload of 2.6 features to their kernel:
> http://www.redhat.com/software/rhel/kernel26/
>
> And that page certainly isn't an exhaustive list...

To be fair, we keep on actually running into things that _can't_ be
backported, like fibrechannel drivers that were written to take
advantage of changes in the SCSI support in 2.6.

This sort of thing will be particularly problematic with Opteron,
where the porting efforts for AMD64 have taken place alongside the
creation of 2.6.
--
let name="cbbrowne" and tld="cbbrowne.com" in String.concat "@" [name;tld];;
http://www.ntlug.org/~cbbrowne/linuxxian.html
A VAX is virtually a computer, but not quite.

Re: Opteron vs RHAT

From: Matt Clark

> >>trainwreck...  If you're going through IBM, then they won't want to
> >>respond to any issues if you're not running a "bog-standard" RHAS/RHES
> >>release from Red Hat.
...
> To be fair, we keep on actually running into things that
> _can't_ be backported, like fibrechannel drivers that were
> written to take advantage of changes in the SCSI support in 2.6.

I thought IBM had good support for SUSE?  I don't know why I thought that...


Re: Excessive context switching on SMP Xeons

From: Dave Cramer

Bill,

In order to manifest the context switch problem you will definitely
require clients to be set to more than one in pgbench. It only occurs
when 2 or more backends need access to shared memory.
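
For example, something like

$ pgbench -s 20 -c 10 -t 1000 pgbench

exercises the contended path through shared memory, where -c 1 never will.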

If you want help backpatching Gavin's patch I'll be glad to do it for
you, but you do need a recent kernel.

Dave


--
Dave Cramer
519 939 0336
ICQ # 14675561
www.postgresintl.com