Thread: PostgreSQL and Xeon MP
Hello,

We have been experiencing performance problems with a quad Xeon MP and PostgreSQL 7.4 for a year now. Our context switch rate is not especially high, but the server's load average stays stuck at 4 even under very heavy load, and we still see 60% CPU idle in that situation. Our database fits in RAM and we don't have any I/O problem.

I saw this post from Tom Lane
http://archives.postgresql.org/pgsql-performance/2004-04/msg00249.php
and several other references to problems with the Xeon MP, and I suspect our problems are related to this. On Monday we tried putting our production load on a standard dual Xeon, and it performed far better with the same configuration parameters.

I know that Tom has done work on multiprocessor support for PostgreSQL 8.1, but I couldn't find any information on whether it solves the problem with the Xeon MP.

My question is: should we expect switching to 8.1 to resolve our problem, or will we still have problems and need to consider a hardware change? We will try to upgrade next Tuesday, so we will have the real answer soon, but if anyone has any experience or information on this, it would be very welcome.

Thanks for your help.

--
Guillaume
Guillaume Smet wrote:
> Hello,
>
> We have been experiencing performance problems with a quad Xeon MP and
> PostgreSQL 7.4 for a year now.

I had a similar issue with a client the other week.

> Our context switch rate is not especially high, but the server's load
> average stays stuck at 4 even under very heavy load, and we still see
> 60% CPU idle in that situation. Our database fits in RAM and we don't
> have any I/O problem.

Actually, I think that's part of the problem - it's the memory bandwidth.

> I saw this post from Tom Lane
> http://archives.postgresql.org/pgsql-performance/2004-04/msg00249.php
> and several other references to problems with the Xeon MP, and I
> suspect our problems are related to this.

You should be seeing context switching jump dramatically if it's the "classic" multi-Xeon problem. There's a point at which it seems to just escalate without a corresponding jump in activity.

> On Monday we tried putting our production load on a standard dual
> Xeon, and it performed far better with the same configuration
> parameters.
>
> I know that Tom has done work on multiprocessor support for
> PostgreSQL 8.1, but I couldn't find any information on whether it
> solves the problem with the Xeon MP.

I checked with Tom last week. The thread starts here:
http://archives.postgresql.org/pgsql-hackers/2006-02/msg01118.php

He's of the opinion that 8.1.3 will be an improvement.

> My question is: should we expect switching to 8.1 to resolve our
> problem, or will we still have problems and need to consider a
> hardware change? We will try to upgrade next Tuesday, so we will have
> the real answer soon, but if anyone has any experience or information
> on this, it would be very welcome.

--
Richard Huxton
Archonet Ltd
Richard,

> You should be seeing context switching jump dramatically if it's the
> "classic" multi-Xeon problem. There's a point at which it seems to just
> escalate without a corresponding jump in activity.

No, we don't see very high context switching in our case, even when the database is very slow. By very slow, I mean that pages which load in a few seconds in the normal case (load between 3 and 4) take several minutes (up to 5-10 minutes) to be generated in the worst case (load at 4 but really bad performance).

Looking at our CPU load graph, the load has never been higher than 5 in one year, even in the worst cases...

> I checked with Tom last week. The thread starts here:
> http://archives.postgresql.org/pgsql-hackers/2006-02/msg01118.php
>
> He's of the opinion that 8.1.3 will be an improvement.

Thanks for pointing me to this thread. I had searched -performance but not -hackers, since the original thread was in -performance. We have planned a migration to 8.1.3, so we'll see what happens with this version.

Do you plan to test it before next Tuesday? If so, I'm interested in your results. I'll post our results here as soon as we complete the upgrade.

--
Guillaume
Hi Guillaume,

I had a similar issue last summer. Could you please provide details about your Xeon MP server and some statistics (context switches/load/CPU usage)?

I tried different x86 servers, with different results. I saw a difference between Xeon MP with and without EM64T. The memory bandwidth also makes a difference.

What version of Xeon MP does your server have?
Which type of RAM does your server have?
Do you use Hyperthreading?

Could you also provide details for the Xeon DP?

Regards
Sven.

--
Sven Geisler <sgeisler@aeccom.com>
Senior Developer, AEC/communications GmbH
Berlin, Germany
Guillaume Smet wrote:
> No, we don't see very high context switching in our case, even when
> the database is very slow. By very slow, I mean that pages which load
> in a few seconds in the normal case (load between 3 and 4) take
> several minutes (up to 5-10 minutes) to be generated in the worst case
> (load at 4 but really bad performance).

Very strange.

> Do you plan to test it before next Tuesday? If so, I'm interested in
> your results. I'll post our results here as soon as we complete the
> upgrade.

The client has just bought an Opteron to run on, I'm afraid. I might try 8.1 on the Xeon, but it'll just be to see what happens, and that won't be for a while.

--
Richard Huxton
Archonet Ltd
On 3/16/06, Richard Huxton <dev@archonet.com> wrote:
> Very strange.

Sure. I can't find any logical explanation for it, but this is the behaviour we have had for more than a year now (the site was migrated from Oracle to PostgreSQL in January 2005). We have checked iostat, vmstat and so on without finding any hint as to why we see this behaviour.

> The client has just bought an Opteron to run on, I'm afraid. I might try
> 8.1 on the Xeon but it'll just be to see what happens and that won't be
> for a while.

I don't think that will be an option for us, so I will have more information next week.
Sven,

On 3/16/06, Sven Geisler <sgeisler@aeccom.com> wrote:
> What version of Xeon MP does your server have?

The server is a Dell 6650 from the end of 2004 with four Xeon MP 2.2 GHz processors and 2 MB of cache per processor.

Here is the information from Dell:
4x PROCESSOR, 80532, 2.2GHZ, 2MB cache, 400Mhz, SOCKET F
8x DUAL IN-LINE MEMORY MODULE, 512MB, 266MHz

> Do you use Hyperthreading?

No, we don't use it.

> Could you also provide details for the Xeon DP?

The only problem is that the Xeon DP is installed with a 2.6 kernel and PostgreSQL 8.1.3 (it is used to test the migration from 7.4 to 8.1.3), so it's very difficult to really compare the two behaviours.

It's a Dell 2850 with:
2 x PROCESSOR, 80546K, 2.8G, 1MB cache, XEON NOCONA, 800MHz
4 x DUAL IN-LINE MEMORY MODULE, 1GB, 400MHz

This server is obviously newer than the other one.

--
Guillaume
On 3/16/06, Sven Geisler <sgeisler@aeccom.com> wrote:
> Hi Guillaume,
>
> I had a similar issue last summer. Could you please provide details
> about your Xeon MP server and some statistics (context switches/load/CPU
> usage)?

I forgot the statistics:
The load average is usually between 1 and 4.
CPU usage is usually below 40% on each processor, and sometimes, when the server completely hangs, it grows to 60%.

Here is top output from the server right now:
 15:21:17  up 138 days, 13:25,  1 user,  load average: 1.29, 1.25, 1.38
82 processes: 81 sleeping, 1 running, 0 zombie, 0 stopped
CPU states:  cpu    user    nice  system    irq  softirq  iowait    idle
           total   25.7%    0.0%    3.9%   0.0%     0.3%    0.1%   69.7%
           cpu00   29.3%    0.0%    4.7%   0.1%     0.5%    0.0%   65.0%
           cpu01   20.7%    0.0%    1.9%   0.0%     0.3%    0.0%   76.8%
           cpu02   25.5%    0.0%    5.5%   0.0%     0.1%    0.3%   68.2%
           cpu03   27.3%    0.0%    3.3%   0.0%     0.1%    0.1%   68.8%
Mem:  3857224k av, 3298580k used,  558644k free,       0k shrd,  105172k buff
                   2160124k actv,  701304k in_d,   56400k in_c
Swap: 4281272k av,    6488k used, 4274784k free                 2839348k cached

We currently have between 3000 and 13000 context switches/s, with an average of about 5000 judging by eye.

Here is top output I captured on November 17, when the server completely hung (several minutes for each page of the website). It is typical of this server's behaviour:
 17:08:41  up 19 days, 15:16,  1 user,  load average: 4.03, 4.26, 4.36
288 processes: 285 sleeping, 3 running, 0 zombie, 0 stopped
CPU states:  cpu    user    nice  system    irq  softirq  iowait    idle
           total   59.0%    0.0%    8.8%   0.2%     0.0%    0.0%   31.9%
           cpu00   52.3%    0.0%   13.3%   0.9%     0.0%    0.0%   33.3%
           cpu01   65.7%    0.0%    7.6%   0.0%     0.0%    0.0%   26.6%
           cpu02   58.0%    0.0%    7.6%   0.0%     0.0%    0.0%   34.2%
           cpu03   60.0%    0.0%    6.6%   0.0%     0.0%    0.0%   33.3%
Mem:  3857224k av, 3495880k used,  361344k free,       0k shrd,   92160k buff
                   2374048k actv,  463576k in_d,   37708k in_c
Swap: 4281272k av,   25412k used, 4255860k free                 2173392k cached

As you can see, the load is stuck at 4, with no iowait and about 30% CPU idle. vmstat showed 5000 context switches/s on average, so we had no context switch storm.
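For anyone who wants to check the idle and iowait figures without eyeballing the columns, here is a minimal sketch that parses the per-CPU rows of the November 17 top output above and averages a column. The helper names (`parse_cpu_rows`, `mean_of`) are hypothetical and not part of any tool mentioned in this thread, and the column order is assumed to be the one shown in the "CPU states" header.

```python
import re

# Per-CPU rows copied from the November 17 top output above.
# Assumed column order, taken from the "CPU states" header line.
COLUMNS = ("user", "nice", "system", "irq", "softirq", "iowait", "idle")

TOP_ROWS = """\
cpu00 52.3% 0.0% 13.3% 0.9% 0.0% 0.0% 33.3%
cpu01 65.7% 0.0% 7.6% 0.0% 0.0% 0.0% 26.6%
cpu02 58.0% 0.0% 7.6% 0.0% 0.0% 0.0% 34.2%
cpu03 60.0% 0.0% 6.6% 0.0% 0.0% 0.0% 33.3%
"""


def parse_cpu_rows(text):
    """Return {cpu_name: {column: percent}} for each 'cpuNN ...' row.

    Hypothetical helper: rows that do not look like 'cpuNN' followed by
    one value per column are skipped.
    """
    rows = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != len(COLUMNS) + 1:
            continue
        if not re.fullmatch(r"cpu\d+", fields[0]):
            continue
        values = [float(value.rstrip("%")) for value in fields[1:]]
        rows[fields[0]] = dict(zip(COLUMNS, values))
    return rows


def mean_of(rows, column):
    """Average one column over all parsed CPUs."""
    return sum(row[column] for row in rows.values()) / len(rows)


if __name__ == "__main__":
    rows = parse_cpu_rows(TOP_ROWS)
    print(f"CPUs parsed: {len(rows)}")
    print(f"mean idle:   {mean_of(rows, 'idle'):.2f}%")
    print(f"mean iowait: {mean_of(rows, 'iowait'):.2f}%")
```

The point of the check is only that roughly a third of each processor sits idle while iowait is zero, so the machine is neither CPU-bound nor disk-bound at a load of 4.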
"Guillaume Smet" <guillaume.smet@gmail.com> writes: > Here is a top output I had on november 17 when the server completely > hangs (several minutes for each page of the website) and it is typical > of this server behaviour: > 17:08:41 up 19 days, 15:16, 1 user, load average: 4.03, 4.26, 4.36 > 288 processes: 285 sleeping, 3 running, 0 zombie, 0 stopped > CPU states: cpu user nice system irq softirq iowait idle > total 59.0% 0.0% 8.8% 0.2% 0.0% 0.0% 31.9% > cpu00 52.3% 0.0% 13.3% 0.9% 0.0% 0.0% 33.3% > cpu01 65.7% 0.0% 7.6% 0.0% 0.0% 0.0% 26.6% > cpu02 58.0% 0.0% 7.6% 0.0% 0.0% 0.0% 34.2% > cpu03 60.0% 0.0% 6.6% 0.0% 0.0% 0.0% 33.3% > Mem: 3857224k av, 3495880k used, 361344k free, 0k shrd, 92160k buff > 2374048k actv, 463576k in_d, 37708k in_c > Swap: 4281272k av, 25412k used, 4255860k free 2173392k cached > As you can see, load is blocked to 4, no iowait and cpu idle of 30%. Can you try strace'ing some of the backend processes while the system is behaving like this? I suspect what you'll find is a whole lot of delaying select() calls due to high contention for spinlocks ... regards, tom lane
Hi Guillaume,

Guillaume Smet wrote:
> The server is a Dell 6650 from the end of 2004 with four Xeon MP
> 2.2 GHz processors and 2 MB of cache per processor.
>
> Here is the information from Dell:
> 4x PROCESSOR, 80532, 2.2GHZ, 2MB cache, 400Mhz, SOCKET F
> 8x DUAL IN-LINE MEMORY MODULE, 512MB, 266MHz
> ....
>
>> Could you also provide details for the Xeon DP?
>
> The only problem is that the Xeon DP is installed with a 2.6 kernel
> and PostgreSQL 8.1.3 (it is used to test the migration from 7.4 to
> 8.1.3), so it's very difficult to really compare the two behaviours.
>
> It's a Dell 2850 with:
> 2 x PROCESSOR, 80546K, 2.8G, 1MB cache, XEON NOCONA, 800MHz
> 4 x DUAL IN-LINE MEMORY MODULE, 1GB, 400MHz

Did you compare 7.4 on a 4-way with 8.1 on a 2-way?
How many queries and clients did you use to test the performance?
How much faster is the Xeon DP?

I think you can expect your Xeon DP to be faster on a single query, because its CPU and RAM are faster. The overall performance can also be better on your Xeon DP if you only have a few clients.

I guess the newer hardware and the newer PostgreSQL version together account for the better performance.

Regards
Sven.
On 3/16/06, Sven Geisler <sgeisler@aeccom.com> wrote:
> Did you compare 7.4 on a 4-way with 8.1 on a 2-way?

I know there are too many parameters changing between the two servers, but I can't really change anything before Tuesday. On Tuesday, we will be able to compare both servers with the same software.

> How many queries and clients did you use to test the performance?

Googlebot is indexing this site and generating 2-3 Mbit/s of traffic, so we used Googlebot to stress the server. There were a lot of clients and a lot of queries.

> How much faster is the Xeon DP?

Well, under high load, PostgreSQL scales well on the DP (load at 40, queries slower but still performing well) and is awfully slow on the MP box.
On 3/16/06, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Can you try strace'ing some of the backend processes while the system is
> behaving like this? I suspect what you'll find is a whole lot of
> delaying select() calls due to high contention for spinlocks ...

Tom,

I think we can try to do that. You mean strace -p pid, with pid being one of the postgres backend processes and not the postmaster itself, don't you? Do we need other options? Which pattern should we expect? I'm not really familiar with strace and its output.

Thanks for your help.

"Guillaume Smet" <guillaume.smet@gmail.com> writes:
> You mean strace -p pid, with pid being one of the postgres backend
> processes and not the postmaster itself, don't you?

Right, pick a couple that are accumulating CPU time.

> Do we need other options?

strace will generate a *whole lot* of output to stderr. I usually do something like
	strace -p pid 2>outfile
and then control-C it after a few seconds.

> Which pattern should we expect?

What we want to find out is if there's a lot of select()s and/or semop()s shown in the result. Ideally there wouldn't be any, but I fear that's not what you'll find.

			regards, tom lane
Hi Guillaume,

Guillaume Smet wrote:
>> How much faster is the Xeon DP?
>
> Well, under high load, PostgreSQL scales well on the DP (load at 40,
> queries slower but still performing well) and is awfully slow on the
> MP box.

I know what you mean by awfully slow. I think your application is facing contention, and the contention gets worse the more CPUs you have. PostgreSQL 8.1 addresses contention on multiprocessor servers, as you mentioned before.

I guess you will see that your 4-way Xeon MP isn't that bad once you compare both servers with the same PostgreSQL version.

Regards
Sven.
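The idea that contention grows with the number of processors, combined with Tom's suspicion of delaying select() calls behind contended spinlocks, can be illustrated with a toy model. The sketch below is not PostgreSQL's code: Python threads stand in for backends, `time.sleep` stands in for the select()-based delay Tom mentions, and the constants and function names are assumptions chosen purely for illustration.

```python
import threading
import time

SPINS_BEFORE_DELAY = 100  # assumption: illustrative value only
DELAY_SECONDS = 0.005     # assumption: illustrative value only
HOLD_SECONDS = 0.001      # assumption: stand-in for work done under the lock


def acquire_with_backoff(lock):
    """Spin on a non-blocking acquire; sleep after each failed burst.

    Returns how many times the caller had to sleep before it got the lock.
    Each sleep models one "delaying" call that a trace would show.
    """
    delays = 0
    while True:
        for _ in range(SPINS_BEFORE_DELAY):
            if lock.acquire(blocking=False):
                return delays
        delays += 1
        time.sleep(DELAY_SECONDS)


def worker(lock, iterations, results, index):
    """Take and release the lock repeatedly, recording the delays seen."""
    total = 0
    for _ in range(iterations):
        total += acquire_with_backoff(lock)
        try:
            time.sleep(HOLD_SECONDS)
        finally:
            lock.release()
    results[index] = total


def run(n_workers, iterations=10):
    """Run n_workers threads against one lock; return the total delay count."""
    lock = threading.Lock()
    results = [0] * n_workers
    threads = [
        threading.Thread(target=worker, args=(lock, iterations, results, i))
        for i in range(n_workers)
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return sum(results)


if __name__ == "__main__":
    for n in (1, 2, 4):
        print(f"{n} worker(s): {run(n)} delays")
```

A single worker never has to sleep, because nobody else holds the lock. With several workers, each one that loses the race spends its time sleeping rather than working, which is one way a machine can look idle and slow at the same time. The exact counts vary from run to run.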
On 3/16/06, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> What we want to find out is if there's a lot of select()s and/or
> semop()s shown in the result. Ideally there wouldn't be any, but
> I fear that's not what you'll find.

OK, I'll try to do it on Monday before our upgrade, and then see what happens with PostgreSQL 8.1.3.

Thanks for your help.
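For readers not familiar with strace output, a small script can tally the syscall names in the captured file and report how many of them are select() or semop(). This is a minimal sketch, assuming the capture was made with `strace -p pid 2>outfile` as Tom suggested and that each syscall line has strace's usual `name(args) = result` form. The function names are hypothetical, and lines that do not match that form (signals, exit messages, resumed calls) are simply skipped.

```python
import re
import sys
from collections import Counter

# Syscall name at the start of a line, optionally preceded by "[pid N]".
SYSCALL_RE = re.compile(r"^(?:\[pid\s+\d+\]\s+)?([A-Za-z_][A-Za-z0-9_]*)\(")

# The two calls Tom asked about.
WATCHED = ("select", "semop")


def count_syscalls(lines):
    """Count syscall names over an iterable of strace output lines."""
    counts = Counter()
    for line in lines:
        match = SYSCALL_RE.match(line)
        if match:
            counts[match.group(1)] += 1
    return counts


def summarize(counts):
    """Format the total and the share taken by each watched syscall."""
    total = sum(counts.values())
    report = [f"total syscalls: {total}"]
    for name in WATCHED:
        n = counts.get(name, 0)
        share = (100.0 * n / total) if total else 0.0
        report.append(f"{name}: {n} ({share:.1f}%)")
    return "\n".join(report)


# Usage: python count_syscalls.py outfile
if __name__ == "__main__" and len(sys.argv) > 1:
    with open(sys.argv[1], errors="replace") as handle:
        print(summarize(count_syscalls(handle)))
```

A backend that is mostly doing useful work should show few or none of either call, so a large share of select() or semop() lines is the pattern to look for.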
On Thu, Mar 16, 2006 at 11:45:12AM +0100, Guillaume Smet wrote:
> We have been experiencing performance problems with a quad Xeon MP and
> PostgreSQL 7.4 for a year now.
> [...]
> My question is: should we expect switching to 8.1 to resolve our
> problem, or will we still have problems and need to consider a
> hardware change?

Guillaume,

We had a similar problem with poor performance on a Xeon DP and PostgreSQL 7.4.x. 8.0 came out in time for preliminary testing, but it did not solve the problem, and our production systems went live using a different database product. We are currently testing against 8.1.x, and the seemingly bizarre lack of performance is gone. I would suspect that a quad-processor box would have the same issue. I would definitely recommend giving 8.1 a try.

Ken
On 3/16/06, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Can you try strace'ing some of the backend processes while the system is
> behaving like this? I suspect what you'll find is a whole lot of
> delaying select() calls due to high contention for spinlocks ...

As announced, we migrated our production server from 7.4.8 to 8.1.3 this morning. We did some strace'ing before the migration, and you were right about the select() calls. We had a lot of them even when the database was not highly loaded (one every 3-4 lines).

After the upgrade, we see the expected behaviour, with more linear scalability and a CPU load that grows when the database is highly loaded (and no more idle CPU in that case). We have fewer context switches too.

8.1.3 is definitely far better on a quad Xeon MP, and I recommend the upgrade to everyone having this sort of problem.

Tom, thanks for your great work on this problem.

--
Guillaume