Thread: hardware upgrade, performance degrade?

hardware upgrade, performance degrade?

From: Steven Crandell
Recently I moved my ~600GB / ~15K TPS database from a
48-core @ 2.0GHz server with 512GB RAM on 15K RPM disks
to a newer server with
64 cores @ 2.2GHz and 1TB of RAM on 15K RPM disks.

The move was from v9.1.4 to v9.1.8 (eventually also tested with v9.1.4 on the new hardware) and was done via base backup followed by slave promotion.
All postgres configurations were matched exactly, as were system and kernel parameters.

On the first day that this server saw production load levels, it absolutely fell on its face. We ran an exhaustive battery of tests, including failing over to the new (hardware-matched) slave, only to find the problem happening there also.

After several engineers all confirmed that every postgres and system setting matched, we eventually migrated back onto the original hardware using exactly the same methods and settings that had been used while the data was on the new hardware.  As soon as we brought the DB live on the older (supposedly slower) hardware, everything started running smoothly again. 

As far as we were able to gather in the frantic moments of downtime, hundreds of queries were hanging up while trying to COMMIT. This in turn caused new queries to back up as they waited for locks, and so on.

Prior to failing back to the original hardware, we found interesting posts about people having problems similar to ours due to NUMA, and several suggested that they had solved their problem by setting vm.zone_reclaim_mode = 0.
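For reference, the knob in question is checked and set like this (standard sysctl interface, not our exact commands):

```shell
# Show the current value; 0 means zone reclaim is off,
# non-zero means the kernel reclaims pages in-node before going off-node.
cat /proc/sys/vm/zone_reclaim_mode

# Turn it off at runtime (as root).
sysctl -w vm.zone_reclaim_mode=0

# Persist the setting across reboots.
echo 'vm.zone_reclaim_mode = 0' >> /etc/sysctl.conf
```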

Unfortunately we experienced the exact same problems even after turning off zone_reclaim_mode. We did extensive testing of the I/O on the new hardware (both data and log arrays) before it was put into service, and have done even more comprehensive testing since it came out of service. The short version is that the disks on the new hardware are faster than the disks on the old server. In one test run we even set the server to write WAL to shared memory instead of to the log LV, just to help rule out I/O problems, and saw only a marginal improvement in overall TPS numbers.
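For anyone wanting to repeat that WAL-in-memory experiment, one way to stage it looks roughly like this (the mount point, size, and data directory are illustrative, not our actual setup; a crash loses the WAL, so this is only safe on a throwaway test box):

```shell
# Stop postgres first. Then back pg_xlog with a RAM filesystem.
mount -t tmpfs -o size=8G tmpfs /mnt/wal_ram
mv /var/lib/pgsql/9.1/data/pg_xlog /mnt/wal_ram/pg_xlog
ln -s /mnt/wal_ram/pg_xlog /var/lib/pgsql/9.1/data/pg_xlog
# Restart postgres and re-run the benchmark; if TPS barely moves,
# WAL write latency is probably not the bottleneck.
```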

At this point we are extremely confident that if we have a configuration problem, it is not with any of the usual postgresql.conf/sysctl.conf suspects. We are pretty sure that the problem is being caused by the hardware in some way, but that it is not the result of a hardware failure (e.g. degraded array, RAID card self-tests, or what have you).

Given that we're dealing with new hardware and the fact that this still acts a lot like a NUMA issue, are there other settings we should be adjusting to deal with possible performance problems associated with NUMA?

Does this sound like something else entirely?

Any thoughts appreciated.

thanks,
Steve

Re: hardware upgrade, performance degrade?

From: Guillaume Smet
Hi Steven,

On Fri, Mar 1, 2013 at 10:52 AM, Steven Crandell
<steven.crandell@gmail.com> wrote:
> Given that we're dealing with new hardware and the fact that this still acts
> a lot like a NUMA issue, are there other settings we should be adjusting to
> deal with possible performance problems associated with NUMA?

On my kernel related thread list, I have:
http://www.postgresql.org/message-id/CAOR=d=2tjWoxpQrUHpJK1R+BtEvBv4buiVtX4Qf6we=MHmghxw@mail.gmail.com
(you haven't mentioned if the kernel version is identical for both
servers)
http://www.postgresql.org/message-id/50E4AAB1.9040902@optionshouse.com
http://www.postgresql.org/message-id/508ACF61.1000602@optionshouse.com
http://www.postgresql.org/message-id/CAMAsR=5F45+kj+hw9q+zE7zo=Qc0yBEB1sLXCF0QL+dWt_7KqQ@mail.gmail.com

Might be of some interest to you.

--
Guillaume


Re: hardware upgrade, performance degrade?

From: Craig James
On Fri, Mar 1, 2013 at 1:52 AM, Steven Crandell <steven.crandell@gmail.com> wrote:
> Recently I moved my ~600G / ~15K TPS database from a 48 core@2.0GHz server
> with 512GB RAM on 15K RPM disk to a newer server with 64 core@2.2Ghz and
> 1T of RAM on 15K RPM disks
> [...]

One piece of information that you didn't supply ... sorry if this is obvious, but did you run the usual range of performance tests, using pgbench, bonnie++ and so forth, to confirm that the new server was working well before you put it into production? Did it compare well to your old hardware on those same tests?

Craig



Re: hardware upgrade, performance degrade?

From: Steven Crandell
We saw the same performance problems when this new hardware was running CentOS 6.3 with a 2.6.32-279.19.1.el6.x86_64 kernel and when it was matched to the OS/kernel of the old hardware, which was CentOS 5.8 with a 2.6.18-308.11.1.el5 kernel.

Yes, the new hardware was thoroughly tested with bonnie++ before being put into service, and has been tested since. We are unable to find any interesting differences in our bonnie++ test comparisons between the old and new hardware. pgbench was not used prior to our discovery of the problem, but has been used extensively since. FWIW, this server ran a Zabbix database (much lower load requirements) for a month without any problems prior to taking over as our primary production DB server.

After quite a bit of trial and error we were able to find a pgbench test (2x 300 concurrent client sessions doing selects, along with 1x 50 concurrent client sessions doing the standard pgbench query rotation) that showed the new hardware underperforming the old, to the tune of about a 1000 TPS difference (2300 vs 1300) for the 50-concurrent-user run, and about 1000 fewer TPS for each of the select-only runs (~24000 vs ~23000). Less demanding tests were handled equally well by both old and new servers. More demanding tests tipped both old and new over with very similar efficacy.
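Very roughly, the mix described above could be reproduced with something like this (database name, scale factor, and thread counts are my assumptions, not the original commands):

```shell
# Initialize a test database once (scale factor is a guess).
pgbench -i -s 1000 bench

# Two blocks of 300 select-only clients (-S) plus one block of 50 clients
# running the default TPC-B-like transaction, all for the same interval.
pgbench -c 300 -j 8 -S -T 600 bench &
pgbench -c 300 -j 8 -S -T 600 bench &
pgbench -c 50  -j 4    -T 600 bench
wait
```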

Hopefully that fleshes things out a bit more.
Please let me know if I can provide additional information.

thanks
steve


On Fri, Mar 1, 2013 at 8:41 AM, Craig James <cjames@emolecules.com> wrote:
> [...]


Re: hardware upgrade, performance degrade?

From: Scott Marlowe
On Fri, Mar 1, 2013 at 9:49 AM, Steven Crandell
<steven.crandell@gmail.com> wrote:
> [...]

OK, I'd recommend testing with various numbers of clients and seeing
what kind of shape you get from the curve when you plot it. I.e., does
it fall off really hard at some number, etc.? If the old server
degrades more gracefully under very heavy load, it may be that you're
just admitting too many connections for the new one and not hitting
its sweet spot.

FWIW, the newest Intel 10-core Xeons and their cousins just barely
keep up with or beat the 8- or 12-core AMD Opterons from three years
ago in most of my testing. They look great on paper, but under heavy
load they are lucky to keep up most of the time.

There's also the possibility that even though you've turned off zone
reclaim, your new hardware is still running in a NUMA mode that makes
inter-node communication much more expensive, and that's what is
costing you. This may especially be true with 1TB of memory: it's
both running at a lower speed AND inter-node connection costs are much
higher. Use the numactl command to see what the inter-node costs are,
and compare them to the old hardware. If the inter-node comm costs are
really high, see if you can turn off NUMA in the BIOS and whether it
gets somewhat better.

Of course, check the usual: that your battery-backed cache is really
working in write-back, not write-through, etc.
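On LSI-based controllers such as these PERC cards, one way to check the cache mode is with MegaCli, if it is installed (a hedged sketch; the binary name and availability vary by system, and these are not commands from the thread):

```shell
# Dump logical-drive properties and look for "Current Cache Policy: WriteBack".
# If it reports WriteThrough, the BBU may be missing, dead, or still charging.
MegaCli64 -LDInfo -Lall -aALL | grep -i 'cache policy'

# Query the battery backup unit state directly.
MegaCli64 -AdpBbuCmd -GetBbuStatus -aALL
```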

Good luck.  Acceptance testing can really suck when newer, supposedly
faster hardware is in fact slower.


Re: hardware upgrade, performance degrade?

From: Jesper Krogh
On 01/03/2013, at 10.52, Steven Crandell <steven.crandell@gmail.com> wrote:

> [...]

My guess is that you have gone down in clock frequency on memory when you doubled the amount of memory.

In a mainly memory-cached database, performance is extremely sensitive to memory speed.

Jesper



Re: hardware upgrade, performance degrade?

From: Josh Berkus
Steven,

> We saw the same performance problems when this new hardware was running
> cent 6.3 with a 2.6.32-279.19.1.el6.x86_64 kernel and when it was matched
> to the OS/kernel of the old hardware which was cent 5.8 with
> a 2.6.18-308.11.1.el5 kernel.

Oh, now that's interesting. We've been seeing the same issue (I/O stalls
on COMMIT) and had attributed it to some bugs in the 3.2 and 3.4
kernels, partly because we don't have a credible "old server" to test
against. Now you have me wondering if there's not a hardware or driver
issue with a major HW manufacturer which just happens to be hitting
around now.

Can you detail your hardware stack so that we can compare notes?


--
Josh Berkus
PostgreSQL Experts Inc.
http://pgexperts.com


Re: hardware upgrade, performance degrade?

From: Jean-David Beyer
On 03/03/2013 03:16 PM, Josh Berkus wrote:
> [...]
The current Red Hat Enterprise Linux 6.4 kernel is
kernel-2.6.32-358.0.1.el6.x86_64

in case that matters.


Re: hardware upgrade, performance degrade?

From: Steven Crandell
Here's our hardware break down.

The logvg on the new hardware is 30 MB/s slower (170 MB/s vs 200 MB/s) than the logvg on the older hardware, which was an immediately interesting difference, but we have yet to create a test scenario that successfully implicates this slower log speed in our problems. That is something we are actively working on.


Old server hardware:
        Manufacturer: Dell Inc.
        Product Name: PowerEdge R810
        4x Intel(R) Xeon(R) CPU E7540 @ 2.00GHz
        32x 16384 MB 1066 MHz DDR3
        Controller 0: PERC H700 - 2 disk RAID-1 278.88 GB rootvg
        Controller 1: PERC H800 - 18 disk RAID-6 2,178.00 GB datavg, 4 disk RAID-10 272.25 GB logvg, 2 hot spares
        2x 278.88 GB 15K SAS on controller 0
        24x 136.13 GB 15K SAS on controller 1

New server hardware:
        Manufacturer: Dell Inc.
        Product Name: PowerEdge R820
        4x Intel(R) Xeon(R) CPU E5-4620 0 @ 2.20GHz
        32x 32 GB 1333 MHz DDR3
        Controller 0: PERC H710P - 4 disk RAID-6 557.75 GB rootvg
        Controller 1: PERC H810 - 20 disk RAID-60 4,462.00 GB datavg, 2 disk RAID-1 278.88 GB logvg, 2 hot spares
        28x 278.88 GB 15K SAS drives total.


On Sun, Mar 3, 2013 at 1:34 PM, Jean-David Beyer <jeandavid8@verizon.net> wrote:
> [...]


--
Sent via pgsql-performance mailing list (pgsql-performance@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance

Re: hardware upgrade, performance degrade?

From: Mark Kirkwood
On 05/03/13 11:54, Steven Crandell wrote:
> [...]

Right - it is probably worth running 'pg_test_fsync' on the two logvgs
and comparing the results. This will tell you whether the commit
latency is similar on the two disk systems.
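A minimal sketch of that invocation (the probe paths are examples; the test file must live on the volume being measured, and in 9.1 the tool ships in contrib):

```shell
# Run the fsync-method comparison against a file on each log volume, then
# compare the reported ops/sec line by line between the two servers.
pg_test_fsync -f /pg_log_lv/fsync_probe
pg_test_fsync -f /pg_data_lv/fsync_probe
```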

One other difference that springs immediately to mind is that datavg is
an 18-disk RAID-6 on the old system and a 20-disk RAID-60 on the new
one... so you have about half the I/O performance right there.

Cheers

Mark


Re: hardware upgrade, performance degrade?

From: Sergey Konoplev
On Fri, Mar 1, 2013 at 1:52 AM, Steven Crandell
<steven.crandell@gmail.com> wrote:
> As far as we were able to gather in the frantic moments of downtime,
> hundreds of queries were hanging up while trying to COMMIT.  This in turn
> caused new queries backup as they waited for locks and so on.
>
> Given that we're dealing with new hardware and the fact that this still acts
> a lot like a NUMA issue, are there other settings we should be adjusting to
> deal with possible performance problems associated with NUMA?
>
> Does this sound like something else entirely?

It does. I collected a number of kernel (and not only kernel) tuning
issues, with short explanations, to help prevent them from affecting
database behavior badly. Try following them:

https://code.google.com/p/pgcookbook/wiki/Database_Server_Configuration

--
Sergey Konoplev
Database and Software Architect
http://www.linkedin.com/in/grayhemp

Phones:
USA +1 415 867 9984
Russia, Moscow +7 901 903 0499
Russia, Krasnodar +7 988 888 1979

Skype: gray-hemp
Jabber: gray.ru@gmail.com


Re: hardware upgrade, performance degrade?

From: John Rouillard
On Mon, Mar 04, 2013 at 03:54:40PM -0700, Steven Crandell wrote:
> [...]

Hmm, you went from a striped (RAID-10) log volume on the old hardware
to a non-striped (RAID-1) volume on the new hardware. That could
explain the speed drop. Are the disks the same speed on the two
systems?

--
                -- rouilj

John Rouillard       System Administrator
Renesys Corporation  603-244-9084 (cell)  603-643-9300 x 111


Re: hardware upgrade, performance degrade?

From: Scott Marlowe
On Mon, Mar 4, 2013 at 4:17 PM, John Rouillard <rouilj@renesys.com> wrote:
>> [...]
>
> Hmm, you went from a striped (raid 1/0) log volume on the old hardware
> to a non-striped (raid 1) volume on the new hardware. That could
> explain the speed drop. Are the disks the same speed for the two
> systems?

Yeah that's a terrible tradeoff there.  Just throw 4 disks in a
RAID-10 instead of RAID-60. With 4 disks you'll get the same storage
and much better performance from RAID-10.

Also consider using larger drives and a RAID-10 for your big drive
array.  RAID-6 or RAID-60 is notoriously slow for databases,
especially for random access.


Re: hardware upgrade, performance degrade?

From: Steven Crandell
Mark,
I ran pg_test_fsync on the log and data LVs on both old and new hardware.

The new hardware outperformed the old on every measure for the log LV.

Same for the data LVs, except for the 16kB open_sync write, where the old hardware edged out the new by a hair (18649 vs 17999 ops/sec), and write, fsync, close, where they were effectively tied.

On Mon, Mar 4, 2013 at 4:17 PM, John Rouillard <rouilj@renesys.com> wrote:
> [...]

Re: hardware upgrade, performance degrade?

From: Scott Marlowe
I'd be more interested in the random results from bonnie++, but my
real-world experience tells me that for heavily parallel writes and the
like, a RAID-10 will stomp a RAID-6 or RAID-60 on the same number of
drives.

On Mon, Mar 4, 2013 at 4:47 PM, Steven Crandell
<steven.crandell@gmail.com> wrote:
> [...]


--
To understand recursion, one must first understand recursion.


Re: hardware upgrade, performance degrade?

From: Steven Crandell
Scott,

Long story short: yes, I agree, the RAIDs are all kinds of wrong, and if I had been involved in the build process they would look very different right now.

I have been playing around with different bs= and count= settings for some simple dd tests tonight and found some striking differences that finally show the new hardware underperforming (significantly!) compared to the old hardware.
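For the record, a dd probe of the sort described might look like this (the path and sizes are illustrative and deliberately small; point of= at the LV under test and scale count= up on real hardware; conv=fdatasync makes dd flush to disk before reporting a rate):

```shell
# Small-block writes, roughly WAL-shaped (8kB blocks).
dd if=/dev/zero of=/tmp/dd_probe bs=8k count=25000 conv=fdatasync

# Large-block sequential writes for comparison.
dd if=/dev/zero of=/tmp/dd_probe bs=1M count=200 conv=fdatasync

# Clean up the probe file.
rm -f /tmp/dd_probe
```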

That said, we are still struggling to find a Postgres-specific test that yields sufficiently different results on old and new hardware that it could serve as an indicator that we had solved the problem, prior to shoving this box back into production service and crossing our fingers. The moment we fix the RAIDs, we give away a prime testing scenario. Tomorrow I plan to split the difference and fix the RAIDs on one of the new boxes but not the other.

We are also working on capturing production logs for playback, but that is proving non-trivial due to our existing performance bottleneck and some eccentricities of our application.

more to come on this hopefully


many thanks for all of the insights thus far.


On Mon, Mar 4, 2013 at 6:48 PM, Scott Marlowe <scott.marlowe@gmail.com> wrote:
> [...]