Thread: Re: Postgresql Performance on an HP DL385 and
Luke,
I thought so. In my test, I tried to be fair/equal since my Sun box has two 4-disc arrays each on their own channel. So, I just used one of them which should be a little slower than the 6-disc with 192MB cache.
Incidentally, the two internal SCSI drives, which are on the 6i adapter, generated a TPS of 18.
I thought this server would be impressive, based on notes I've read in the group. This is why I thought I might be doing something wrong. I'm stumped about which way to take this. There is no obvious fault, but something isn't right.
Steve
On 8/8/06, Luke Lonergan <LLonergan@greenplum.com> wrote:
Steve,
> Sun box with 4-disc array (4GB RAM. 4 167GB 10K SCSI RAID10
> LSI MegaRAID 128MB). This is after 8 runs.
>
> dbserver-dual-opteron-centos,08/08/06,Tuesday,20,us,12,2,5
> dbserver-dual-opteron-centos,08/08/06,Tuesday,20,sy,59,50,53
> dbserver-dual-opteron-centos,08/08/06,Tuesday,20,wa,1,0,0
> dbserver-dual-opteron-centos,08/08/06,Tuesday,20,id,45,26,38
>
> Average TPS is 75
>
> HP box with 8GB RAM. six disc array RAID10 on SmartArray 642
> with 192MB RAM. After 8 runs, I see:
>
> intown-vetstar-amd64,08/09/06,Tuesday,23,us,31,0,3
> intown-vetstar-amd64,08/09/06,Tuesday,23,sy,16,0,1
> intown-vetstar-amd64,08/09/06,Tuesday,23,wa,99,6,50
> intown-vetstar-amd64,08/09/06,Tuesday,23,id,78,0,42
>
> Average TPS is 31.
Note that the high, low and average I/O wait (wa) on the HP box are all
*much* higher than on the Sun box. The average I/O wait was 50% of one
CPU, which is huge. By comparison there was virtually no I/O wait on
the Sun machine.
This is indicating that your HP machine is indeed I/O bound and
furthermore is tying up a PG process waiting for the disk to return.
- Luke
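For anyone trying to reproduce this comparison, here is a minimal sketch of capturing the same CPU and I/O-wait picture during a run. It assumes the sysstat tools are installed; the interval, duration and log names are just examples, not part of Steve's setup.

#!/bin/sh
# Sample CPU states and per-device utilization for 5 minutes while the
# benchmark runs; high "wa" in vmstat plus high %util/await in iostat -x
# points at the array rather than the CPUs.
INTERVAL=5
COUNT=60
vmstat $INTERVAL $COUNT > vmstat.log &
iostat -x $INTERVAL $COUNT > iostat.log &
wait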
Luke,
I checked dmesg one more time and found this regarding the cciss driver:
Filesystem "cciss/c1d0p1": Disabling barriers, not supported by the underlying device.
Don't know if it means anything, but thought I'd mention it.
Steve
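For what it's worth, a quick way to see what the kernel decided about barriers and how the array is mounted; the device and mount names below are placeholders, not necessarily Steve's layout.

# which filesystems had write barriers disabled at mount time?
dmesg | grep -i barrier
# filesystem type and mount options actually in effect for the array
grep cciss /proc/mounts

That particular "Disabling barriers" line is typically printed by XFS when the controller/driver doesn't support cache-flush requests; for ext3 the equivalent knob is the barrier=0/1 mount option.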
On 8/8/06, Steve Poe <steve.poe@gmail.com> wrote:
Luke,
I thought so. In my test, I tried to be fair/equal since my Sun box has two 4-disc arrays each on their own channel. So, I just used one of them which should be a little slower than the 6-disc with 192MB cache.
Incidently, the two internal SCSI drives, which are on the 6i adapter, generated a TPS of 18.
I thought this server would impressive from notes I've read in the group. This is why I thought I might be doing something wrong. I stumped which way to take this. There is no obvious fault but something isn't right.
Steve
On 8/8/06, Luke Lonergan <LLonergan@greenplum.com> wrote:
Steve,
> Sun box with 4-disc array (4GB RAM. 4 167GB 10K SCSI RAID10
> LSI MegaRAID 128MB). This is after 8 runs.
>
> dbserver-dual-opteron-centos,08/08/06,Tuesday,20,us,12,2,5
> dbserver-dual-opteron-centos,08/08/06,Tuesday,20,sy,59,50,53
> dbserver-dual-opteron-centos,08/08/06,Tuesday,20,wa,1,0,0
> dbserver-dual-opteron-centos,08/08/06,Tuesday,20,id,45,26,38
>
> Average TPS is 75
>
> HP box with 8GB RAM. six disc array RAID10 on SmartArray 642
> with 192MB RAM. After 8 runs, I see:
>
> intown-vetstar-amd64,08/09/06,Tuesday,23,us,31,0,3
> intown-vetstar-amd64,08/09/06,Tuesday,23,sy,16,0,1
> intown-vetstar-amd64,08/09/06,Tuesday,23,wa,99,6,50
> intown-vetstar-amd64,08/09/06,Tuesday,23,id,78,0,42
>
> Average TPS is 31.
Note that the I/O wait (wa) on the HP box high, low and average are all
*much* higher than on the Sun box. The average I/O wait was 50% of one
CPU, which is huge. By comparison there was virtually no I/O wait on
the Sun machine.
This is indicating that your HP machine is indeed I/O bound and
furthermore is tying up a PG process waiting for the disk to return.
- Luke
Jim,
I'll give it a try. However, I did not see anywhere in the BIOS configuration of the 642 RAID adapter to enable writeback. It may have been mislabeled as "cache accelerator", where you can give a percentage to read/write. That aspect did not change the performance the way it does on the LSI MegaRAID adapter.
Steve
On 8/9/06, Jim C. Nasby <jnasby@pervasive.com> wrote:
On Tue, Aug 08, 2006 at 10:45:07PM -0700, Steve Poe wrote:
> Luke,
>
> I thought so. In my test, I tried to be fair/equal since my Sun box has two
> 4-disc arrays each on their own channel. So, I just used one of them which
> should be a little slower than the 6-disc with 192MB cache.
>
> Incidently, the two internal SCSI drives, which are on the 6i adapter,
> generated a TPS of 18.
You should try putting pg_xlog on the 6 drive array with the data. My
(limited) experience with such a config is that on a good controller
with writeback caching enabled it won't hurt you, and if the internal
drives aren't caching writes it'll probably help you a lot.
> I thought this server would impressive from notes I've read in the group.
> This is why I thought I might be doing something wrong. I stumped which way
> to take this. There is no obvious fault but something isn't right.
>
> Steve
>
> On 8/8/06, Luke Lonergan <LLonergan@greenplum.com> wrote:
> >
> >Steve,
> >
> >> Sun box with 4-disc array (4GB RAM. 4 167GB 10K SCSI RAID10
> >> LSI MegaRAID 128MB). This is after 8 runs.
> >>
> >> dbserver-dual-opteron-centos,08/08/06,Tuesday,20,us,12,2,5
> >> dbserver-dual-opteron-centos,08/08/06,Tuesday,20,sy,59,50,53
> >> dbserver-dual-opteron-centos,08/08/06,Tuesday,20,wa,1,0,0
> >> dbserver-dual-opteron-centos,08/08/06,Tuesday,20,id,45,26,38
> >>
> >> Average TPS is 75
> >>
> >> HP box with 8GB RAM. six disc array RAID10 on SmartArray 642
> >> with 192MB RAM. After 8 runs, I see:
> >>
> >> intown-vetstar-amd64,08/09/06,Tuesday,23,us,31,0,3
> >> intown-vetstar-amd64,08/09/06,Tuesday,23,sy,16,0,1
> >> intown-vetstar-amd64,08/09/06,Tuesday,23,wa,99,6,50
> >> intown-vetstar-amd64,08/09/06,Tuesday,23,id,78,0,42
> >>
> >> Average TPS is 31.
> >
> >Note that the I/O wait (wa) on the HP box high, low and average are all
> >*much* higher than on the Sun box. The average I/O wait was 50% of one
> >CPU, which is huge. By comparison there was virtually no I/O wait on
> >the Sun machine.
> >
> >This is indicating that your HP machine is indeed I/O bound and
> >furthermore is tying up a PG process waiting for the disk to return.
> >
> >- Luke
> >
> >
--
Jim C. Nasby, Sr. Engineering Consultant jnasby@pervasive.com
Pervasive Software http://pervasive.com work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf cell: 512-569-9461
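For reference, relocating pg_xlog onto the 642 array as Jim suggests is normally just a stop/move/symlink. The paths below are examples only, not Steve's actual directories.

# stop/start as the postgres user; mkdir/chown may need root
PGDATA=/var/lib/pgsql/data
NEW=/mnt/array642/pg_xlog            # directory on the SmartArray 642 volume

pg_ctl -D "$PGDATA" stop -m fast     # shut the cluster down first
mkdir -p "$NEW" && chown postgres:postgres "$NEW"
mv "$PGDATA"/pg_xlog/* "$NEW"/
rmdir "$PGDATA"/pg_xlog
ln -s "$NEW" "$PGDATA"/pg_xlog       # postgres follows the symlink
pg_ctl -D "$PGDATA" start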
I believe it does; I'll need to check. Thanks for the correction.
Steve
On 8/9/06, Scott Marlowe <smarlowe@g2switchworks.com> wrote:
On Wed, 2006-08-09 at 16:11, Steve Poe wrote:
> Jim,
>
> I'll give it a try. However, I did not see anywhere in the BIOS
> configuration of the 642 RAID adapter to enable writeback. It may have
> been mislabled cache accelerator where you can give a percentage to
> read/write. That aspect did not change the performance like the LSI
> MegaRAID adapter does.
Nope, that's not the same thing.
Does your RAID controller have battery-backed cache, or plain/regular
cache? Write-back is unsafe without battery backup.
The default is write through (i.e. the card waits for the data to get
written out before acking an fsync). In write back, the card's driver
writes the data to the bb cache, then returns on an fsync while the
cache gets written out at leisure. In the event of a loss of power, the
cache is flushed on restart.
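One rough way to tell which mode the card is really in is to time synchronous 8k writes on the array. This assumes a GNU dd new enough to support oflag=dsync, and the target path is a placeholder.

# write-through: expect roughly the platter rotation rate
#   (~160-250 writes/sec on 10K/15K RPM drives)
# working write-back cache: expect thousands per second
time dd if=/dev/zero of=/mnt/array642/synctest bs=8k count=1000 oflag=dsync
rm -f /mnt/array642/synctest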
Scott,
Do you know how to activate write-back on the HP RAID controller?
Steve
On 8/9/06, Scott Marlowe <smarlowe@g2switchworks.com> wrote:
On Wed, 2006-08-09 at 16:11, Steve Poe wrote:
> Jim,
>
> I'll give it a try. However, I did not see anywhere in the BIOS
> configuration of the 642 RAID adapter to enable writeback. It may have
> been mislabled cache accelerator where you can give a percentage to
> read/write. That aspect did not change the performance like the LSI
> MegaRAID adapter does.
Nope, that's not the same thing.
Does your RAID controller have battery-backed cache, or plain/regular
cache? Write-back is unsafe without battery backup.
The default is write through (i.e. the card waits for the data to get
written out before acking an fsync). In write back, the card's driver
writes the data to the bb cache, then returns on an fsync while the
cache gets written out at leisure. In the event of a loss of power, the
cache is flushed on restart.
Jim,

I tried as you suggested and my performance dropped by 50%. I went from
a 32 TPS to 16. Oh well.

Steve

On Wed, 2006-08-09 at 16:05 -0500, Jim C. Nasby wrote:
> You should try putting pg_xlog on the 6 drive array with the data. My
> (limited) experience with such a config is that on a good controller
> with writeback caching enabled it won't hurt you, and if the internal
> drives aren't caching writes it'll probably help you a lot.
On Wed, Aug 09, 2006 at 08:29:13PM -0700, Steve Poe wrote:
> I tried as you suggested and my performance dropped by 50%. I went from
> a 32 TPS to 16. Oh well.

If you put data & xlog on the same array, put them on separate
partitions, probably formatted differently (ext2 on xlog).

Mike Stone
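A minimal sketch of that layout, with placeholder device names: a small ext2 filesystem dedicated to pg_xlog alongside the main data filesystem.

mkfs.ext2 /dev/cciss/c1d0p2          # small partition, no journal wanted here
mkdir -p /mnt/pg_xlog
mount -o noatime /dev/cciss/c1d0p2 /mnt/pg_xlog
# then stop postgres, move $PGDATA/pg_xlog onto /mnt/pg_xlog and symlink
# it back, as in the earlier relocation sketch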
Scott,
I *could* rip out the LSI MegaRAID 2X from my Sun box; it belongs to me for testing. But I don't know if it will fit in the DL385. Do they have full-height/length slots? I've not worked on this type of box before. I was thinking this is the next step. In the meantime, I've discovered there's no email support for them, so I am hoping to find a support contact through the sales rep this box was purchased from.
Steve
On 8/10/06, Scott Marlowe <smarlowe@g2switchworks.com> wrote:
On Thu, 2006-08-10 at 10:15, Luke Lonergan wrote:
> Mike,
>
> On 8/10/06 4:09 AM, "Michael Stone" <mstone+postgres@mathom.us> wrote:
>
> > On Wed, Aug 09, 2006 at 08:29:13PM -0700, Steve Poe wrote:
> >> I tried as you suggested and my performance dropped by 50%. I went from
> >> a 32 TPS to 16. Oh well.
> >
> > If you put data & xlog on the same array, put them on seperate
> > partitions, probably formatted differently (ext2 on xlog).
>
> If he's doing the same thing on both systems (Sun and HP) and the HP
> performance is dramatically worse despite using more disks and having faster
> CPUs and more RAM, ISTM the problem isn't the configuration.
>
> Add to this the fact that the Sun machine is CPU bound while the HP is I/O
> wait bound and I think the problem is the disk hardware or the driver
> therein.
I agree. The problem here looks to be the RAID controller.
Steve, got access to a different RAID controller to test with?
Jim,
I have to say Michael is onto something here, to my surprise. I partitioned the RAID10 on the SmartArray 642 adapter into two parts: PGDATA formatted with XFS and pg_xlog as ext2. Performance jumped up to a median of 98 TPS. I could reproduce a similar result with the LSI MegaRAID 2X adapter as well as with my own 4-disc drive array.
The problem lies with the HP SmartArray 6i adapter and/or the internal SCSI discs. Putting pg_xlog on it kills the performance.
Steve
On 8/14/06, Jim C. Nasby <jnasby@pervasive.com> wrote:
On Thu, Aug 10, 2006 at 07:09:38AM -0400, Michael Stone wrote:
> On Wed, Aug 09, 2006 at 08:29:13PM -0700, Steve Poe wrote:
> >I tried as you suggested and my performance dropped by 50%. I went from
> >a 32 TPS to 16. Oh well.
>
> If you put data & xlog on the same array, put them on seperate
> partitions, probably formatted differently (ext2 on xlog).
Got any data to back that up?
The problem with seperate partitions is that it means more head movement
for the drives. If it's all one partition the pg_xlog data will tend to
be interspersed with the heap data, meaning less need for head
repositioning.
Of course, if ext2 provided enough of a speed improvement over ext3 with
data=writeback then it's possible that this would be a win, though if
the controller is good enough to make putting pg_xlog on the same array
as $PGDATA a win, I suspect it would make up for most filesystem
performance issues associated with pg_xlog as well.
--
Jim C. Nasby, Sr. Engineering Consultant jnasby@pervasive.com
Pervasive Software http://pervasive.com work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf cell: 512-569-9461
On Mon, Aug 14, 2006 at 10:38:41AM -0500, Jim C. Nasby wrote:
> Got any data to back that up?

yes. that I'm willing to dig out? no. :)

> The problem with seperate partitions is that it means more head movement
> for the drives. If it's all one partition the pg_xlog data will tend to
> be interspersed with the heap data, meaning less need for head
> repositioning.

The pg_xlog files will tend to be created up at the front of the disk
and just sit there. Any effect the positioning has one way or the other
isn't going to be measurable/repeatable. With a write cache for pg_xlog
the positioning isn't really going to matter anyway, since you don't
have to wait for a seek to do the write.

From what I've observed in testing, I'd guess that the issue is that
certain filesystem operations (including, possibly, metadata operations)
are handled in order. If you have xlog on a separate partition there
will never be anything competing with a log write on the server side,
which won't necessarily be true on a shared filesystem. Even if you have
a battery-backed write cache, you might still have to wait a relatively
long time for the pg_xlog data to be written out if there's already a
lot of other stuff in a filesystem write queue.

Mike Stone
On Mon, Aug 14, 2006 at 08:51:09AM -0700, Steve Poe wrote:
> Jim,
>
> I have to say Michael is onto something here to my surprise. I partitioned
> the RAID10 on the SmartArray 642 adapter into two parts, PGDATA formatted
> with XFS and pg_xlog as ext2. Performance jumped up to median of 98 TPS. I
> could reproduce the similar result with the LSI MegaRAID 2X adapter as well
> as with my own 4-disc drive array.
>
> The problem lies with the HP SmartArray 6i adapter and/or the internal SCSI
> discs. Putting the pg_xlog on it kills the performance.

Wow, interesting. IIRC, XFS is lower performing than ext3, so if your
previous tests were done with XFS, that might be part of it. But without
a doubt, if you don't have a good raid controller you don't want to try
combining pg_xlog with PGDATA.

--
Jim C. Nasby, Sr. Engineering Consultant   jnasby@pervasive.com
On Mon, Aug 14, 2006 at 12:05:46PM -0500, Jim C. Nasby wrote:
> Wow, interesting. IIRC, XFS is lower performing than ext3,

For xlog, maybe. For data, no. Both are definitely slower than ext2 for
xlog, which is another reason to have xlog on a small filesystem which
doesn't need metadata journalling.

Mike Stone
On Mon, Aug 14, 2006 at 01:03:41PM -0400, Michael Stone wrote:
> On Mon, Aug 14, 2006 at 10:38:41AM -0500, Jim C. Nasby wrote:
> > Got any data to back that up?
>
> yes. that I'm willing to dig out? no. :)

Well, I'm not digging hard numbers out either, so that's fair. :) But it
would be very handy if people posted results from any testing they're
doing as part of setting up new hardware. Actually, a wiki would
probably be ideal for this...

> The pg_xlog files will tend to be created up at the front of the disk
> and just sit there. Any affect the positioning has one way or the other
> isn't going to be measurable/repeatable. With a write cache for pg_xlog
> the positioning isn't really going to matter anyway, since you don't
> have to wait for a seek to do the write.

Certainly... my contention is that if you have a good controller that's
caching writes then drive layout basically won't matter at all, because
the controller will just magically make things optimal.

> From what I've observed in testing, I'd guess that the issue is that
> certain filesystem operations (including, possibly, metadata operations)
> are handled in order. If you have xlog on a seperate partition there
> will never be anything competing with a log write on the server side,
> which won't necessarily be true on a shared filesystem. Even if you have
> a battery backed write cache, you might still have to wait a relatively
> long time for the pg_xlog data to be written out if there's already a
> lot of other stuff in a filesystem write queue.

Well, if the controller is caching with a BBU, I'm not sure that order
matters anymore, because the controller should be able to re-order at
will. Theoretically. :) But this is why having some actual data posted
somewhere would be great.

--
Jim C. Nasby, Sr. Engineering Consultant   jnasby@pervasive.com
On Mon, Aug 14, 2006 at 01:09:04PM -0400, Michael Stone wrote:
> On Mon, Aug 14, 2006 at 12:05:46PM -0500, Jim C. Nasby wrote:
> > Wow, interesting. IIRC, XFS is lower performing than ext3,
>
> For xlog, maybe. For data, no. Both are definately slower than ext2 for
> xlog, which is another reason to have xlog on a small filesystem which
> doesn't need metadata journalling.

Are 'we' sure that such a setup can't lose any data? I'm worried about
files getting lost when they get written out before the metadata does.

--
Jim C. Nasby, Sr. Engineering Consultant   jnasby@pervasive.com
On Tue, Aug 15, 2006 at 11:25:24AM -0500, Jim C. Nasby wrote:
> Well, if the controller is caching with a BBU, I'm not sure that order
> matters anymore, because the controller should be able to re-order at
> will. Theoretically. :) But this is why having some actual data posted
> somewhere would be great.

You're missing the point. It's not a question of what happens once it
gets to the disk/controller, it's a question of whether the xlog write
has to compete with some other write activity before the write gets to
the disk (e.g., at the filesystem level). If you've got a bunch of stuff
in a write buffer on the OS level and you try to push the xlog write
out, you may have to wait for the other stuff to get to the controller
write cache before the xlog does. It doesn't matter if you don't have to
wait for the write to get from the controller cache to the disk if you
already had to wait to get to the controller cache. The effect is a
*lot* smaller than not having a non-volatile cache, but it is an
improvement.

(Also, the difference between ext2 and xfs for the xlog is pretty big
itself, and a good reason all by itself to put xlog on a separate
partition that's small enough to not need journalling.)

Mike Stone
On Tue, Aug 15, 2006 at 11:29:26AM -0500, Jim C. Nasby wrote:
> Are 'we' sure that such a setup can't lose any data?

Yes. If you check the archives, you can even find the last time this was
discussed...

The bottom line is that the only reason you need a metadata journalling
filesystem is to save the fsck time when you come up. On a little
partition like xlog, that's not an issue.

Mike Stone
On Tue, Aug 15, 2006 at 11:29:26AM -0500, Jim C. Nasby wrote:
> Are 'we' sure that such a setup can't lose any data? I'm worried about
> files getting lost when they get written out before the metadata does.

I've been worrying about this myself, and my current conclusion is that
ext2 is bad because: a) fsck, and b) data can be lost or corrupted, which
could lead to the need to trash the xlog.

Even ext3 in writeback mode allows for the indirect blocks to be updated
without the data underneath, allowing for blocks to point to random data,
or worse, previous apparently sane data (especially if the data is from
a drive only used for xlog - the chance is high that a block might look
partially valid?).

So, I'm sticking with ext3 in ordered mode.

Cheers,
mark

--
mark@mielke.cc / markm@ncf.ca / markm@nortel.com
On Tue, Aug 15, 2006 at 01:26:46PM -0400, Michael Stone wrote:
> On Tue, Aug 15, 2006 at 11:29:26AM -0500, Jim C. Nasby wrote:
> > Are 'we' sure that such a setup can't lose any data?
> Yes. If you check the archives, you can even find the last time this was
> discussed...

I looked last night (coincidence actually) and didn't find proof that
you cannot lose data. How do you deal with the file system structure
being updated before the data blocks are (re-)written? I don't think
you can.

> The bottom line is that the only reason you need a metadata journalling
> filesystem is to save the fsck time when you come up. On a little
> partition like xlog, that's not an issue.

fsck isn't only about time to fix. fsck is needed, because the file
system is broken. If the file system is broken, how can you guarantee
data has not been corrupted?

Cheers,
mark
On Tue, Aug 15, 2006 at 02:33:27PM -0400, mark@mark.mielke.cc wrote:
> I looked last night (coincidence actually) and didn't find proof that
> you cannot lose data.

You aren't going to find proof, any more than you'll find proof that
you won't lose data if you do use a journalling fs. (Because there
isn't any.) Unfortunately, many people misunderstand what a metadata
journal does for you, and overstate its importance in this type of
application.

> How do you deal with the file system structure being updated before the
> data blocks are (re-)written?

*That's what the postgres log is for.* If the latest xlog entries don't
make it to disk, they won't be replayed; if they didn't make it to
disk, the transaction would not have been reported as committed. An
application that understands filesystem semantics can guarantee data
integrity without metadata journaling.

> > The bottom line is that the only reason you need a metadata journalling
> > filesystem is to save the fsck time when you come up. On a little
> > partition like xlog, that's not an issue.
>
> fsck isn't only about time to fix. fsck is needed, because the file system
> is broken.

fsck is needed to reconcile the metadata with the on-disk allocations.
To do that, it reads all the inodes and their corresponding directory
entries. The time to do that is proportional to the size of the
filesystem, hence the comment about time. fsck is not needed "because
the filesystem is broken", it's needed because the filesystem is marked
dirty.

Mike Stone
On Tue, Aug 15, 2006 at 03:02:56PM -0400, Michael Stone wrote:
> *That's what the postgres log is for.* If the latest xlog entries don't
> make it to disk, they won't be replayed; if they didn't make it to
> disk, the transaction would not have been reported as committed. An
> application that understands filesystem semantics can guarantee data
> integrity without metadata journaling.

So what causes files to get 'lost' and get stuck in lost+found? AFAIK
that's because the file was written before the metadata. Now, if
fsync'ing a file also ensures that all the metadata is written, then
we're probably fine... if not, then we're at risk every time we create a
new file (every WAL segment if archiving is on, and every time a
relation passes a 1GB boundary).

FWIW, the way that FreeBSD gets around the need to fsck a dirty
filesystem before use without using a journal is to ensure that metadata
operations are always on the drive before the actual data is written.
There's still a need to fsck a dirty filesystem, but it can now be done
in the background, with the filesystem mounted and in use.

> fsck is needed to reconcile the metadata with the on-disk allocations.
> To do that, it reads all the inodes and their corresponding directory
> entries. The time to do that is proportional to the size of the
> filesystem, hence the comment about time. fsck is not needed "because
> the filesystem is broken", it's needed because the filesystem is marked
> dirty.

--
Jim C. Nasby, Sr. Engineering Consultant   jnasby@pervasive.com
On Tue, Aug 15, 2006 at 03:02:56PM -0400, Michael Stone wrote:
> You aren't going to find proof, any more than you'll find proof that you
> won't lose data if you do lose a journalling fs. (Because there isn't
> any.) Unfortunately, many people misunderstand the what a metadata
> journal does for you, and overstate its importance in this type of
> application.

Yes, many people do. :-)

> *That's what the postgres log is for.* If the latest xlog entries don't
> make it to disk, they won't be replayed; if they didn't make it to
> disk, the transaction would not have been reported as commited. An
> application that understands filesystem semantics can guarantee data
> integrity without metadata journaling.

No. This is not true. Updating the file system structure (inodes, indirect
blocks) touches a separate part of the disk than the actual data. If
the file system structure is modified, say, to extend a file to allow
it to contain more data, but the data itself is not written, then upon
a restore, with a system such as ext2, or ext3 with writeback, or xfs,
it is possible that the end of the file, even the postgres log file,
will contain a random block of data from the disk. If this random block
of data happens to look like a valid xlog block, it may be played back,
and the database corrupted.

If the file system is only used for xlog data, the chance that it looks
like a valid block increases, would it not?

> fsck is needed to reconcile the metadata with the on-disk allocations.
> To do that, it reads all the inodes and their corresponding directory
> entries. The time to do that is proportional to the size of the
> filesystem, hence the comment about time. fsck is not needed "because
> the filesystem is broken", it's needed because the filesystem is marked
> dirty.

This is also wrong. fsck is needed because the file system is broken.
It takes time, because it doesn't have a journal to help it, therefore
it must look through the entire file system and guess what the problems
are. There are classes of problems such as I describe above, for which
fsck *cannot* guess how to solve the problem. There is not enough
information available for it to deduce that anything is wrong at all.
The probability is low, for sure - but then, the chance of a file
system failure is already low. Betting on ext2 + postgresql xlog has
not been confirmed to me as reliable.

Telling me that journalling is misunderstood doesn't prove to me that
you understand it. I don't mean to be offensive, but I won't accept
what you say, as it does not make sense with my understanding of how
file systems work. :-)

Cheers,
mark
On Tue, Aug 15, 2006 at 02:15:05PM -0500, Jim C. Nasby wrote:
> So what causes files to get 'lost' and get stuck in lost+found?
> AFAIK that's because the file was written before the metadata. Now, if
> fsync'ing a file also ensures that all the metadata is written, then
> we're probably fine... if not, then we're at risk every time we create a
> new file (every WAL segment if archiving is on, and every time a
> relation passes a 1GB boundary).

Only if fsync ensures that the data written to disk is ordered, which as
far as I know, is not done for ext2. Dirty blocks are written in
whatever order is fastest for them to be written, or sequential order,
or some order that isn't based on examining the metadata.

If my understanding is correct - and I've seen nothing yet to say that
it isn't - ext2 is not safe, postgresql xlog or not, fsck or not. It is
safer than no postgresql xlog - but there exists windows, however small,
where the file system can be corrupted.

The need for fsck is due to this problem. If fsck needs to do anything
at all, other than replay a journal, the file system is broken.

Cheers,
mark
mark@mark.mielke.cc writes:
> I've been worrying about this myself, and my current conclusion is that
> ext2 is bad because: a) fsck, and b) data can be lost or corrupted, which
> could lead to the need to trash the xlog.

> Even ext3 in writeback mode allows for the indirect blocks to be updated
> without the data underneath, allowing for blocks to point to random data,
> or worse, previous apparently sane data (especially if the data is from
> a drive only used for xlog - the chance is high that a block might look
> partially valid?).

At least for xlog, this worrying is misguided, because we zero and fsync
a WAL file before we ever put any valuable data into it. Unless the
filesystem is lying through its teeth about having done an fsync, there
should be no metadata changes happening for an active WAL file (other
than mtime of course).

			regards, tom lane
On Tue, Aug 15, 2006 at 04:05:17PM -0400, Tom Lane wrote:
> At least for xlog, this worrying is misguided, because we zero and fsync
> a WAL file before we ever put any valuable data into it. Unless the
> filesystem is lying through its teeth about having done an fsync, there
> should be no metadata changes happening for an active WAL file (other
> than mtime of course).

Hmmm... I may have missed a post about this in the archive. WAL file is
never appended - only re-written? If so, then I'm wrong, and ext2 is
fine. The requirement is that no file system structures change as a
result of any writes that PostgreSQL does. If no file system structures
change, then I take everything back as uninformed.

Please confirm whichever. :-)

Cheers,
mark
mark@mark.mielke.cc writes:
> WAL file is never appended - only re-written?
> If so, then I'm wrong, and ext2 is fine. The requirement is that no
> file system structures change as a result of any writes that
> PostgreSQL does. If no file system structures change, then I take
> everything back as uninformed.

That risk certainly exists in the general data directory, but AFAIK
it's not a problem for pg_xlog.

			regards, tom lane
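Tom's preallocation point is easy to check on a running cluster: every segment in pg_xlog should already be at its full size (16MB by default), so normal WAL writes never extend a file. The path below is only an example.

# all WAL segments should show the same size (16777216 bytes by default),
# i.e. they were zero-filled and fsync'd before first use
ls -l /var/lib/pgsql/data/pg_xlog/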
On Tue, Aug 15, 2006 at 02:15:05PM -0500, Jim C. Nasby wrote:
> Now, if fsync'ing a file also ensures that all the metadata is written,
> then we're probably fine...

...and it does. Unclean shutdowns cause problems in general because
filesystems operate asynchronously. postgres (and other similar
programs) go to great lengths to make sure that critical operations are
performed synchronously. If the program *doesn't* do that, metadata
journaling isn't a magic wand which will guarantee data integrity--it
won't. If the program *does* do that, all the metadata journaling adds
is the ability to skip fsck and start up faster.

Mike Stone
On Tue, Aug 15, 2006 at 03:39:51PM -0400, mark@mark.mielke.cc wrote:
> No. This is not true. Updating the file system structure (inodes, indirect
> blocks) touches a separate part of the disk than the actual data. If
> the file system structure is modified, say, to extend a file to allow
> it to contain more data, but the data itself is not written, then upon
> a restore, with a system such as ext2, or ext3 with writeback, or xfs,
> it is possible that the end of the file, even the postgres log file,
> will contain a random block of data from the disk. If this random block
> of data happens to look like a valid xlog block, it may be played back,
> and the database corrupted.

you're conflating a whole lot of different issues here. You're ignoring
the fact that postgres preallocates the xlog segment, you're ignoring
the fact that you can sync a directory entry, you're ignoring the fact
that syncing some metadata (such as atime) doesn't matter (only the
block allocation is important in this case, and the blocks are
pre-allocated).

> This is also wrong. fsck is needed because the file system is broken.

nope, the file system *may* be broken. the dirty flag simply indicates
that the filesystem needs to be checked to find out whether or not it
is broken.

> I don't mean to be offensive, but I won't accept what you say, as it does
> not make sense with my understanding of how file systems work. :-)

<shrug> I'm not getting paid to convince you of anything.

Mike Stone
On Tue, Aug 15, 2006 at 04:58:59PM -0400, Michael Stone wrote:
> you're conflating a whole lot of different issues here. You're ignoring
> the fact that postgres preallocates the xlog segment, you're ignoring
> the fact that you can sync a directory entry, you're ignoring the fact
> that syncing some metadata (such as atime) doesn't matter (only the
> block allocation is important in this case, and the blocks are
> pre-allocated).

Yes, no, no, no. :-)

I didn't know that the xlog segment only uses pre-allocated space. I
ignore mtime/atime as they don't count as file system structure
changes to me. It's updating a field in place. No change to the
structure.

With the pre-allocation knowledge, I agree with you. Not sure how I
missed that in my reviewing of the archives... I did know it
pre-allocated once upon a time... Hmm....

> > This is also wrong. fsck is needed because the file system is broken.
> nope, the file system *may* be broken. the dirty flag simply indicates
> that the filesystem needs to be checked to find out whether or not it is
> broken.

Ah, but if we knew it wasn't broken, then fsck wouldn't be needed, now
would it? So we assume that it is broken. A little bit of a game, but
it is important to me. If I assumed the file system was not broken, I
wouldn't run fsck. I run fsck, because I assume it may be broken. If
broken, it indicates potential corruption.

The difference for me is that if you are correct, that the xlog is
safe, then for a disk that only uses xlog, fsck is not ever necessary,
even after a system crash. If fsck is necessary, then there is
potential for a problem.

With the pre-allocation knowledge, I'm tempted to agree with you that
fsck is not ever necessary for partitions that only hold a properly
pre-allocated xlog.

> > I don't mean to be offensive, but I won't accept what you say, as it does
> > not make sense with my understanding of how file systems work. :-)
> <shrug> I'm not getting paid to convince you of anything.

Just getting you to back up your claim a bit... As I said, no intent to
offend. I learned from it.

Thanks,
mark
On Tue, Aug 15, 2006 at 05:38:43PM -0400, mark@mark.mielke.cc wrote:
> I didn't know that the xlog segment only uses pre-allocated space. I
> ignore mtime/atime as they don't count as file system structure
> changes to me. It's updating a field in place. No change to the structure.
>
> With the pre-allocation knowledge, I agree with you. Not sure how I
> missed that in my reviewing of the archives... I did know it
> pre-allocated once upon a time... Hmm....

This is only valid if the pre-allocation is also fsync'd *and* fsync
ensures that both the metadata and file data are on disk. Anyone
actually checked that? :)

BTW, I did see some anecdotal evidence on one of the lists a while ago.
A PostgreSQL DBA had suggested doing a 'pull the power cord' test to the
other DBAs (all of which were responsible for different RDBMSes,
including a bunch of well known names). They all thought he was off his
rocker. Not too long after that, an unplanned power outage did occur,
and PostgreSQL was the only RDBMS that recovered every single database
without intervention.

--
Jim C. Nasby, Sr. Engineering Consultant   jnasby@pervasive.com
On Tue, Aug 15, 2006 at 05:20:25PM -0500, Jim C. Nasby wrote:
> This is only valid if the pre-allocation is also fsync'd *and* fsync
> ensures that both the metadata and file data are on disk. Anyone
> actually checked that? :)

fsync() does that, yes. fdatasync() (if it exists), OTOH, doesn't sync
the metadata.

/* Steinar */
--
Homepage: http://www.sesse.net/
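If anyone wants to see which sync call their backends actually issue at commit, strace shows it directly. The PID lookup below assumes the usual postmaster.pid layout, and the data directory path is only an example.

PGDATA=/var/lib/pgsql/data
PID=$(head -1 "$PGDATA"/postmaster.pid)    # first line is the postmaster PID
# follow newly forked backends and watch sync-related calls for a while
strace -f -tt -e trace=fsync,fdatasync -p "$PID" 2>&1 | head -50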
On Tue, 15 Aug 2006, mark@mark.mielke.cc wrote:
>>> This is also wrong. fsck is needed because the file system is broken.
>> nope, the file system *may* be broken. the dirty flag simply indicates
>> that the filesystem needs to be checked to find out whether or not it is
>> broken.
>
> Ah, but if we knew it wasn't broken, then fsck wouldn't be needed, now
> would it? So we assume that it is broken. A little bit of a game, but
> it is important to me. If I assumed the file system was not broken, I
> wouldn't run fsck. I run fsck, because I assume it may be broken. If
> broken, it indicates potential corruption.

note that the ext3, reiserfs, jfs, and xfs developers (at least)
consider fsck necessary even for journaling filesystems. they just let
you get away without it being mandatory after an unclean shutdown.

David Lang
"Steinar H. Gunderson" <sgunderson@bigfoot.com> writes: > On Tue, Aug 15, 2006 at 05:20:25PM -0500, Jim C. Nasby wrote: >> This is only valid if the pre-allocation is also fsync'd *and* fsync >> ensures that both the metadata and file data are on disk. Anyone >> actually checked that? :) > fsync() does that, yes. fdatasync() (if it exists), OTOH, doesn't sync the > metadata. Well, the POSIX spec says that fsync should do that ;-) My guess is that most/all kernel filesystem layers do indeed try to sync everything that the spec says they should. The Achilles' heel of the whole business is disk drives that lie about write completion. The kernel is just as vulnerable to that as any application ... regards, tom lane
Hi, Jim,

Jim C. Nasby wrote:
> Well, if the controller is caching with a BBU, I'm not sure that order
> matters anymore, because the controller should be able to re-order at
> will. Theoretically. :) But this is why having some actual data posted
> somewhere would be great.

Well, actually, the controller should not reorder over write barriers.

Markus

--
Markus Schaber | Logical Tracking&Tracing International AG
Dipl. Inf.     | Software Development GIS
Everyone,

I wanted to follow-up on bonnie results for the internal RAID1 which is
connected to the SmartArray 6i. I believe this is the problem, but I am
not good at interpreting the results. Here's a sample of three runs:

scsi disc array ,16G,47983,67,65492,20,37214,6,73785,87,89787,6,578.2,0,16,+++++,+++,+++++,+++,+++++,+++,+++++,+++,+++++,+++,+++++,+++
scsi disc array ,16G,54634,75,67793,21,36835,6,74190,88,89314,6,579.9,0,16,+++++,+++,+++++,+++,+++++,+++,+++++,+++,+++++,+++,+++++,+++
scsi disc array ,16G,55056,76,66108,20,36859,6,74108,87,89559,6,585.0,0,16,+++++,+++,+++++,+++,+++++,+++,+

This was run on the internal RAID1 on the outer portion of the discs,
formatted as ext2.

Thanks.

Steve

On Thu, 2006-08-10 at 10:35 -0500, Scott Marlowe wrote:
> I agree. The problem here looks to be the RAID controller.
>
> Steve, got access to a different RAID controller to test with?
Steve, If this is an internal RAID1 on two disks, it looks great. Based on the random seeks though (578 seeks/sec), it looks like maybe it's 6 disks in a RAID10? - Luke On 8/16/06 7:10 PM, "Steve Poe" <steve.poe@gmail.com> wrote: > Everyone, > > I wanted to follow-up on bonnie results for the internal RAID1 which is > connected to the SmartArray 6i. I believe this is the problem, but I am > not good at interepting the results. Here's an sample of three runs: > > scsi disc > array ,16G,47983,67,65492,20,37214,6,73785,87,89787,6,578.2,0,16,+++++, > +++,+++++,+++,+++++,+++,+++++,+++,+++++,+++,+++++,+++ > scsi disc > array ,16G,54634,75,67793,21,36835,6,74190,88,89314,6,579.9,0,16,+++++, > +++,+++++,+++,+++++,+++,+++++,+++,+++++,+++,+++++,+++ > scsi disc > array ,16G,55056,76,66108,20,36859,6,74108,87,89559,6,585.0,0,16,+++++, > +++,+++++,+++,+++++,+++,+++++,+++,+ > > This was run on the internal RAID1 on the outer portion of the discs > formatted at ext2. > > Thanks. > > Steve > > On Thu, 2006-08-10 at 10:35 -0500, Scott Marlowe wrote: >> On Thu, 2006-08-10 at 10:15, Luke Lonergan wrote: >>> Mike, >>> >>> On 8/10/06 4:09 AM, "Michael Stone" <mstone+postgres@mathom.us> wrote: >>> >>>> On Wed, Aug 09, 2006 at 08:29:13PM -0700, Steve Poe wrote: >>>>> I tried as you suggested and my performance dropped by 50%. I went from >>>>> a 32 TPS to 16. Oh well. >>>> >>>> If you put data & xlog on the same array, put them on seperate >>>> partitions, probably formatted differently (ext2 on xlog). >>> >>> If he's doing the same thing on both systems (Sun and HP) and the HP >>> performance is dramatically worse despite using more disks and having faster >>> CPUs and more RAM, ISTM the problem isn't the configuration. >>> >>> Add to this the fact that the Sun machine is CPU bound while the HP is I/O >>> wait bound and I think the problem is the disk hardware or the driver >>> therein. >> >> I agree. The problem here looks to be the RAID controller. >> >> Steve, got access to a different RAID controller to test with? >> >> ---------------------------(end of broadcast)--------------------------- >> TIP 1: if posting/reading through Usenet, please send an appropriate >> subscribe-nomail command to majordomo@postgresql.org so that your >> message can get through to the mailing list cleanly > >
That's about what I was getting for a 2 disk RAID 0 setup on a PE 2950. Here are bonnie++ numbers for the RAID10x4 and RAID0x2; unfortunately I only have the 1.93 numbers since this was before I got the advice to run with the earlier version of bonnie and larger file sizes, so I don't know how meaningful they are.

RAID 10x4

bash-2.05b$ bonnie++ -d bonnie -s 1000:8k
Version 1.93c       ------Sequential Output------ --Sequential Input- --Random-
Concurrency   1     -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP /sec %CP
              1000M   585  99 21705   4 28560   9  1004  99 812997  98 5436 454
Latency             14181us   81364us   50256us   57720us    1671us    1059ms
Version 1.93c       ------Sequential Create------ --------Random Create--------
                    -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                 16  4712  10 +++++ +++ +++++ +++  4674  10 +++++ +++ +++++ +++
Latency              807ms    21us    36us   804ms   110us    36us
1.93c,1.93c, ,1,1155207445,1000M,,585,99,21705,4,28560,9,1004,99,812997,98,5436,454,16,,,,,4712,10,+++++,+++,+++++,+++,4674,10,+++++,+++,+++++,+++,14181us,81364us,50256us,57720us,1671us,1059ms,807ms,21us,36us,804ms,110us,36us
bash-2.05b$

RAID 0x2

bash-2.05b$ bonnie++ -d bonnie -s 1000:8k
Version 1.93c       ------Sequential Output------ --Sequential Input- --Random-
Concurrency   1     -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP /sec %CP
              1000M   575  99 131621  25 104178  26  1004  99 816928  99 6233 521
Latency             14436us   26663us   47478us   54796us    1487us   38924us
Version 1.93c       ------Sequential Create------ --------Random Create--------
                    -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
              files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
                 16  4935  10 +++++ +++ +++++ +++  5198  11 +++++ +++ +++++ +++
Latency              738ms    32us    43us   777ms    24us    30us
1.93c,1.93c,beast.corp.lumeta.com,1,1155210203,1000M,,575,99,131621,25,104178,26,1004,99,816928,99,6233,521,16,,,,,4935,10,+++++,+++,+++++,+++,5198,11,+++++,+++,+++++,+++,14436us,26663us,47478us,54796us,1487us,38924us,738ms,32us,43us,777ms,24us,30us

A RAID 5 configuration seems to outperform this on the PE 2950 though (at least in terms of raw read/write perf). If anyone's interested in some more detailed tests of the 2950, I might be able to reconfigure the raid for some tests next week before I start setting up the box for long term use, so I'm open to suggestions. See earlier posts in this thread for details about the hardware.

Thanks,

Bucky

-----Original Message-----
From: pgsql-performance-owner@postgresql.org [mailto:pgsql-performance-owner@postgresql.org] On Behalf Of Luke Lonergan
Sent: Friday, August 18, 2006 10:38 AM
To: steve.poe@gmail.com; Scott Marlowe
Cc: Michael Stone; pgsql-performance@postgresql.org
Subject: Re: [PERFORM] Postgresql Performance on an HP DL385 and

Steve,

If this is an internal RAID1 on two disks, it looks great.

Based on the random seeks though (578 seeks/sec), it looks like maybe it's 6 disks in a RAID10?

- Luke

On 8/16/06 7:10 PM, "Steve Poe" <steve.poe@gmail.com> wrote:

> Everyone,
>
> I wanted to follow-up on bonnie results for the internal RAID1 which is
> connected to the SmartArray 6i. I believe this is the problem, but I am
> not good at interepting the results. Here's an sample of three runs:
>
> scsi disc
> array ,16G,47983,67,65492,20,37214,6,73785,87,89787,6,578.2,0,16,+++++,
> +++,+++++,+++,+++++,+++,+++++,+++,+++++,+++,+++++,+++
> scsi disc
> array ,16G,54634,75,67793,21,36835,6,74190,88,89314,6,579.9,0,16,+++++,
> +++,+++++,+++,+++++,+++,+++++,+++,+++++,+++,+++++,+++
> scsi disc
> array ,16G,55056,76,66108,20,36859,6,74108,87,89559,6,585.0,0,16,+++++,
> +++,+++++,+++,+++++,+++,+
>
> This was run on the internal RAID1 on the outer portion of the discs
> formatted at ext2.
>
> Thanks.
>
> Steve
>
> On Thu, 2006-08-10 at 10:35 -0500, Scott Marlowe wrote:
>> On Thu, 2006-08-10 at 10:15, Luke Lonergan wrote:
>>> Mike,
>>>
>>> On 8/10/06 4:09 AM, "Michael Stone" <mstone+postgres@mathom.us> wrote:
>>>
>>>> On Wed, Aug 09, 2006 at 08:29:13PM -0700, Steve Poe wrote:
>>>>> I tried as you suggested and my performance dropped by 50%. I went from
>>>>> a 32 TPS to 16. Oh well.
>>>>
>>>> If you put data & xlog on the same array, put them on seperate
>>>> partitions, probably formatted differently (ext2 on xlog).
>>>
>>> If he's doing the same thing on both systems (Sun and HP) and the HP
>>> performance is dramatically worse despite using more disks and having faster
>>> CPUs and more RAM, ISTM the problem isn't the configuration.
>>>
>>> Add to this the fact that the Sun machine is CPU bound while the HP is I/O
>>> wait bound and I think the problem is the disk hardware or the driver
>>> therein.
>>
>> I agree. The problem here looks to be the RAID controller.
>>
>> Steve, got access to a different RAID controller to test with?
>>
>> ---------------------------(end of broadcast)---------------------------
>> TIP 1: if posting/reading through Usenet, please send an appropriate
>> subscribe-nomail command to majordomo@postgresql.org so that your
>> message can get through to the mailing list cleanly
>
>

---------------------------(end of broadcast)---------------------------
TIP 6: explain analyze is your friend
Luke,
Nope. It is only a RAID1 for the 2 internal discs connected to the SmartArray 6i. This is where I *had* the pg_xlog located when the performance was very poor. Also, I just found out the default stripe size is 128k. Would this be a problem for pg_xlog?
The 6-disc RAID10 you speak of is on the SmartArray 642 RAID adapter.
Steve
On 8/18/06, Luke Lonergan <llonergan@greenplum.com> wrote:
Steve,
If this is an internal RAID1 on two disks, it looks great.
Based on the random seeks though (578 seeks/sec), it looks like maybe it's 6
disks in a RAID10?
- Luke
On 8/16/06 7:10 PM, "Steve Poe" <steve.poe@gmail.com > wrote:
> Everyone,
>
> I wanted to follow-up on bonnie results for the internal RAID1 which is
> connected to the SmartArray 6i. I believe this is the problem, but I am
> not good at interepting the results. Here's an sample of three runs:
>
> scsi disc
> array ,16G,47983,67,65492,20,37214,6,73785,87,89787,6,578.2,0,16,+++++,
> +++,+++++,+++,+++++,+++,+++++,+++,+++++,+++,+++++,+++
> scsi disc
> array ,16G,54634,75,67793,21,36835,6,74190,88,89314,6, 579.9,0,16,+++++,
> +++,+++++,+++,+++++,+++,+++++,+++,+++++,+++,+++++,+++
> scsi disc
> array ,16G,55056,76,66108,20,36859,6,74108,87,89559,6,585.0,0,16,+++++,
> +++,+++++,+++,+++++,+++,+++++,+++,+
>
> This was run on the internal RAID1 on the outer portion of the discs
> formatted at ext2.
>
> Thanks.
>
> Steve
>
> On Thu, 2006-08-10 at 10:35 -0500, Scott Marlowe wrote:
>> On Thu, 2006-08-10 at 10:15, Luke Lonergan wrote:
>>> Mike,
>>>
>>> On 8/10/06 4:09 AM, "Michael Stone" <mstone+postgres@mathom.us > wrote:
>>>
>>>> On Wed, Aug 09, 2006 at 08:29:13PM -0700, Steve Poe wrote:
>>>>> I tried as you suggested and my performance dropped by 50%. I went from
>>>>> a 32 TPS to 16. Oh well.
>>>>
>>>> If you put data & xlog on the same array, put them on seperate
>>>> partitions, probably formatted differently (ext2 on xlog).
>>>
>>> If he's doing the same thing on both systems (Sun and HP) and the HP
>>> performance is dramatically worse despite using more disks and having faster
>>> CPUs and more RAM, ISTM the problem isn't the configuration.
>>>
>>> Add to this the fact that the Sun machine is CPU bound while the HP is I/O
>>> wait bound and I think the problem is the disk hardware or the driver
>>> therein.
>>
>> I agree. The problem here looks to be the RAID controller.
>>
>> Steve, got access to a different RAID controller to test with?
>>
>> ---------------------------(end of broadcast)---------------------------
>> TIP 1: if posting/reading through Usenet, please send an appropriate
>> subscribe-nomail command to majordomo@postgresql.org so that your
>> message can get through to the mailing list cleanly
>
>
Steve,

On 8/18/06 10:39 AM, "Steve Poe" <steve.poe@gmail.com> wrote:

> Nope. it is only a RAID1 for the 2 internal discs connected to the SmartArray
> 6i. This is where I *had* the pg_xlog located when the performance was very
> poor. Also, I just found out the default stripe size is 128k. Would this be a
> problem for pg_xlog?

ISTM that the main performance issue for xlog is going to be the rate at which fdatasync operations complete, and the stripe size shouldn't hurt that.

What are your postgresql.conf settings for the xlog: how many logfiles, sync_method, etc?

> The 6-disc RAID10 you speak of is on the SmartArray 642 RAID adapter.

Interesting - the seek rate is very good for two drives, are they 15K RPM?

- Luke
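A rough way to see the fdatasync completion rate Luke is talking about is a small write-and-sync loop. The sketch below is a hypothetical illustration (the file name, 8kB block size, and iteration count are arbitrary); the number it prints is approximately the ceiling on commit rate when the xlog sits on that volume.

/* Hypothetical micro-benchmark: how many 8kB write + fdatasync cycles
 * per second can this volume sustain?  Each cycle stands in roughly
 * for one synchronous commit hitting the xlog. */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/time.h>
#include <unistd.h>

#define ITERATIONS 1000

int main(void)
{
    char block[8192];
    struct timeval start, end;
    double secs;
    int i;
    int fd = open("fsync_test.dat", O_WRONLY | O_CREAT, 0600);

    if (fd < 0) { perror("open"); return 1; }
    memset(block, 'x', sizeof(block));

    gettimeofday(&start, NULL);
    for (i = 0; i < ITERATIONS; i++)
    {
        if (lseek(fd, 0, SEEK_SET) < 0 ||
            write(fd, block, sizeof(block)) != (ssize_t) sizeof(block) ||
            fdatasync(fd) != 0)
        { perror("write/fdatasync"); return 1; }
    }
    gettimeofday(&end, NULL);

    secs = (end.tv_sec - start.tv_sec) + (end.tv_usec - start.tv_usec) / 1e6;
    printf("%.1f synced writes/sec\n", ITERATIONS / secs);

    close(fd);
    unlink("fsync_test.dat");
    return 0;
}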
Luke,
ISTM that the main performance issue for xlog is going to be the rate at
which fdatasync operations complete, and the stripe size shouldn't hurt
that.
I thought so. However, I've also tried running the PGDATA off of the RAID1 as a test and it is poor.
What are your postgresql.conf settings for the xlog: how many logfiles,
sync_method, etc?
wal_sync_method = fsync # the default varies across platforms:
# fsync, fdatasync, open_sync, or open_datasync
# - Checkpoints -
checkpoint_segments = 14 # in logfile segments, min 1, 16MB each
checkpoint_timeout = 300 # range 30-3600, in seconds
#checkpoint_warning = 30 # 0 is off, in seconds
#commit_delay = 0 # range 0-100000, in microseconds
#commit_siblings = 5
What stumps me is I use the same settings on a Sun box (dual Opteron 4GB w/ LSI MegaRAID 128M) with the same data. This is on pg 7.4.13.
> The 6-disc RAID10 you speak of is on the SmartArray 642 RAID adapter.
Interesting - the seek rate is very good for two drives, are they 15K RPM?
Nope, 10K RPM.
HP's recommendation for testing is to connect the RAID1 to the second channel off of the SmartArray 642 adapter since they use the same driver, and, according to HP, I should not have to rebuild the RAID1.
I have to send the new server to the hospital next week, so I have very little testing time left.
Steve
Steve,
One thing here is that “wal_sync_method” should be set to “fdatasync” and not “fsync”. In fact, the default is fdatasync, but because you have uncommented the standard line in the file, it is changed to “fsync”, which is a lot slower. This is a bug in the file defaults.
That could speed things up quite a bit on the xlog.
WRT the difference between the two systems, I’m kind of stumped.
- Luke
On 8/18/06 12:00 PM, "Steve Poe" <steve.poe@gmail.com> wrote:
Luke,
ISTM that the main performance issue for xlog is going to be the rate at
which fdatasync operations complete, and the stripe size shouldn't hurt
that.
I thought so. However, I've also tried running the PGDATA off of the RAID1 as a test and it is poor.
What are your postgresql.conf settings for the xlog: how many logfiles,
sync_method, etc?
wal_sync_method = fsync # the default varies across platforms:
# fsync, fdatasync, open_sync, or open_datasync
# - Checkpoints -
checkpoint_segments = 14 # in logfile segments, min 1, 16MB each
checkpoint_timeout = 300 # range 30-3600, in seconds
#checkpoint_warning = 30 # 0 is off, in seconds
#commit_delay = 0 # range 0-100000, in microseconds
#commit_siblings = 5
What stumps me is I use the same settings on a Sun box (dual Opteron 4GB w/ LSI MegaRAID 128M) with the same data. This is on pg 7.4.13.
> The 6-disc RAID10 you speak of is on the SmartArray 642 RAID adapter.
Interesting - the seek rate is very good for two drives, are they 15K RPM?
Nope. 10K. RPM.
HP's recommendation for testing is to connect the RAID1 to the second channel off of the SmartArray 642 adapter since they use the same driver, and, according to HP, I should not have to rebuilt the RAID1.
I have to send the new server to the hospital next week, so I have very little testing time left.
Steve
Luke,
I'll try it, but you're right, it should not matter. The two systems are:
HP DL385 (dual Opteron 265, I believe), 8GB of RAM, two internal U320 10K discs in RAID1.
Sun W2100z (dual Opteron 245, I believe), 4GB of RAM, one U320 10K drive, with an LSI MegaRAID 2X 128M driving two external 4-disc arrays of U320 10K drives in a RAID10 configuration. Both run the same version of Linux (CentOS 4.3) and the same kernel version, with no kernel changes on either of them, and the same *.conf files for PostgreSQL 7.4.13.
Steve
On 8/18/06, Luke Lonergan <llonergan@greenplum.com> wrote:
Steve,
One thing here is that "wal_sync_method" should be set to "fdatasync" and not "fsync". In fact, the default is fdatasync, but because you have uncommented the standard line in the file, it is changed to "fsync", which is a lot slower. This is a bug in the file defaults.
That could speed things up quite a bit on the xlog.
WRT the difference between the two systems, I'm kind of stumped.
- Luke
On 8/18/06 12:00 PM, "Steve Poe" <steve.poe@gmail.com> wrote:
Luke,
ISTM that the main performance issue for xlog is going to be the rate at
which fdatasync operations complete, and the stripe size shouldn't hurt
that.
I thought so. However, I've also tried running the PGDATA off of the RAID1 as a test and it is poor.
What are your postgresql.conf settings for the xlog: how many logfiles,
sync_method, etc?
wal_sync_method = fsync # the default varies across platforms:
# fsync, fdatasync, open_sync, or open_datasync
# - Checkpoints -
checkpoint_segments = 14 # in logfile segments, min 1, 16MB each
checkpoint_timeout = 300 # range 30-3600, in seconds
#checkpoint_warning = 30 # 0 is off, in seconds
#commit_delay = 0 # range 0-100000, in microseconds
#commit_siblings = 5
What stumps me is I use the same settings on a Sun box (dual Opteron 4GB w/ LSI MegaRAID 128M) with the same data. This is on pg 7.4.13.
> The 6-disc RAID10 you speak of is on the SmartArray 642 RAID adapter.
Interesting - the seek rate is very good for two drives, are they 15K RPM?
Nope. 10K. RPM.
HP's recommendation for testing is to connect the RAID1 to the second channel off of the SmartArray 642 adapter since they use the same driver, and, according to HP, I should not have to rebuilt the RAID1.
I have to send the new server to the hospital next week, so I have very little testing time left.
Steve