Thread: x206-x225
Hello list.
We have compared 2 IBM x servers:
IBM X206 IBM X226
---------------------- -------------------
processor Pentium 4 3.2 Ghz Xeon 3.0 Ghz
main memory 1.25 GB 4 GB
discs 2 x SCSI RAID1 10000RPM 1 x ATA 7200 RPM
LINUX 2.6 (SUSE 9) same
PGSQL 7.4 same
postgresql.conf attached same
We have bij means of an informix-4GL program done the following test:
create table : name char(18)
adres char(20)
key integer
create index on (key)
Time at X206 Time at X226
-------------------- ------------------
insert record (key goes from 1 to 10000) 6 sec. 41 sec.
select record (key goes from 1 to 10000) 4 4
delete record (key goes from 1 to 10000) 6 41
This is ofcourse a totally unexpected results (you should think off the opposite).
Funny is that the select time is the same for both machines.
Does anybody has any any idea what can cause this strange results or where we
can start our investigations?
Regards
Henk Sanders
Attachment
H.J. Sanders wrote: > X206 IBM X226 > ---------------------- ------------------- > processor Pentium 4 3.2 > Ghz Xeon 3.0 Ghz > main memory 1.25 > GB 4 GB > discs 2 x SCSI RAID1 10000RPM > 1 x ATA 7200 RPM Noting that the SCSI discs are on the *slower* machine. > Time at X206 Time at X226 > -------------------- ------------------ > insert record (1 to 10000) 6 sec. 41 sec. > select record (1 to 10000) 4 4 > delete record (1 to 10000) 6 41 > > > This is ofcourse a totally unexpected results (you should think off the > opposite). Your ATA disk is lying about disk caching being turned off. Assuming each insert is in a separate transaction, then it's not going to do 10,000 / 6 = 1667 transactions/sec - that's faster than it's rotational speed. > Funny is that the select time is the same for both machines. Because you're limited by the speed to read from RAM. By the way - these sort of tests are pretty much meaningless in any practical terms. -- Richard Huxton Archonet Ltd
The primary slow down is probably between your system bus from main memory to your disk storage. If you notice from your statistics that the select statements are very close. This is because all the data you need is already in system memory. The primary bottle neck is probably disk I/O. Scsi will always be faster than ATA. Scsi devices have dedicated hardware for getting data to and from the disc to the main system bus without requiring a trip through the CPU.
You may be able to speed up the ata disc by enabling DMA by using hdparm.
hdparm -d1 /dev/hda (or whatever your device is)
-Daniel
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.2 (GNU/Linux)
iD8DBQBEEYzX9SJ2nhowvKERAoiFAKCLR+7a7ReZ2mjjPjpONHLGIQD1SgCeNNON
V1kbyATIFVPWuf1W6Ji0IFg=
=5Msr
-----END PGP SIGNATURE-----
You may be able to speed up the ata disc by enabling DMA by using hdparm.
hdparm -d1 /dev/hda (or whatever your device is)
-Daniel
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.2 (GNU/Linux)
iD8DBQBEEYzX9SJ2nhowvKERAoiFAKCLR+7a7ReZ2mjjPjpONHLGIQD1SgCeNNON
V1kbyATIFVPWuf1W6Ji0IFg=
=5Msr
-----END PGP SIGNATURE-----
On 3/10/06, Richard Huxton <dev@archonet.com> wrote:
H.J. Sanders wrote:
> X206 IBM X226
> ---------------------- -------------------
> processor Pentium 4 3.2
> Ghz Xeon 3.0 Ghz
> main memory 1.25
> GB 4 GB
> discs 2 x SCSI RAID1 10000RPM
> 1 x ATA 7200 RPM
Noting that the SCSI discs are on the *slower* machine.
> Time at X206 Time at X226
> -------------------- ------------------
> insert record (1 to 10000) 6 sec. 41 sec.
> select record (1 to 10000) 4 4
> delete record (1 to 10000) 6 41
>
>
> This is ofcourse a totally unexpected results (you should think off the
> opposite).
Your ATA disk is lying about disk caching being turned off. Assuming
each insert is in a separate transaction, then it's not going to do
10,000 / 6 = 1667 transactions/sec - that's faster than it's rotational
speed.
> Funny is that the select time is the same for both machines.
Because you're limited by the speed to read from RAM.
By the way - these sort of tests are pretty much meaningless in any
practical terms.
--
Richard Huxton
Archonet Ltd
---------------------------(end of broadcast)---------------------------
TIP 6: explain analyze is your friend
On Fri, 2006-03-10 at 13:40 +0000, Richard Huxton wrote: > Your ATA disk is lying about disk caching being turned off. Assuming > each insert is in a separate transaction, then it's not going to do > 10,000 / 6 = 1667 transactions/sec - that's faster than it's rotational > speed. Could you explain the calculation? Why should the number of transactions be related to the rotational speed of the disk, without saying anything about the number of bytes per rotation? -- Groeten, Joost Kraaijeveld Askesis B.V. Molukkenstraat 14 6524NB Nijmegen tel: 024-3888063 / 06-51855277 fax: 024-3608416 e-mail: J.Kraaijeveld@Askesis.nl web: www.askesis.nl
On Sat, 11 Mar 2006, Joost Kraaijeveld wrote: > On Fri, 2006-03-10 at 13:40 +0000, Richard Huxton wrote: >> Your ATA disk is lying about disk caching being turned off. Assuming >> each insert is in a separate transaction, then it's not going to do >> 10,000 / 6 = 1667 transactions/sec - that's faster than it's rotational >> speed. > Could you explain the calculation? Why should the number of transactions > be related to the rotational speed of the disk, without saying anything > about the number of bytes per rotation? each transaction requires a sync to the disk, a sync requires a real write (which you then wait for), so you can only do one transaction per rotation. David Lang
On Fri, 2006-03-10 at 23:57 -0800, David Lang wrote: > On Sat, 11 Mar 2006, Joost Kraaijeveld wrote: > > > On Fri, 2006-03-10 at 13:40 +0000, Richard Huxton wrote: > >> Your ATA disk is lying about disk caching being turned off. Assuming > >> each insert is in a separate transaction, then it's not going to do > >> 10,000 / 6 = 1667 transactions/sec - that's faster than it's rotational > >> speed. > > Could you explain the calculation? Why should the number of transactions > > be related to the rotational speed of the disk, without saying anything > > about the number of bytes per rotation? > > each transaction requires a sync to the disk, a sync requires a real > write (which you then wait for), so you can only do one transaction per > rotation. Not according to a conversation I had with Western Digital about the write performance of my own SATA disks. What I understand from their explanation their disk are limited by the MB/sec and not by the number of writes/second, e.g. I could write 50 MB/sec *in 1 bit/write* on my disk. This would suggest that the maximum transactions of my disk (overhead of OS and PostgreSQL ignored) would be 50MB / (transaction size in MB) per second. Or am I missing something (what would not surprise me, as I do not understand the perforance of my system at all ;-))? -- Groeten, Joost Kraaijeveld Askesis B.V. Molukkenstraat 14 6524NB Nijmegen tel: 024-3888063 / 06-51855277 fax: 024-3608416 e-mail: J.Kraaijeveld@Askesis.nl web: www.askesis.nl
>> each transaction requires a sync to the disk, a sync requires a real >> write (which you then wait for), so you can only do one transaction per >> rotation. > Not according to a conversation I had with Western Digital about the It depends if you consider that "written to the disk" means "data is somewhere between the OS cache and the platter" or "data is writter on the platter and will survive a power loss". Postgres wants the second option, of course. For that, the data has to be on the disk. Thus, the disk has to seek, wait till the desired sector arrives in front of the head, write, and tell the OS it's done. Your disk just stores data in its embedded RAM buffer and tells the OS it's written, but if you lose power, you lose anything that's in the disk embedded RAM cache... Advanced RAID cards have battery backed up RAM cache precisely for that purpose. Your harddisk doesn't.
On Sat, 2006-03-11 at 12:33 +0100, PFC wrote: > >> each transaction requires a sync to the disk, a sync requires a real > >> write (which you then wait for), so you can only do one transaction per > >> rotation. > > Not according to a conversation I had with Western Digital about the > > > It depends if you consider that "written to the disk" means "data is > somewhere between the OS cache and the platter" or "data is writter on the > platter and will survive a power loss". > > Postgres wants the second option, of course. I assume that for PostgreSQL "written to disk" is after fsync returned successfully. In practice that could very well mean that the data is still in a cache somewhere (controller or harddisk, not in the OS anymore, see also man page of fsync) -- Groeten, Joost Kraaijeveld Askesis B.V. Molukkenstraat 14 6524NB Nijmegen tel: 024-3888063 / 06-51855277 fax: 024-3608416 e-mail: J.Kraaijeveld@Askesis.nl web: www.askesis.nl
Joost Kraaijeveld <J.Kraaijeveld@Askesis.nl> writes: > I assume that for PostgreSQL "written to disk" is after fsync returned > successfully. In practice that could very well mean that the data is > still in a cache somewhere (controller or harddisk, not in the OS > anymore, see also man page of fsync) What it had better mean, if you want your database to be reliable, is that the data is stored someplace that will survive a system crash (power outage, kernel panic, etc). A battery-backed RAM cache is OK, assuming that total failure of the RAID controller is not one of the events you consider likely enough to need protection against. The description of your SATA drive makes it sound like the drive does not put data on the platter before reporting "write complete", but only stores it in on-board RAM cache. It is highly unlikely that there is any battery backing for that cache, and therefore that drive is not to be trusted. regards, tom lane
On Sat, 2006-03-11 at 11:59 -0500, Tom Lane wrote: > Joost Kraaijeveld <J.Kraaijeveld@Askesis.nl> writes: > > I assume that for PostgreSQL "written to disk" is after fsync returned > > successfully. In practice that could very well mean that the data is > > still in a cache somewhere (controller or harddisk, not in the OS > > anymore, see also man page of fsync) > > What it had better mean, if you want your database to be reliable, > is that the data is stored someplace that will survive a system crash > (power outage, kernel panic, etc). A battery-backed RAM cache is OK, > assuming that total failure of the RAID controller is not one of the > events you consider likely enough to need protection against. Maybe I should have expressed myself better. The parent post said: > It depends if you consider that "written to the disk" means "data is > somewhere between the OS cache and the platter" or "data is written on > the platter and will survive a power loss". > > Postgres wants the second option, of course. With my remark I meant that the only thing *PostgreSQL* can expect is that the data is out of the OS: there is no greater guarantee in the fsync function. If the *database administrator* wants better guarantees, he (or she) better read your advise. > The description of your SATA drive makes it sound like the drive > does not put data on the platter before reporting "write complete", > but only stores it in on-board RAM cache. It is highly unlikely > that there is any battery backing for that cache, and therefore that > drive is not to be trusted. Yep, the drives have a write cache, and indeed, they are not backed up by a battery (neither is my RAID controller) but as this is a test/development machine, I don't really care. You made me rethink my production machine thought. I will have to check the drives and the state of their write cache of that machine. Thanks for that. -- Groeten, Joost Kraaijeveld Askesis B.V. Molukkenstraat 14 6524NB Nijmegen tel: 024-3888063 / 06-51855277 fax: 024-3608416 e-mail: J.Kraaijeveld@Askesis.nl web: www.askesis.nl
On Sat, 11 Mar 2006, Joost Kraaijeveld wrote: > Date: Sat, 11 Mar 2006 09:17:09 +0100 > From: Joost Kraaijeveld <J.Kraaijeveld@Askesis.nl> > To: David Lang <david@lang.hm> > Cc: Richard Huxton <dev@archonet.com>, pgsql-performance@postgresql.org > Subject: Re: [PERFORM] x206-x225 > > On Fri, 2006-03-10 at 23:57 -0800, David Lang wrote: >> On Sat, 11 Mar 2006, Joost Kraaijeveld wrote: >> >>> On Fri, 2006-03-10 at 13:40 +0000, Richard Huxton wrote: >>>> Your ATA disk is lying about disk caching being turned off. Assuming >>>> each insert is in a separate transaction, then it's not going to do >>>> 10,000 / 6 = 1667 transactions/sec - that's faster than it's rotational >>>> speed. >>> Could you explain the calculation? Why should the number of transactions >>> be related to the rotational speed of the disk, without saying anything >>> about the number of bytes per rotation? >> >> each transaction requires a sync to the disk, a sync requires a real >> write (which you then wait for), so you can only do one transaction per >> rotation. > Not according to a conversation I had with Western Digital about the > write performance of my own SATA disks. What I understand from their > explanation their disk are limited by the MB/sec and not by the number > of writes/second, e.g. I could write 50 MB/sec *in 1 bit/write* on my > disk. This would suggest that the maximum transactions of my disk > (overhead of OS and PostgreSQL ignored) would be 50MB / (transaction > size in MB) per second. Or am I missing something (what would not > surprise me, as I do not understand the perforance of my system at > all ;-))? but if you do a 1 bit write, and wait for it to complete, and then do another 1 bit write that belongs on disk immediatly after the first one (and wait for it to complete) you have to wait until the disk rotates to the point that it can make the write before it's really safe on disk. so you can do one transaction in less then one rotation, but if you do 50 transactions you must wait at least 49 (and a fraction) roatations. if the disk cache is turned on then you don't have to wait for this, but you also will loose the data if you loose power so it's really not safe. David Lang
On Fri, Mar 10, 2006 at 11:57:16PM -0800, David Lang wrote: > On Sat, 11 Mar 2006, Joost Kraaijeveld wrote: > > >On Fri, 2006-03-10 at 13:40 +0000, Richard Huxton wrote: > >>Your ATA disk is lying about disk caching being turned off. Assuming > >>each insert is in a separate transaction, then it's not going to do > >>10,000 / 6 = 1667 transactions/sec - that's faster than it's rotational > >>speed. > >Could you explain the calculation? Why should the number of transactions > >be related to the rotational speed of the disk, without saying anything > >about the number of bytes per rotation? > > each transaction requires a sync to the disk, a sync requires a real > write (which you then wait for), so you can only do one transaction per > rotation. But shouldn't it be possible to batch up WAL writes and syncs? In other words, if you have 5 transactions that all COMMIT at exactly the same time, it should be possible to get all 5 WAL pages (I'll assume each one generated a small enough change so as not to require multiple WAL pages) to the drive before the platter comes around to the right position. The drive should then be able to write all 5 at once. At least, theoretically... -- Jim C. Nasby, Sr. Engineering Consultant jnasby@pervasive.com Pervasive Software http://pervasive.com work: 512-231-6117 vcard: http://jim.nasby.net/pervasive.vcf cell: 512-569-9461
Jim C. Nasby wrote: > On Fri, Mar 10, 2006 at 11:57:16PM -0800, David Lang wrote: >> On Sat, 11 Mar 2006, Joost Kraaijeveld wrote: >> >>> On Fri, 2006-03-10 at 13:40 +0000, Richard Huxton wrote: >>>> Your ATA disk is lying about disk caching being turned off. Assuming >>>> each insert is in a separate transaction, then it's not going to do >>>> 10,000 / 6 = 1667 transactions/sec - that's faster than it's rotational >>>> speed. >>> Could you explain the calculation? Why should the number of transactions >>> be related to the rotational speed of the disk, without saying anything >>> about the number of bytes per rotation? >> each transaction requires a sync to the disk, a sync requires a real >> write (which you then wait for), so you can only do one transaction per >> rotation. > > But shouldn't it be possible to batch up WAL writes and syncs? In other > words, if you have 5 transactions that all COMMIT at exactly the same > time, it should be possible to get all 5 WAL pages (I'll assume each > one generated a small enough change so as not to require multiple WAL > pages) to the drive before the platter comes around to the right > position. The drive should then be able to write all 5 at once. At > least, theoretically... I think you mean this... http://www.postgresql.org/docs/8.1/static/runtime-config-wal.html commit_delay (integer) Time delay between writing a commit record to the WAL buffer and flushing the buffer out to disk, in microseconds. A nonzero delay can allow multiple transactions to be committed with only one fsync() system call, if system load is high enough that additional transactions become ready to commit within the given interval. But the delay is just wasted if no other transactions become ready to commit. Therefore, the delay is only performed if at least commit_siblings other transactions are active at the instant that a server process has written its commit record. The default is zero (no delay). commit_siblings (integer) Minimum number of concurrent open transactions to require before performing the commit_delay delay. A larger value makes it more probable that at least one other transaction will become ready to commit during the delay interval. The default is five. -- Richard Huxton Archonet Ltd
On Tue, Mar 14, 2006 at 09:37:33PM +0000, Richard Huxton wrote: > >But shouldn't it be possible to batch up WAL writes and syncs? In other > >words, if you have 5 transactions that all COMMIT at exactly the same > >time, it should be possible to get all 5 WAL pages (I'll assume each > >one generated a small enough change so as not to require multiple WAL > >pages) to the drive before the platter comes around to the right > >position. The drive should then be able to write all 5 at once. At > >least, theoretically... > > I think you mean this... > > http://www.postgresql.org/docs/8.1/static/runtime-config-wal.html > > commit_delay (integer) No, that's not what I mean at all. On a system doing a large number of WAL-generating transactions per second, it's certainly possible for multiple transactions to commit in the period of time it takes for the platter to rotate back into position to allow for writing of the WAL data. What I don't know is if those multiple transactions would actually make it to the platter on that rotation, or if they'd serialize, resulting in one commit per revolution. I do know that there's no theoretical reason that they couldn't, it's just a matter of putting enough intelligence in the drive. Perhaps this is something that SCSI supports and (S)ATA doesn't, since SCSI allows multiple transactions to be 'in flight' on the bus at once. But since you mention commit_delay, this does lead to an interesting possible use: set it equal to the effective rotational period of the drive. If you know your transaction load well enough, you could possibly gain some benefit here. But of course a RAID controller with a BBU would be a better bet... -- Jim C. Nasby, Sr. Engineering Consultant jnasby@pervasive.com Pervasive Software http://pervasive.com work: 512-231-6117 vcard: http://jim.nasby.net/pervasive.vcf cell: 512-569-9461
Jim C. Nasby wrote: >> I think you mean this... >> >> http://www.postgresql.org/docs/8.1/static/runtime-config-wal.html >> >> commit_delay (integer) > > No, that's not what I mean at all. On a system doing a large number of > WAL-generating transactions per second, it's certainly possible for > multiple transactions to commit in the period of time it takes for the > platter to rotate back into position to allow for writing of the WAL > data. What I don't know is if those multiple transactions would actually > make it to the platter on that rotation, or if they'd serialize, > resulting in one commit per revolution. I do know that there's no > theoretical reason that they couldn't, it's just a matter of putting > enough intelligence in the drive. > > Perhaps this is something that SCSI supports and (S)ATA doesn't, since > SCSI allows multiple transactions to be 'in flight' on the bus at once. SCSI Command queueing: http://www.storagereview.com/guide2000/ref/hdd/if/scsi/protCQR.html SATA "native command queuing": http://www.tomshardware.com/2004/11/16/can_command_queuing_turbo_charge_sata/ > But since you mention commit_delay, this does lead to an interesting > possible use: set it equal to the effective rotational period of the > drive. If you know your transaction load well enough, you could possibly > gain some benefit here. But of course a RAID controller with a BBU would > be a better bet... I suppose as long as you always have several transactions trying to commit, have a separate spindle(s) for the WAL then you could improve throughput at the cost of the shortest transaction times. Of course, it might be that the increase in lock duration etc. might outweigh any benefits. I'd suspect the cost/gain would be highly variable with changes in workload, and as you say write-cache+BBU seems more sensible. -- Richard Huxton Archonet Ltd