Home > mailing lists

Thread: x206-x225

x206-x225

From

"H.J. Sanders"

Date:

10 March 2006, 05:50:25

Hello list.

We have compared 2 IBM x servers:

IBM X206 IBM X226

---------------------- -------------------

processor Pentium 4 3.2 Ghz Xeon 3.0 Ghz

main memory 1.25 GB 4 GB

discs 2 x SCSI RAID1 10000RPM 1 x ATA 7200 RPM

LINUX 2.6 (SUSE 9) same

PGSQL 7.4 same

postgresql.conf attached same

We have bij means of an informix-4GL program done the following test:

create table : name char(18)

adres char(20)

key integer

create index on (key)

Time at X206 Time at X226

-------------------- ------------------

insert record (key goes from 1 to 10000) 6 sec. 41 sec.

select record (key goes from 1 to 10000) 4 4

delete record (key goes from 1 to 10000) 6 41

This is ofcourse a totally unexpected results (you should think off the opposite).

Funny is that the select time is the same for both machines.

Does anybody has any any idea what can cause this strange results or where we

can start our investigations?

Regards

Henk Sanders

Attachment

postgresql.conf-74

Re: x206-x225

From

Richard Huxton

Date:

10 March 2006, 10:16:06

H.J. Sanders wrote:
> X206                                       IBM X226
> ----------------------                            -------------------
> processor                        Pentium 4 3.2
> Ghz                            Xeon 3.0 Ghz
> main memory                    1.25
> GB                                        4 GB
> discs                              2 x SCSI RAID1  10000RPM
> 1 x ATA 7200 RPM

Noting that the SCSI discs are on the *slower* machine.

> Time at X206            Time at X226
> --------------------        ------------------
> insert record (1 to 10000)          6 sec.    41 sec.
> select record (1 to 10000)          4          4
> delete record (1 to 10000)          6         41
>
>
> This is ofcourse a totally unexpected results (you should think off the
> opposite).

Your ATA disk is lying about disk caching being turned off. Assuming
each insert is in a separate transaction, then it's not going to do
10,000 / 6 = 1667 transactions/sec - that's faster than it's rotational
speed.

> Funny is that the select time is the same for both machines.

Because you're limited by the speed to read from RAM.

By the way - these sort of tests are pretty much meaningless in any
practical terms.
--
   Richard Huxton
   Archonet Ltd

Re: x206-x225

From

"Daniel Blaisdell"

Date:

10 March 2006, 10:27:47

The primary slow down is probably between your system bus from main memory to your disk storage. If you notice from your statistics that the select statements are very close. This is because all the data you need is already in system memory. The primary bottle neck is probably disk I/O. Scsi will always be faster than ATA. Scsi devices have dedicated hardware for getting data to and from the disc to the main system bus without requiring a trip through the CPU.

You may be able to speed up the ata disc by enabling DMA by using hdparm.

hdparm -d1 /dev/hda (or whatever your device is)

-Daniel

-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.2 (GNU/Linux)

iD8DBQBEEYzX9SJ2nhowvKERAoiFAKCLR+7a7ReZ2mjjPjpONHLGIQD1SgCeNNON
V1kbyATIFVPWuf1W6Ji0IFg=
=5Msr
-----END PGP SIGNATURE-----

On 3/10/06, Richard Huxton <dev@archonet.com> wrote:

H.J. Sanders wrote:
> X206                                       IBM X226
> ----------------------                            -------------------
> processor                        Pentium 4 3.2
> Ghz                            Xeon 3.0 Ghz
> main memory                    1.25
> GB                                        4 GB
> discs                              2 x SCSI RAID1  10000RPM
> 1 x ATA 7200 RPM

Noting that the SCSI discs are on the *slower* machine.

> Time at X206            Time at X226
> --------------------        ------------------
> insert record (1 to 10000)          6 sec.    41 sec.
> select record (1 to 10000)          4          4
> delete record (1 to 10000)          6         41
>
>
> This is ofcourse a totally unexpected results (you should think off the
> opposite).

Your ATA disk is lying about disk caching being turned off. Assuming
each insert is in a separate transaction, then it's not going to do
10,000 / 6 = 1667 transactions/sec - that's faster than it's rotational
speed.

> Funny is that the select time is the same for both machines.

Because you're limited by the speed to read from RAM.

By the way - these sort of tests are pretty much meaningless in any
practical terms.
--
   Richard Huxton
   Archonet Ltd

---------------------------(end of broadcast)---------------------------
TIP 6: explain analyze is your friend

Re: x206-x225

From

Joost Kraaijeveld

Date:

11 March 2006, 03:49:44

On Fri, 2006-03-10 at 13:40 +0000, Richard Huxton wrote:
> Your ATA disk is lying about disk caching being turned off. Assuming
> each insert is in a separate transaction, then it's not going to do
> 10,000 / 6 = 1667 transactions/sec - that's faster than it's rotational
> speed.
Could you explain the calculation? Why should the number of transactions
be related to the rotational speed of the disk, without saying anything
about the number of bytes per rotation?

--
Groeten,

Joost Kraaijeveld
Askesis B.V.
Molukkenstraat 14
6524NB Nijmegen
tel: 024-3888063 / 06-51855277
fax: 024-3608416
e-mail: J.Kraaijeveld@Askesis.nl
web: www.askesis.nl

Re: x206-x225

From

David Lang

Date:

11 March 2006, 03:57:30

On Sat, 11 Mar 2006, Joost Kraaijeveld wrote:

> On Fri, 2006-03-10 at 13:40 +0000, Richard Huxton wrote:
>> Your ATA disk is lying about disk caching being turned off. Assuming
>> each insert is in a separate transaction, then it's not going to do
>> 10,000 / 6 = 1667 transactions/sec - that's faster than it's rotational
>> speed.
> Could you explain the calculation? Why should the number of transactions
> be related to the rotational speed of the disk, without saying anything
> about the number of bytes per rotation?

each transaction requires a sync to the disk, a sync requires a real
write (which you then wait for), so you can only do one transaction per
rotation.

David Lang

Re: x206-x225

From

Joost Kraaijeveld

Date:

11 March 2006, 04:17:11

On Fri, 2006-03-10 at 23:57 -0800, David Lang wrote:
> On Sat, 11 Mar 2006, Joost Kraaijeveld wrote:
>
> > On Fri, 2006-03-10 at 13:40 +0000, Richard Huxton wrote:
> >> Your ATA disk is lying about disk caching being turned off. Assuming
> >> each insert is in a separate transaction, then it's not going to do
> >> 10,000 / 6 = 1667 transactions/sec - that's faster than it's rotational
> >> speed.
> > Could you explain the calculation? Why should the number of transactions
> > be related to the rotational speed of the disk, without saying anything
> > about the number of bytes per rotation?
>
> each transaction requires a sync to the disk, a sync requires a real
> write (which you then wait for), so you can only do one transaction per
> rotation.
Not according to a conversation I had with Western Digital about the
write performance of my own SATA disks. What I understand from their
explanation their disk are limited by the MB/sec and not by the number
of writes/second, e.g. I could write 50 MB/sec *in 1 bit/write* on my
disk. This would suggest that the maximum transactions of my disk
(overhead of OS and PostgreSQL ignored) would be 50MB / (transaction
size in MB) per second. Or am I missing something (what would not
surprise me, as I do not understand the perforance of my system at
all ;-))?

--
Groeten,

Joost Kraaijeveld
Askesis B.V.
Molukkenstraat 14
6524NB Nijmegen
tel: 024-3888063 / 06-51855277
fax: 024-3608416
e-mail: J.Kraaijeveld@Askesis.nl
web: www.askesis.nl

Re: x206-x225

From

PFC

Date:

11 March 2006, 07:33:05

>> each transaction requires a sync to the disk, a sync requires a real
>> write (which you then wait for), so you can only do one transaction per
>> rotation.
> Not according to a conversation I had with Western Digital about the


It depends if you consider that "written to the disk" means "data is
somewhere between the OS cache and the platter" or "data is writter on the
platter and will survive a power loss".

Postgres wants the second option, of course.

For that, the data has to be on the disk. Thus, the disk has to seek, wait
till the desired sector arrives in front of the head, write, and tell the
OS it's done. Your disk just stores data in its embedded RAM buffer and
tells the OS it's written, but if you lose power, you lose anything that's
in the disk embedded RAM cache...

Advanced RAID cards have battery backed up RAM cache precisely for that
purpose. Your harddisk doesn't.

Re: x206-x225

From

Joost Kraaijeveld

Date:

11 March 2006, 10:26:14

On Sat, 2006-03-11 at 12:33 +0100, PFC wrote:
> >> each transaction requires a sync to the disk, a sync requires a real
> >> write (which you then wait for), so you can only do one transaction per
> >> rotation.
> > Not according to a conversation I had with Western Digital about the
>
>
> It depends if you consider that "written to the disk" means "data is
> somewhere between the OS cache and the platter" or "data is writter on the
> platter and will survive a power loss".
>
> Postgres wants the second option, of course.

I assume that for PostgreSQL "written to disk" is after fsync returned
successfully. In practice that could very well mean that the data is
still in a cache somewhere (controller or harddisk, not in the OS
anymore, see also man page of fsync)


--
Groeten,

Joost Kraaijeveld
Askesis B.V.
Molukkenstraat 14
6524NB Nijmegen
tel: 024-3888063 / 06-51855277
fax: 024-3608416
e-mail: J.Kraaijeveld@Askesis.nl
web: www.askesis.nl

Re: x206-x225

From

Tom Lane

Date:

11 March 2006, 12:59:39

Joost Kraaijeveld <J.Kraaijeveld@Askesis.nl> writes:
> I assume that for PostgreSQL "written to disk" is after fsync returned
> successfully. In practice that could very well mean that the data is
> still in a cache somewhere (controller or harddisk, not in the OS
> anymore, see also man page of fsync)

What it had better mean, if you want your database to be reliable,
is that the data is stored someplace that will survive a system crash
(power outage, kernel panic, etc).  A battery-backed RAM cache is OK,
assuming that total failure of the RAID controller is not one of the
events you consider likely enough to need protection against.

The description of your SATA drive makes it sound like the drive
does not put data on the platter before reporting "write complete",
but only stores it in on-board RAM cache.  It is highly unlikely
that there is any battery backing for that cache, and therefore that
drive is not to be trusted.

            regards, tom lane

Re: x206-x225

From

Joost Kraaijeveld

Date:

11 March 2006, 14:26:40

On Sat, 2006-03-11 at 11:59 -0500, Tom Lane wrote:
> Joost Kraaijeveld <J.Kraaijeveld@Askesis.nl> writes:
> > I assume that for PostgreSQL "written to disk" is after fsync returned
> > successfully. In practice that could very well mean that the data is
> > still in a cache somewhere (controller or harddisk, not in the OS
> > anymore, see also man page of fsync)
>
> What it had better mean, if you want your database to be reliable,
> is that the data is stored someplace that will survive a system crash
> (power outage, kernel panic, etc).  A battery-backed RAM cache is OK,
> assuming that total failure of the RAID controller is not one of the
> events you consider likely enough to need protection against.

Maybe I should have expressed myself better. The parent post said:

> It depends if you consider that "written to the disk" means "data is
> somewhere between the OS cache and the platter" or "data is written on
> the platter and will survive a power loss".
>
> Postgres wants the second option, of course.

With my remark I meant that the only thing *PostgreSQL* can expect is
that the data is out of the OS: there is no greater guarantee in the
fsync function. If the *database administrator* wants better guarantees,
he (or she) better read your advise.

> The description of your SATA drive makes it sound like the drive
> does not put data on the platter before reporting "write complete",
> but only stores it in on-board RAM cache.  It is highly unlikely
> that there is any battery backing for that cache, and therefore that
> drive is not to be trusted.
Yep, the drives have a write cache, and indeed, they are not backed up
by a battery (neither is my RAID controller) but as this is a
test/development machine, I don't really care.

You made me rethink my production machine thought. I will have to check
the drives and the state of their write cache of that machine. Thanks
for that.



--
Groeten,

Joost Kraaijeveld
Askesis B.V.
Molukkenstraat 14
6524NB Nijmegen
tel: 024-3888063 / 06-51855277
fax: 024-3608416
e-mail: J.Kraaijeveld@Askesis.nl
web: www.askesis.nl

Re: x206-x225

From

David Lang

Date:

11 March 2006, 17:15:48

On Sat, 11 Mar 2006, Joost Kraaijeveld wrote:

> Date: Sat, 11 Mar 2006 09:17:09 +0100
> From: Joost Kraaijeveld <J.Kraaijeveld@Askesis.nl>
> To: David Lang <david@lang.hm>
> Cc: Richard Huxton <dev@archonet.com>, pgsql-performance@postgresql.org
> Subject: Re: [PERFORM] x206-x225
>
> On Fri, 2006-03-10 at 23:57 -0800, David Lang wrote:
>> On Sat, 11 Mar 2006, Joost Kraaijeveld wrote:
>>
>>> On Fri, 2006-03-10 at 13:40 +0000, Richard Huxton wrote:
>>>> Your ATA disk is lying about disk caching being turned off. Assuming
>>>> each insert is in a separate transaction, then it's not going to do
>>>> 10,000 / 6 = 1667 transactions/sec - that's faster than it's rotational
>>>> speed.
>>> Could you explain the calculation? Why should the number of transactions
>>> be related to the rotational speed of the disk, without saying anything
>>> about the number of bytes per rotation?
>>
>> each transaction requires a sync to the disk, a sync requires a real
>> write (which you then wait for), so you can only do one transaction per
>> rotation.
> Not according to a conversation I had with Western Digital about the
> write performance of my own SATA disks. What I understand from their
> explanation their disk are limited by the MB/sec and not by the number
> of writes/second, e.g. I could write 50 MB/sec *in 1 bit/write* on my
> disk. This would suggest that the maximum transactions of my disk
> (overhead of OS and PostgreSQL ignored) would be 50MB / (transaction
> size in MB) per second. Or am I missing something (what would not
> surprise me, as I do not understand the perforance of my system at
> all ;-))?

but if you do a 1 bit write, and wait for it to complete, and then do
another 1 bit write that belongs on disk immediatly after the first one
(and wait for it to complete) you have to wait until the disk rotates to
the point that it can make the write before it's really safe on disk.

so you can do one transaction in less then one rotation, but if you do 50
transactions you must wait at least 49 (and a fraction) roatations.

if the disk cache is turned on then you don't have to wait for this, but
you also will loose the data if you loose power so it's really not safe.

David Lang

Re: x206-x225

From

"Jim C. Nasby"

Date:

14 March 2006, 16:32:55

On Fri, Mar 10, 2006 at 11:57:16PM -0800, David Lang wrote:
> On Sat, 11 Mar 2006, Joost Kraaijeveld wrote:
>
> >On Fri, 2006-03-10 at 13:40 +0000, Richard Huxton wrote:
> >>Your ATA disk is lying about disk caching being turned off. Assuming
> >>each insert is in a separate transaction, then it's not going to do
> >>10,000 / 6 = 1667 transactions/sec - that's faster than it's rotational
> >>speed.
> >Could you explain the calculation? Why should the number of transactions
> >be related to the rotational speed of the disk, without saying anything
> >about the number of bytes per rotation?
>
> each transaction requires a sync to the disk, a sync requires a real
> write (which you then wait for), so you can only do one transaction per
> rotation.

But shouldn't it be possible to batch up WAL writes and syncs? In other
words, if you have 5 transactions that all COMMIT at exactly the same
time, it should be possible to get all 5 WAL pages (I'll assume each
one generated a small enough change so as not to require multiple WAL
pages) to the drive before the platter comes around to the right
position. The drive should then be able to write all 5 at once. At
least, theoretically...
--
Jim C. Nasby, Sr. Engineering Consultant      jnasby@pervasive.com
Pervasive Software      http://pervasive.com    work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf       cell: 512-569-9461

Re: x206-x225

From

Richard Huxton

Date:

14 March 2006, 17:46:05

Jim C. Nasby wrote:
> On Fri, Mar 10, 2006 at 11:57:16PM -0800, David Lang wrote:
>> On Sat, 11 Mar 2006, Joost Kraaijeveld wrote:
>>
>>> On Fri, 2006-03-10 at 13:40 +0000, Richard Huxton wrote:
>>>> Your ATA disk is lying about disk caching being turned off. Assuming
>>>> each insert is in a separate transaction, then it's not going to do
>>>> 10,000 / 6 = 1667 transactions/sec - that's faster than it's rotational
>>>> speed.
>>> Could you explain the calculation? Why should the number of transactions
>>> be related to the rotational speed of the disk, without saying anything
>>> about the number of bytes per rotation?
>> each transaction requires a sync to the disk, a sync requires a real
>> write (which you then wait for), so you can only do one transaction per
>> rotation.
>
> But shouldn't it be possible to batch up WAL writes and syncs? In other
> words, if you have 5 transactions that all COMMIT at exactly the same
> time, it should be possible to get all 5 WAL pages (I'll assume each
> one generated a small enough change so as not to require multiple WAL
> pages) to the drive before the platter comes around to the right
> position. The drive should then be able to write all 5 at once. At
> least, theoretically...

I think you mean this...

http://www.postgresql.org/docs/8.1/static/runtime-config-wal.html

commit_delay (integer)

     Time delay between writing a commit record to the WAL buffer and
flushing the buffer out to disk, in microseconds. A nonzero delay can
allow multiple transactions to be committed with only one fsync() system
call, if system load is high enough that additional transactions become
ready to commit within the given interval. But the delay is just wasted
if no other transactions become ready to commit. Therefore, the delay is
only performed if at least commit_siblings other transactions are active
at the instant that a server process has written its commit record. The
default is zero (no delay).

commit_siblings (integer)

     Minimum number of concurrent open transactions to require before
performing the commit_delay delay. A larger value makes it more probable
that at least one other transaction will become ready to commit during
the delay interval. The default is five.

--
   Richard Huxton
   Archonet Ltd

Re: x206-x225

From

"Jim C. Nasby"

Date:

14 March 2006, 18:08:20

On Tue, Mar 14, 2006 at 09:37:33PM +0000, Richard Huxton wrote:
> >But shouldn't it be possible to batch up WAL writes and syncs? In other
> >words, if you have 5 transactions that all COMMIT at exactly the same
> >time, it should be possible to get all 5 WAL pages (I'll assume each
> >one generated a small enough change so as not to require multiple WAL
> >pages) to the drive before the platter comes around to the right
> >position. The drive should then be able to write all 5 at once. At
> >least, theoretically...
>
> I think you mean this...
>
> http://www.postgresql.org/docs/8.1/static/runtime-config-wal.html
>
> commit_delay (integer)

No, that's not what I mean at all. On a system doing a large number of
WAL-generating transactions per second, it's certainly possible for
multiple transactions to commit in the period of time it takes for the
platter to rotate back into position to allow for writing of the WAL
data. What I don't know is if those multiple transactions would actually
make it to the platter on that rotation, or if they'd serialize,
resulting in one commit per revolution. I do know that there's no
theoretical reason that they couldn't, it's just a matter of putting
enough intelligence in the drive.

Perhaps this is something that SCSI supports and (S)ATA doesn't, since
SCSI allows multiple transactions to be 'in flight' on the bus at once.

But since you mention commit_delay, this does lead to an interesting
possible use: set it equal to the effective rotational period of the
drive. If you know your transaction load well enough, you could possibly
gain some benefit here. But of course a RAID controller with a BBU would
be a better bet...
--
Jim C. Nasby, Sr. Engineering Consultant      jnasby@pervasive.com
Pervasive Software      http://pervasive.com    work: 512-231-6117
vcard: http://jim.nasby.net/pervasive.vcf       cell: 512-569-9461

Re: x206-x225

From

Richard Huxton

Date:

15 March 2006, 05:43:46

Jim C. Nasby wrote:
>> I think you mean this...
>>
>> http://www.postgresql.org/docs/8.1/static/runtime-config-wal.html
>>
>> commit_delay (integer)
>
> No, that's not what I mean at all. On a system doing a large number of
> WAL-generating transactions per second, it's certainly possible for
> multiple transactions to commit in the period of time it takes for the
> platter to rotate back into position to allow for writing of the WAL
> data. What I don't know is if those multiple transactions would actually
> make it to the platter on that rotation, or if they'd serialize,
> resulting in one commit per revolution. I do know that there's no
> theoretical reason that they couldn't, it's just a matter of putting
> enough intelligence in the drive.
>
> Perhaps this is something that SCSI supports and (S)ATA doesn't, since
> SCSI allows multiple transactions to be 'in flight' on the bus at once.

SCSI Command queueing:
http://www.storagereview.com/guide2000/ref/hdd/if/scsi/protCQR.html

SATA "native command queuing":
http://www.tomshardware.com/2004/11/16/can_command_queuing_turbo_charge_sata/

> But since you mention commit_delay, this does lead to an interesting
> possible use: set it equal to the effective rotational period of the
> drive. If you know your transaction load well enough, you could possibly
> gain some benefit here. But of course a RAID controller with a BBU would
> be a better bet...

I suppose as long as you always have several transactions trying to
commit, have a separate spindle(s) for the WAL then you could improve
throughput at the cost of the shortest transaction times. Of course, it
might be that the increase in lock duration etc. might outweigh any
benefits. I'd suspect the cost/gain would be highly variable with
changes in workload, and as you say write-cache+BBU seems more sensible.


--
   Richard Huxton
   Archonet Ltd