Re: BLCKSZ - Mailing list pgsql-performance

From David Lang
Subject Re: BLCKSZ
Date
Msg-id Pine.LNX.4.62.0512060318070.2807@qnivq.ynat.uz
In response to Re: BLCKSZ  ("Steinar H. Gunderson" <sgunderson@bigfoot.com>)
List pgsql-performance
On Tue, 6 Dec 2005, Steinar H. Gunderson wrote:

> On Tue, Dec 06, 2005 at 01:40:47PM +0300, Olleg wrote:
>> I can't understand why "bigger is better". Take a search by index, for
>> instance. The index points to a page, and I need to load that page to
>> get one row. Thus I load 8kb from disk for every row, and then keep it
>> in cache. You recommend 64kb. With your recommendation I'll need 8 times
>> more IO throughput, 8 times more head seeks on disk, and 8 times more
>> memory cache (OS cache and postgresql) will be kept busy.
>
> Hopefully, you won't have eight times the seeking; a single block ought to be
> in one chunk on disk. You're of course at your filesystem's mercy, though.

In fact it would usually mean 1/8 as many seeks. Since the 64k chunk is
created all at once, it's probably going to be one chunk on disk, as
Steinar points out, and that means you do one seek per 64k instead of one
seek per 8k.

With current disks it's getting to the point where it costs the same to
read 8k as it does to read 64k (i.e. almost free; you could read
substantially more than 64k and not notice it in I/O speed). It's the
seeks that are expensive.
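
To put rough numbers on that, here is a minimal sketch of the "seek then
read" cost model. The seek time and transfer rate are assumptions picked
for illustration, not measurements from any particular drive, and the
names read_ms and seeks_needed are made up for this example.

```python
# Illustrative numbers only (assumptions, not measurements).
SEEK_MS = 8.0              # assumed average seek + rotational latency
TRANSFER_MB_PER_S = 60.0   # assumed sustained sequential transfer rate


def read_ms(block_bytes, seek_ms=SEEK_MS, mb_per_s=TRANSFER_MB_PER_S):
    """Time for one 'seek, then read' of a single contiguous block."""
    transfer_ms = block_bytes / (mb_per_s * 1024 * 1024) * 1000.0
    return seek_ms + transfer_ms


def seeks_needed(total_bytes, block_bytes):
    """Seeks needed to fetch total_bytes if every block costs one seek."""
    return -(-total_bytes // block_bytes)  # ceiling division


if __name__ == "__main__":
    for kb in (8, 64):
        print(f"{kb:>2}k block: {read_ms(kb * 1024):.2f} ms per read")
    total = 64 * 1024
    print("seeks for 64k of nearby data:",
          seeks_needed(total, 8 * 1024), "with 8k blocks vs",
          seeks_needed(total, 64 * 1024), "with 64k blocks")
```

With these assumed numbers a single 64k read costs about 9 ms against
about 8.1 ms for 8k, so the larger block is only slightly slower per
read but needs one seek instead of eight when the nearby data is wanted.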

Yes, it will eat up more RAM, but assuming that you are likely to need
other things nearby, it's likely to be a win.

As processor speed keeps climbing compared to memory and disk speed, true
random access is really not the correct way to think about I/O anymore.
It's frequently more appropriate to think of your memory and disks as if
they were tape drives (seek, then read, repeat).

Even for memory access, what you really do is seek to the beginning of a
block (expensive), then read that block into cache (cheap; you get the
entire cacheline of 64-128 bytes whether you need it or not), and then
you can access that block fairly quickly. With memory on SMP machines
it's a constant cost to seek anywhere in memory; with NUMA machines
(including multi-socket Opterons) the cost of the seek and fetch
depends on where in memory you are seeking to and what CPU you are running
on. It also becomes very expensive for multiple CPUs to write to memory
addresses that are in the same block (cacheline) of memory.
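
As a small sketch of that last point, the following shows how to tell
whether two addresses land in the same cacheline, and how padding
per-CPU data keeps them apart. The 64-byte line size is an assumption
(the low end of the range above), and the function names are made up
for this example.

```python
CACHE_LINE_BYTES = 64  # assumed line size; real hardware varies


def cache_line(addr, line_bytes=CACHE_LINE_BYTES):
    """Index of the cacheline that a byte address falls into."""
    return addr // line_bytes


def same_line(a, b, line_bytes=CACHE_LINE_BYTES):
    """True if two addresses share a cacheline (writers would contend)."""
    return cache_line(a, line_bytes) == cache_line(b, line_bytes)


def pad_to_line(size, line_bytes=CACHE_LINE_BYTES):
    """Round a record size up so each record starts on its own line."""
    return -(-size // line_bytes) * line_bytes


if __name__ == "__main__":
    counter_size = 8  # two adjacent 8-byte counters, one per CPU
    print("unpadded share a line:", same_line(0, counter_size))
    padded = pad_to_line(counter_size)
    print("padded share a line:  ", same_line(0, padded))
```

The usual fix when two CPUs write to neighboring data is the padding
shown here: spend some memory so that each writer gets its own line.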

For disks it's even more dramatic. The seek is incredibly expensive
compared to the read/write, and the cost of the seek varies with how
far you need to seek, but once you are on a track you can read the entire
track in for about the same cost as a single block (in fact the drive
usually does read the entire track before sending the one block on to
you). RAID complicates this because you have a block size per drive, and
reading more than that block size involves multiple drives.
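
To illustrate the RAID point, here is a minimal sketch that counts how
many drives a single read touches. It assumes a plain round-robin
striped layout (RAID-0 style, no parity), which is a simplification;
the function name and the chunk sizes in the example are hypothetical.

```python
def drives_touched(offset, length, chunk_bytes, n_drives):
    """Distinct drives hit by one read on a round-robin striped array.

    offset and length are in bytes; chunk_bytes is the per-drive block
    (stripe unit) size. Parity and mirroring are ignored.
    """
    if length <= 0:
        return 0
    first_chunk = offset // chunk_bytes
    last_chunk = (offset + length - 1) // chunk_bytes
    chunks = last_chunk - first_chunk + 1
    # Consecutive chunks go to consecutive drives, so the count is capped
    # by the number of drives in the array.
    return min(chunks, n_drives)


if __name__ == "__main__":
    k = 1024
    # 8k database block inside a 64k chunk: one drive seeks.
    print(drives_touched(0, 8 * k, 64 * k, 4))
    # 64k database block over 8k chunks: every drive in the array seeks.
    print(drives_touched(0, 64 * k, 8 * k, 4))
    # 8k read that straddles a chunk boundary: two drives seek.
    print(drives_touched(60 * k, 8 * k, 64 * k, 4))
```

The design point is that a database block larger than the per-drive
chunk turns one logical read into a seek on several drives at once.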

Most of the work in dealing with these issues and optimizing for them is
the job of the OS. Some other databases work very hard to take over this
work from the OS; Postgres instead tries to let the OS do this work, but
we still need to keep it in mind when configuring things, because it's
possible to make it much easier or much harder for the OS to optimize
things.

David Lang
