Thread: BLCKSZ

BLCKSZ

From: Olleg Samoylov
src/include/pg_config_manual.h defines BLCKSZ as 8192 (8 kB).

Somewhere I read that BLCKSZ should equal the memory page size of the
operating system, and that the default BLCKSZ is 8 kB because the first
OS Postgres was built on had an 8 kB memory page size.

I tried to test this. Linux, 4 kB memory page, 4 kB disk block. I set
BLCKSZ to 4 kB and got some performance improvement, but not a big one,
maybe because I have 4 GB of RAM on the test server (amd64).

Can anyone else test this too? Maybe it would be better to move BLCKSZ
from pg_config_manual.h to pg_config.h?
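As a starting point for such tests, the OS memory page size is easy to check from a script (a sketch using Python's standard library; on 7.4 you would then edit BLCKSZ in pg_config_manual.h and rebuild PostgreSQL):

```python
# Query the OS memory page size via POSIX sysconf.
import os

page_size = os.sysconf("SC_PAGE_SIZE")  # typically 4096 on Linux/amd64
print(page_size)
```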

--
Olleg Samoylov

Re: BLCKSZ

From: Tom Lane
Olleg Samoylov <olleg_s@mail.ru> writes:
> I tried to test this. Linux, 4 kB memory page, 4 kB disk block. I set
> BLCKSZ to 4 kB and got some performance improvement, but not a big one,
> maybe because I have 4 GB of RAM on the test server (amd64).

It's highly unlikely that reducing BLCKSZ is a good idea.  There are bad
side-effects on the maximum index entry size, maximum number of tuple
fields, etc.  In any case, when you didn't say *what* you tested, it's
impossible to judge the usefulness of the change.

            regards, tom lane

Re: BLCKSZ

From: Olleg
Tom Lane wrote:
> Olleg Samoylov <olleg_s@mail.ru> writes:
>
>>I tried to test this. Linux, 4 kB memory page, 4 kB disk block. I set
>>BLCKSZ to 4 kB and got some performance improvement, but not a big one,
>>maybe because I have 4 GB of RAM on the test server (amd64).
>
> It's highly unlikely that reducing BLCKSZ is a good idea.  There are bad
> side-effects on the maximum index entry size, maximum number of tuple
> fields, etc.

Yes, with BLCKSZ=512 the database doesn't work at all, and with
BLCKSZ=1024 it is very slow. (This surprised me; I expected an 8x
performance increase with BLCKSZ=1024. :) ) As I have already seen on
this mailing list, increasing BLCKSZ can reduce performance too. Maybe
an optimum value exists? Theoretically, a BLCKSZ equal to the
memory/disk page/block size could reduce fragmentation overhead in both
memory and on disk.

> In any case, when you didn't say *what* you tested, it's
> impossible to judge the usefulness of the change.
>             regards, tom lane

I tested performance on a database test server, a copy of a working
billing system used to test new features and experiments. The test task
was one day's traffic log. The average time of one test run was 260
minutes. PostgreSQL 7.4.8, dual Opteron 240 server, 4 GB RAM.

--
Olleg

Re: BLCKSZ

From: Alvaro Herrera
Olleg wrote:

> I tested performance on a database test server, a copy of a working
> billing system used to test new features and experiments. The test task
> was one day's traffic log. The average time of one test run was 260
> minutes. PostgreSQL 7.4.8, dual Opteron 240 server, 4 GB RAM.

Did you execute queries from the log, one after another?  That may not
be a representative test -- try sending multiple queries in parallel, to
see how the server would perform in the real world.
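One way to replay a log with concurrency is a small harness along these lines (a sketch only; `run_query` and the `traffic_log` table name are hypothetical stand-ins for a real driver call against the billing database):

```python
# Sketch of replaying a query log with several concurrent "clients"
# instead of one serial stream. run_query is a hypothetical stand-in
# for a real database call; here it just echoes so the harness runs
# standalone.
from concurrent.futures import ThreadPoolExecutor

def run_query(sql):
    # In a real test this would submit sql to the server and fetch results.
    return "done: " + sql

queries = ["SELECT * FROM traffic_log WHERE id = %d" % i for i in range(100)]

# Serial replay: one query after another, as a naive log replay does.
serial_results = [run_query(q) for q in queries]

# Parallel replay: 8 concurrent workers, closer to real-world load.
with ThreadPoolExecutor(max_workers=8) as pool:
    parallel_results = list(pool.map(run_query, queries))
```

`pool.map` preserves input order, so the two result lists are directly comparable.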

--
Alvaro Herrera                                http://www.CommandPrompt.com/
The PostgreSQL Company - Command Prompt, Inc.

Re: BLCKSZ

From: Ron
At 04:32 PM 12/5/2005, Olleg wrote:
>Tom Lane wrote:
>>Olleg Samoylov <olleg_s@mail.ru> writes:
>>
>>>I tried to test this. Linux, 4 kB memory page, 4 kB disk block. I set
>>>BLCKSZ to 4 kB and got some performance improvement, but not a big
>>>one, maybe because I have 4 GB of RAM on the test server (amd64).
>>It's highly unlikely that reducing BLCKSZ is a good idea.  There
>>are bad side-effects on the maximum index entry size, maximum
>>number of tuple fields, etc.
>
>Yes, with BLCKSZ=512 the database doesn't work at all, and with
>BLCKSZ=1024 it is very slow. (This surprised me; I expected an 8x
>performance increase with BLCKSZ=1024. :) )

No wonder pg did not work or was very slow: BLCKSZ = 512 or 1024 means
512 or 1024 *bytes* respectively.  That's 1/16 and 1/8 of the default
8KB BLCKSZ.


>  As I have already seen on this mailing list, increasing BLCKSZ can
> reduce performance too.

Where?  BLCKSZ as large as 64KB has been shown to improve
performance.  If running a RAID, BLCKSZ of ~1/2 the RAID stripe size
seems to be a good value.


>Maybe an optimum value exists? Theoretically, a BLCKSZ equal to the
>memory/disk page/block size could reduce fragmentation overhead in both
>memory and on disk.
Of course there's an optimal value... ...and of course it is
dependent on your HW, OS, and DB application.

In general, and in a very fuzzy sense, "bigger is better".  pg files
are laid down in 1GB chunks, so there's probably one limitation.
Given the HW you have mentioned, I'd try BLCKSZ= 65536 (you will have
to rebuild PostgreSQL) and a RAID stripe of 128KB or 256KB as a first guess.


>>In any case, when you didn't say *what* you tested, it's
>>impossible to judge the usefulness of the change.
>>                         regards, tom lane
>
>I tested performance on a database test server, a copy of a working
>billing system used to test new features and experiments. The test task
>was one day's traffic log. The average time of one test run was 260 minutes.

How large is a record in your billing system?  You want it to be an
integer divisor of BLCKSZ (so, for instance, odd sizes in bytes are BAD).
Beyond that, your application domain matters.  OLTP-like systems need
low-latency access for frequent small transactions.  Data-mining-like
systems need to do IO in as big a chunk as the HW and OS will
allow.  It is probably a good idea for BLCKSZ to be _at least_ max(8KB,
2x record size).
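To see how record size interacts with BLCKSZ, a rough fits-per-page calculation can be sketched like this (the 24-byte page header, 4-byte item pointer, and 28-byte tuple header used here are approximations for illustration, not exact on-disk numbers):

```python
# Rough arithmetic: how many fixed-size rows fit in one page.
# The overhead constants are approximations, not exact on-disk figures.
def rows_per_page(blcksz, row_data_bytes,
                  page_header=24, item_pointer=4, tuple_header=28):
    usable = blcksz - page_header
    per_row = item_pointer + tuple_header + row_data_bytes
    return usable // per_row

print(rows_per_page(8192, 32))   # 32-byte rows in an 8 KB page -> 127
print(rows_per_page(65536, 32))  # the same rows in a 64 KB page -> 1023
```

With these assumptions a 32-byte row really occupies about 64 bytes on disk, which is why "2x record size" is only a floor, not a target.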


>  PostgreSQL 7.4.8, dual Opteron 240 server, 4 GB RAM.

_Especially_ with that HW, upgrade to at least 8.0.x ASAP.  It's a
good idea to not be running pg 7.x anymore anyway, but it's
particularly so if you are running 64b SMP boxes.

Ron



Re: BLCKSZ

From: Tom Lane
Ron <rjpeace@earthlink.net> writes:
> Where?  BLCKSZ as large as 64KB has been shown to improve
> performance.

Not in the Postgres context, because you can't set BLCKSZ higher than
32K without doing extensive surgery on the page item pointer layout.
If anyone's actually gone to that much trouble, they sure didn't
publicize their results ...

>> PostgreSQL 7.4.8, dual Opteron 240 server, 4 GB RAM.

> _Especially_ with that HW, upgrade to at least 8.0.x ASAP.  It's a
> good idea to not be running pg 7.x anymore anyway, but it's
> particularly so if you are running 64b SMP boxes.

I agree with this bit --- 8.1 is a significant improvement on any prior
version for SMP boxes.  It's likely that 8.2 will be better yet,
because this is an area we just recently started paying serious
attention to.

            regards, tom lane

Re: BLCKSZ

From: Olleg
Ron wrote:
> In general, and in a very fuzzy sense, "bigger is better".  pg files are
> laid down in 1GB chunks, so there's probably one limitation.

Hm, I was expecting test results from other platforms, but since this is
a theoretical dispute...
I can't understand why "bigger is better". Take an index search, for
instance: the index points to a page, and I need to load that page to
get one row. Thus I load 8 kB from disk for every row and then keep it
in cache. You recommend 64 kB. With your recommendation I'll get 8 times
more I/O traffic, 8 times more head seeking on disk, and 8 times more
memory (OS cache and PostgreSQL's) tied up. I have small rows in a
frequently accessed table, 32 bytes each. The table is not clustered and
is accessed through several indices. So you recommend loading 64 kB when
I need only 32 bytes, don't you?
--
Olleg

Re: BLCKSZ

From: "Steinar H. Gunderson"
On Tue, Dec 06, 2005 at 01:40:47PM +0300, Olleg wrote:
> I can't understand why "bigger is better". Take an index search, for
> instance: the index points to a page, and I need to load that page to
> get one row. Thus I load 8 kB from disk for every row and then keep it
> in cache. You recommend 64 kB. With your recommendation I'll get 8
> times more I/O traffic, 8 times more head seeking on disk, and 8 times
> more memory (OS cache and PostgreSQL's) tied up.

Hopefully, you won't have eight times the seeking; a single block ought to be
in one chunk on disk. You're of course at your filesystem's mercy, though.

/* Steinar */
--
Homepage: http://www.sesse.net/

Re: BLCKSZ

From: David Lang
On Tue, 6 Dec 2005, Steinar H. Gunderson wrote:

> On Tue, Dec 06, 2005 at 01:40:47PM +0300, Olleg wrote:
>> I can't understand why "bigger is better". Take an index search, for
>> instance: the index points to a page, and I need to load that page to
>> get one row. Thus I load 8 kB from disk for every row and then keep it
>> in cache. You recommend 64 kB. With your recommendation I'll get 8
>> times more I/O traffic, 8 times more head seeking on disk, and 8
>> times more memory (OS cache and PostgreSQL's) tied up.
>
> Hopefully, you won't have eight times the seeking; a single block ought to be
> in one chunk on disk. You're of course at your filesystem's mercy, though.

In fact it would usually mean 1/8 as many seeks: since the 64k chunk
is created all at once, it's probably going to be one contiguous chunk
on disk, as Steinar points out, and that means you do one seek per 64k
instead of one seek per 8k.

With current disks it's getting to the point where it costs about the
same to read 8k as to read 64k (i.e. almost free; you could read
substantially more than 64k and not notice it in I/O speed); it's the
seeks that are expensive.
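The point about seeks dominating can be put in numbers. Assuming a 2005-era drive with roughly 8 ms average seek plus rotational latency and 60 MB/s sequential transfer (both figures are assumptions for illustration):

```python
# Back-of-the-envelope time for one random read: one seek plus transfer.
SEEK_MS = 8.0                 # assumed average seek + rotational latency
TRANSFER_MB_PER_S = 60.0      # assumed sequential transfer rate

def read_time_ms(block_bytes):
    transfer_ms = block_bytes / (TRANSFER_MB_PER_S * 1024 * 1024) * 1000
    return SEEK_MS + transfer_ms

print(read_time_ms(8 * 1024))    # ~8.13 ms
print(read_time_ms(64 * 1024))   # ~9.04 ms
```

Reading eight times the data costs only about 11% more wall time under these assumptions, so if even a fraction of the extra data is ever useful, the bigger block wins.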

Yes, it will eat up more RAM, but assuming you are likely to need other
things nearby, it's likely to be a win.

As processor speed keeps climbing relative to memory and disk speed,
true random access is really not the correct way to think about I/O
anymore. It's frequently more appropriate to think of your memory and
disks as if they were tape drives (seek, then read, repeat).

Even for memory access, what you really do is seek to the beginning of a
block (expensive), then read that block into cache (cheap: you get the
entire cache line of 64-128 bytes whether you need it or not), and then
you can access that block fairly quickly. With memory on SMP machines
it's a constant cost to seek anywhere in memory; with NUMA machines
(including multi-socket Opterons) the cost of the seek and fetch depends
on where in memory you are seeking to and which CPU you are running on.
It also becomes very expensive for multiple CPUs to write to memory
addresses that are in the same block (cache line) of memory.

For disks it's even more dramatic: the seek is incredibly expensive
compared to the read/write, and the cost of the seek varies with how far
you need to seek, but once you are on a track you can read the entire
track in for about the same cost as a single block (in fact the drive
usually does read the entire track before sending the one block on to
you). RAID complicates this because you have a block size per drive, and
reading larger than that block size involves multiple drives.

Most of the work in dealing with these issues and optimizing for them is
the job of the OS. Some other databases work very hard to take over this
work from the OS; Postgres instead tries to let the OS do it. But we
still need to keep it in mind when configuring things, because it's
possible to make it much easier or much harder for the OS to optimize.

David Lang