Thread: random_page_costs - are defaults of 4.0 realistic for SCSI RAID 1

random_page_costs - are defaults of 4.0 realistic for SCSI RAID 1

From
"Carlo Stonebanks"
Date:
Can anyone answer this for me: Although I realize my client's disk subsystem
(SCSI/RAID Smart Array E200 controller using RAID 1) is less than
impressive - is the default setting of 4.0 realistic or could it be lower?

Thanks!


Re: random_page_costs - are defaults of 4.0 realistic for SCSI RAID 1

From
"Luke Lonergan"
Date:
Should be a lot higher, something like 10-15 is closer to accurate.

Increasing the number of disks in a RAID actually makes the number higher,
not lower.  Until Postgres gets AIO + the ability to post multiple
concurrent IOs on index probes, random IO does not scale with increasing
disk count, but sequential does, thus the increasing "random page cost" as
the RAID gets faster.

The reason to change the number is to try to discourage the planner from
choosing index scans too aggressively.  We (GP) have implemented something
we call "Adaptive Nested Loop" to replace a nested loop + index scan with a
hash join when the selectivity estimates are off in order to improve this
behavior.  We also run with a "random_page_cost=100" because we generally
run on machines with fast sequential I/O.

- Luke


On 9/10/07 12:25 PM, "Carlo Stonebanks" <stonec.register@sympatico.ca>
wrote:

> Can anyone answer this for me: Although I realize my client's disk subsystem
> (SCSI/RAID Smart Array E200 controller using RAID 1) is less than
> impressive - is the default setting of 4.0 realistic or could it be lower?
>
> Thanks!
>
>



Re: random_page_costs - are defaults of 4.0 realistic for SCSI RAID 1

From
Mark Mielke
Date:
Luke Lonergan wrote:
> Should be a lot higher, something like 10-15 is closer to accurate.
>
In my own case, I have a much smaller database that I normally work
with, where everything should fit in memory (100 Mbytes?), and reducing
random_page_cost to 3.0 has resulted in consistently better timings for
me. I think this means that the planner doesn't understand my database
size to effective memory size ratio. :-)

Anyways - my point is that if you change the default to 10 you may hurt
people like me.

Cheers,
mark

--
Mark Mielke <mark@mielke.cc>

Re: random_page_costs - are defaults of 4.0 realistic for SCSI RAID 1

From
Josh Berkus
Date:
Luke,

>  We (GP) have implemented
> something we call "Adaptive Nested Loop" to replace a nested loop +
> index scan with a hash join when the selectivity estimates are off in
> order to improve this behavior.  We also run with a
> "random_page_cost=100" because we generally run on machines with fast
> sequential I/O.

So, when is this getting contributed?  ;-)

--
--Josh

Josh Berkus
PostgreSQL @ Sun
San Francisco

Re: random_page_costs - are defaults of 4.0 realistic for SCSI RAID 1

From
Gregory Stark
Date:
"Luke Lonergan" <llonergan@greenplum.com> writes:

> Should be a lot higher, something like 10-15 is closer to accurate.

Most people's experience is that due to Postgres underestimating the benefits
of caching, lowering the random_page_cost is helpful.

> Increasing the number of disks in a RAID actually makes the number higher,
> not lower.  Until Postgres gets AIO + the ability to post multiple
> concurrent IOs on index probes, random IO does not scale with increasing
> disk count, but sequential does, thus the increasing "random page cost" as
> the RAID gets faster.

That does sound right, though I don't think it's very common. If you have very
wide stripes you can get some amazing sequential scan speeds and it won't
really help random access at all. This is especially helpful if you're in an
environment where you don't care about the costs you're imposing on other
processes, such as a data warehouse where you have a single large batch query
running at a time.

What I don't understand is the bit about "until Postgres gets AIO + the
ability to post multiple concurrent IOs on index probes". Even with AIO your
seek times are not going to be improved by wide raid stripes. And you can't
possibly find the page at level n+1 before you've looked at the page at level
n. Do you mean to be able to probe multiple index keys simultaneously? How
does that work out?

--
  Gregory Stark
  EnterpriseDB          http://www.enterprisedb.com

Re: random_page_costs - are defaults of 4.0 realistic for SCSI RAID 1

From
"Scott Marlowe"
Date:
On 9/10/07, Gregory Stark <stark@enterprisedb.com> wrote:
>
> "Luke Lonergan" <llonergan@greenplum.com> writes:
>
> > Should be a lot higher, something like 10-15 is closer to accurate.
>
> Most people's experience is that due to Postgres underestimating the benefits
> of caching, lowering the random_page_cost is helpful.

Quite often the real problem is that they have effective_cache_size
too small, and they use random_page_cost to get the planner to switch
to index scans on small tables.  With a large effective_cache_size and
a small to moderate table (i.e. it fits in memory pretty handily), the
planner has gotten much better in the last few major releases at picking
an index over a sequential scan.

Re: random_page_costs - are defaults of 4.0 realistic for SCSI RAID 1

From
Jeff Davis
Date:
On Mon, 2007-09-10 at 22:44 +0100, Gregory Stark wrote:
> What I don't understand is the bit about "until Postgres gets AIO + the
> ability to post multiple concurrent IOs on index probes". Even with AIO your
> seek times are not going to be improved by wide raid stripes. And you can't
> possibly find the page at level n+1 before you've looked at the page at level
> n. Do you mean to be able to probe multiple index keys simultaneously? How
> does that work out?
>

I think he's referring to mirrors, in which there are multiple spindles
that can return a requested block. That could mitigate random I/O, if
the I/O is asynchronous and something intelligent (OS or controller) can
schedule it.

Regards,
    Jeff Davis


Re: random_page_costs - are defaults of 4.0 realistic for SCSI RAID 1

From
Mark Mielke
Date:
Gregory Stark wrote:
> "Luke Lonergan" <llonergan@greenplum.com> writes:
>> Increasing the number of disks in a RAID actually makes the number higher,
>> not lower.  Until Postgres gets AIO + the ability to post multiple
>> concurrent IOs on index probes, random IO does not scale with increasing
>> disk count, but sequential does, thus the increasing "random page cost" as
>> the RAID gets faster.
>
> What I don't understand is the bit about "until Postgres gets AIO + the
> ability to post multiple concurrent IOs on index probes". Even with AIO your
> seek times are not going to be improved by wide raid stripes. And you can't
> possibly find the page at level n+1 before you've looked at the page at level
> n. Do you mean to be able to probe multiple index keys simultaneously? How
> does that work out?

One suggestion: The plan is already in a tree. With some dependency
analysis, I assume the tree could be executed in parallel (multiple threads
or event triggered entry into a state machine), and I/O to fetch index pages
or table pages could be scheduled in parallel. At this point, AIO becomes
necessary to let the underlying system (and hardware with tagged queueing?)
schedule which pages should be served best first.

Cheers,
mark

-- 
Mark Mielke <mark@mielke.cc>

Re: random_page_costs - are defaults of 4.0 realistic for SCSI RAID 1

From
Mark Mielke
Date:
Scott Marlowe wrote:
> On 9/10/07, Gregory Stark <stark@enterprisedb.com> wrote:
>> "Luke Lonergan" <llonergan@greenplum.com> writes:
>>> Should be a lot higher, something like 10-15 is closer to accurate.
>>
>> Most people's experience is that due to Postgres underestimating the benefits
>> of caching, lowering the random_page_cost is helpful.
>
> Quite often the real problem is that they have effective_cache_size
> too small, and they use random_page_cost to get the planner to switch
> to index scans on small tables.  With a large effective_cache_size and
> a small to moderate table (i.e. it fits in memory pretty handily), the
> planner has gotten much better in the last few major releases at picking
> an index over a sequential scan.

In my case, I set effective_cache_size to 25% of the RAM available to the
system (256 Mbytes), for a database that was about 100 Mbytes or less. I
found performance to increase when reducing random_page_cost from 4.0 to 3.0.

For a database that truly fits entirely in memory, I assume random_page_cost
is closer to 1.0. The planner should know that there is no significant seek
cost for RAM.

I will try to compare results tonight using 8.2. The last time I checked may
have been 8.1. I am also curious to see what the current algorithm is with
regard to effective_cache_size.

Cheers,
mark

-- 
Mark Mielke <mark@mielke.cc>

Re: random_page_costs - are defaults of 4.0 realistic for SCSI RAID 1

From
"Luke Lonergan"
Date:
Hi Mark, Greg,

On 9/10/07 3:08 PM, "Mark Mielke" <mark@mark.mielke.cc> wrote:

> One suggestion: The plan is already in a tree. With some dependency analysis,
> I assume the tree could be executed in parallel (multiple threads or event
> triggered entry into a state machine), and I/O to fetch index pages or table
> pages could be scheduled in parallel. At this point, AIO becomes necessary to
> let the underlying system (and hardware with tagged queueing?) schedule which
> pages should be served best first.

Right now the pattern for index scan goes like this:

- Find qualifying TID in index
  - Seek to TID location in relfile
    - Acquire tuple from relfile, return

When the tuples are widely distributed in the table, as is the case with a
very selective predicate against an evenly distributed attribute on a
relation 2x larger than the I/O cache + bufcache, this pattern will result
in effectively "random I/O".  In actual fact, the use of the in-memory
bitmap index will make the I/Os sequential, but sparse, which is another
version of "random" if the sequential I/Os are larger than the
gather/scatter I/O aggregation in the OS scheduler (say 1MB).  This is a
very common circumstance for DSS / OLAP / DW workloads.

For plans that qualify with the above conditions, the executor will issue
blocking calls to lseek(), which will translate to a single disk actuator
moving to the needed location in seek_time, approximately 8ms.  The
seek_time for a single query will not improve with the increase in number of
disks in an underlying RAID pool, so we can do about 1000/8 = 125 seeks per
second no matter what I/O subsystem we have.

If we implement AIO and allow for multiple pending I/Os used to prefetch
groups of qualifying tuples, basically a form of random readahead, we can
improve the throughput for any given query by taking advantage of multiple
disk actuators.  This will work for RAID5, RAID10 and other disk pooling
mechanisms because the lseek() will be issued as parallel events.  Note that
the same approach would also work to speed sequential access by overlapping
compute and I/O.
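
To make the idea concrete, here is a rough, hypothetical sketch (not
Postgres code, and not Greenplum's implementation) of posting several reads
with POSIX AIO so the drives can service the seeks in parallel.  The file
name, block numbers and queue depth are made up for illustration:

#include <aio.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define BLCKSZ   8192
#define NPENDING 8              /* roughly one pending I/O per spindle */

int main(void)
{
    /* pretend these block numbers came from an index probe */
    long blocks[NPENDING] = {11, 4093, 72011, 130, 88417, 2203, 45820, 9910};
    struct aiocb cbs[NPENDING];
    char *bufs[NPENDING];
    int fd = open("relfile", O_RDONLY);

    if (fd < 0) { perror("open"); return 1; }

    /* post all the reads at once instead of one blocking read at a time */
    for (int i = 0; i < NPENDING; i++) {
        bufs[i] = malloc(BLCKSZ);
        memset(&cbs[i], 0, sizeof(cbs[i]));
        cbs[i].aio_fildes = fd;
        cbs[i].aio_buf    = bufs[i];
        cbs[i].aio_nbytes = BLCKSZ;
        cbs[i].aio_offset = blocks[i] * (off_t) BLCKSZ;
        if (aio_read(&cbs[i]) != 0) { perror("aio_read"); return 1; }
    }

    /* the disks can now seek concurrently; collect results as they finish */
    for (int i = 0; i < NPENDING; i++) {
        const struct aiocb *list[1] = { &cbs[i] };
        while (aio_error(&cbs[i]) == EINPROGRESS)
            aio_suspend(list, 1, NULL);
        printf("block %ld: read %zd bytes\n", blocks[i], aio_return(&cbs[i]));
    }
    return 0;
}

(On Linux this links against the POSIX AIO library with -lrt.)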

- Luke



Re: random_page_costs - are defaults of 4.0 realistic for SCSI RAID 1

From
"Luke Lonergan"
Date:
Hi Josh,

On 9/10/07 2:26 PM, "Josh Berkus" <josh@agliodbs.com> wrote:

> So, when is this getting contributed?  ;-)

Yes, that's the right question to ask :-)

One feeble answer: "when we're not overwhelmed by customer activity"...

- Luke



Re: random_page_costs - are defaults of 4.0 realistic for SCSI RAID 1

From
Gregory Stark
Date:
"Luke Lonergan" <llonergan@greenplum.com> writes:

> Right now the pattern for index scan goes like this:
>
> - Find qualifying TID in index
>   - Seek to TID location in relfile
>     - Acquire tuple from relfile, return
>...
> If we implement AIO and allow for multiple pending I/Os used to prefetch
> groups of qualifying tuples, basically a form of random readahead

Ah, I see what you mean now. It makes a lot more sense if you think of it for
bitmap index scans. So, for example, the bitmap index scan could stream tids
to the executor and the executor would strip out the block numbers and pass
them to the i/o layer saying "i need this block now but following that I'll
need these blocks so get them moving now".

I think this seems pretty impractical for regular (non-bitmap) index probes
though. You might be able to do it sometimes but not very effectively and you
won't know when it would be useful.

I think what this means is that there are actually *three* kinds of i/o: 1)
Sequential which means you get the full bandwidth of your drives * the number
of spindles; 2) Random which gets you 1 block per seek latency regardless of
how many spindles you have; and 3) Random but with prefetch which gets you the
random bandwidth above times the number of spindles.

The extra spindles speed up sequential i/o too so the ratio between sequential
and random with prefetch would still be about 4.0. But the ratio between
sequential and random without prefetch would be even higher.

--
  Gregory Stark
  EnterpriseDB          http://www.enterprisedb.com

Re: random_page_costs - are defaults of 4.0 realistic for SCSI RAID 1

From
"Luke Lonergan"
Date:
Greg,

> I think this seems pretty impractical for regular
> (non-bitmap) index probes though. You might be able to do it
> sometimes but not very effectively and you won't know when it
> would be useful.

Maybe so, though I think it's reasonable to get multiple actuators going
even if the seeks are truly random.  It's a dynamic / tricky business to
determine how many pending seeks to post, but it should be roughly close
to the number of disks in the pool IMO.

> I think what this means is that there are actually *three*
> kinds of i/o: 1) Sequential which means you get the full
> bandwidth of your drives * the number of spindles; 2) Random
> which gets you 1 block per seek latency regardless of how
> many spindles you have; and 3) Random but with prefetch which
> gets you the random bandwidth above times the number of spindles.

Perhaps so, though I'm more optimistic that prefetch would help most
random seek situations.

For reasonable amounts of concurrent usage this point becomes moot - we
get the benefit of multiple backends doing seeking anyway, but I think
if we do dynamic prefetch right it would degenerate gracefully in those
circumstances.

> The extra spindles speed up sequential i/o too so the ratio
> between sequential and random with prefetch would still be
> about 4.0. But the ratio between sequential and random
> without prefetch would be even higher.

Right :-)

- Luke


Re: random_page_costs - are defaults of 4.0 realistic for SCSI RAID 1

From
"Heikki Linnakangas"
Date:
Luke Lonergan wrote:
> For plans that qualify with the above conditions, the executor will issue
> blocking calls to lseek(), which will translate to a single disk actuator
> moving to the needed location in seek_time, approximately 8ms.

I doubt it's actually the lseeks, but the reads/writes after the lseeks
that block.

> If we implement AIO and allow for multiple pending I/Os used to prefetch
> groups of qualifying tuples, basically a form of random readahead, we can
> improve the throughput for any given query by taking advantage of multiple
> disk actuators.

Rather than jumping to AIO, which is a huge change, I think we could get
much of the benefit by using posix_fadvise(WILLNEED) in strategic places
to tell the OS what pages we're going to need in the near future. If the
OS has implemented that properly, it should schedule I/Os for the
requested pages ahead of time. That would require very little change to
PostgreSQL code, and could simply be #ifdef'd away on platforms that
don't support posix_fadvise.
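
For illustration, a minimal sketch of that idea (a hypothetical helper, not
actual PostgreSQL code): given a list of heap block numbers an index or
bitmap scan is about to visit, hint them all to the kernel before the
ordinary reads are issued.

#include <fcntl.h>
#include <sys/types.h>

#define BLCKSZ 8192

/* Hint that we will soon read these blocks of the given relation file, so
 * the OS can start the seeks early.  Errors are ignored; the reads still
 * work without the hint. */
static void
hint_upcoming_blocks(int fd, const long *blocks, int nblocks)
{
#ifdef POSIX_FADV_WILLNEED
    for (int i = 0; i < nblocks; i++)
        (void) posix_fadvise(fd, blocks[i] * (off_t) BLCKSZ, BLCKSZ,
                             POSIX_FADV_WILLNEED);
#endif
}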

> Note that
> the same approach would also work to speed sequential access by overlapping
> compute and I/O.

Yes, though the OS should already be doing read-ahead for us. How efficient
it is is another question. posix_fadvise(SEQUENTIAL) could be used to
give a hint on that as well.

--
  Heikki Linnakangas
  EnterpriseDB   http://www.enterprisedb.com

Re: random_page_costs - are defaults of 4.0 realistic for SCSI RAID 1

From
Jean-David Beyer
Date:
Gregory Stark wrote (in part):

> The extra spindles speed up sequential i/o too so the ratio between sequential
> and random with prefetch would still be about 4.0. But the ratio between
> sequential and random without prefetch would be even higher.
>
I never figured out how extra spindles help sequential I-O because
consecutive logical blocks are not necessarily written consecutively in a
Linux or UNIX file system. They try to group a bunch (8 512-byte?) of blocks
together, but that is about it. So even if you are reading sequentially, the
head actuator may be seeking around anyway. I suppose you could fix this, if
the database were reasonably static, by backing up the entire database,
doing a mkfs on the file system, and restoring it. This might make the
database more contiguous, at least for a while.

When I was working on a home-brew UNIX dbms, I used raw IO on a separate
disk drive so that the files could be contiguous, and this would work.
Similarly, IBM's DB2 does that (optionally). But it is my understanding that
PostgreSQL does not. OTOH, the large (in my case) cache in the kernel can be
helpful if I seek around back and forth to nearby records since they may be
in the cache. On my 8 GByte RAM, I have the shared buffers set to 200,000
which should keep any busy stuff in memory, and there are about 6 GBytes of
ram presently available for the system I-O cache. I have not optimized
anything yet because I am still laundering the major input data to
initialize the database so I do not have any real transactions going through
it yet.

I have 6 SCSI hard drives on two Ultra/320 SCSI controllers. Of the database
partitions, sda8 has the write-ahead-log, sdb7 has a few tiny seldom-used
tables and pg_log, and sdc1, sdd1, sde1, and sdf1 are just for the other
tables. For the data on sd[c-f]1 (there is nothing else on these drives), I
keep the index for a table on a different drive from the data. When
populating the database initially, this seems to help since I tend to fill
one table, or a very few tables, at a time, so the table itself and its
index do not contend for the head actuator. Presumably, the SCSI controllers
can do simultaneous seeks on the various drives and one transfer on each
controller.

When loading the database (using INSERTs mainly -- because the input data
are gawdawful unnormalized spreadsheets obtained from elsewhere, growing
once a week), the system is IO limited with seeks (and rotational latency
time). IO transfers average about 1.7 Megabytes/second, although there are
peaks that exceed 10 Megabytes/second. If I run pg_restore from a backup
tape, I can see 90 Megabyte/second transfer rates for bursts of several
seconds at a time, but that is pretty much of a record.

--
  .~.  Jean-David Beyer          Registered Linux User 85642.
  /V\  PGP-Key: 9A2FC99A         Registered Machine   241939.
 /( )\ Shrewsbury, New Jersey    http://counter.li.org
 ^^-^^ 06:35:01 up 33 days, 9:57, 0 users, load average: 4.06, 4.07, 4.02

Re: random_page_costs - are defaults of 4.0 realistic for SCSI RAID 1

From
Gregory Stark
Date:
"Jean-David Beyer" <jeandavid8@verizon.net> writes:

> Gregory Stark wrote (in part):
>
>> The extra spindles speed up sequential i/o too so the ratio between sequential
>> and random with prefetch would still be about 4.0. But the ratio between
>> sequential and random without prefetch would be even higher.
>>
> I never figured out how extra spindles help sequential I-O because
> consecutive logical blocks are not necessarily written consecutively in a
> Linux or UNIX file system. They try to group a bunch (8 512-byte?) of blocks
> together, but that is about it. So even if you are reading sequentially, the
> head actuator may be seeking around anyway.

That's somewhat true but good filesystems group a whole lot more than 8 blocks
together. You can do benchmarks with dd and compare the speed of reading from
a file with the speed of reading from the raw device. On typical consumer
drives these days you'll get 50-60MB/s raw and I would expect not a whole lot
less than that with a large ext2 file, at least if it's created all in one
chunk on a not overly-full filesystem.

(Those assumptions are not necessarily valid for Postgres, which is another
topic, but one that requires some empirical numbers before diving in.)

--
  Gregory Stark
  EnterpriseDB          http://www.enterprisedb.com

Re: random_page_costs - are defaults of 4.0 realistic for SCSI RAID 1

From
Jean-David Beyer
Date:

Gregory Stark wrote:
> "Jean-David Beyer" <jeandavid8@verizon.net> writes:
>
>> Gregory Stark wrote (in part):
>>
>>> The extra spindles speed up sequential i/o too so the ratio between sequential
>>> and random with prefetch would still be about 4.0. But the ratio between
>>> sequential and random without prefetch would be even higher.
>>>
>> I never figured out how extra spindles help sequential I-O because
>> consecutive logical blocks are not necessarily written consecutively in a
>> Linux or UNIX file system. They try to group a bunch (8 512-byte?) of blocks
>> together, but that is about it. So even if you are reading sequentially, the
>> head actuator may be seeking around anyway.
>
> That's somewhat true but good filesystems group a whole lot more than 8 blocks
> together. You can do benchmarks with dd and compare the speed of reading from
> a file with the speed of reading from the raw device. On typical consumer
> drives these days you'll get 50-60MB/s raw and I would expect not a whole lot
> less than that with a large ext2 file, at least if it's created all in one
> chunk on a not overly-full filesystem.

# date; dd if=/dev/sda8 of=/dev/null;date
Tue Sep 11 14:27:36 EDT 2007
8385867+0 records in
8385867+0 records out
4293563904 bytes (4.3 GB) copied, 71.7648 seconds, 59.8 MB/s
Tue Sep 11 14:28:48 EDT 2007

# date; dd bs=8192 if=/dev/sda8 of=/dev/null;date
Tue Sep 11 14:29:15 EDT 2007
524116+1 records in
524116+1 records out
4293563904 bytes (4.3 GB) copied, 68.2595 seconds, 62.9 MB/s
Tue Sep 11 14:30:23 EDT 2007

# date; dd bs=8192
if=/srv/dbms/dataA/pgsql/data/pg_xlog/000000010000002B0000002F of=/dev/null;date
Tue Sep 11 14:34:25 EDT 2007
2048+0 records in
2048+0 records out
16777216 bytes (17 MB) copied, 0.272343 seconds, 61.6 MB/s
Tue Sep 11 14:34:26 EDT 2007

The first two read the partition that the W.A.L. is in (and a bit more):

[/srv/dbms/dataA/pgsql/data]# ls -l
total 104
-rw------- 1 postgres postgres     4 Aug 11 13:32 PG_VERSION
drwx------ 5 postgres postgres  4096 Aug 11 13:32 base
drwx------ 2 postgres postgres  4096 Sep 11 14:35 global
drwx------ 2 postgres postgres  4096 Sep 10 18:58 pg_clog
-rw------- 1 postgres postgres  3396 Aug 11 13:32 pg_hba.conf
-rw------- 1 root     root      3396 Aug 16 14:32 pg_hba.conf.dist
-rw------- 1 postgres postgres  1460 Aug 11 13:32 pg_ident.conf
drwx------ 4 postgres postgres  4096 Aug 11 13:32 pg_multixact
drwx------ 2 postgres postgres  4096 Sep 10 19:48 pg_subtrans
drwx------ 2 postgres postgres  4096 Aug 12 16:14 pg_tblspc
drwx------ 2 postgres postgres  4096 Aug 11 13:32 pg_twophase
drwx------ 3 postgres postgres  4096 Sep 10 19:53 pg_xlog
-rw------- 1 postgres postgres 15527 Sep  8 00:35 postgresql.conf
-rw------- 1 postgres postgres 13659 Aug 11 13:32 postgresql.conf.dist
-rw------- 1 root     root     15527 Sep  4 10:37 postgresql.conf~
-rw------- 1 postgres postgres    56 Sep  8 08:12 postmaster.opts
-rw------- 1 postgres postgres    53 Sep  8 08:12 postmaster.pid

It is tricky for me to find a big enough file to test. I tried one of the
pg_xlog files, but I cannot easily copy from there because it acts a bit
interactive and the time is mostly my time. If I copy it elsewhere and give
it to non-root, then it is all in the cache, so it does not really read it.
>
> (Those assumptions is not necessarily valid for Postgres which is another
> topic, but one that requires some empirical numbers before diving into.)
>


--
  .~.  Jean-David Beyer          Registered Linux User 85642.
  /V\  PGP-Key: 9A2FC99A         Registered Machine   241939.
 /( )\ Shrewsbury, New Jersey    http://counter.li.org
 ^^-^^ 14:30:04 up 33 days, 17:52, 1 user, load average: 5.50, 4.67, 4.29

Re: random_page_costs - are defaults of 4.0 realistic for SCSI RAID 1

From
Decibel!
Date:
On Mon, Sep 10, 2007 at 06:22:06PM -0400, Mark Mielke wrote:
> In my case, I set effective_cache_size to 25% of the RAM available to
> the system (256 Mbytes), for a database that was about 100 Mbytes or
> less. I found performance to increase when reducing random_page_cost
> from 4.0 to 3.0.

Just for the record, effective_cache_size of 25% is *way* too low in
most cases, though if you only have 1GB setting it to 500MB probably
isn't too far off.

Generally, I'll set this to however much memory is in the server, minus
1G for the OS, unless there's less than 4G of total memory in which case
I subtract less.
--
Decibel!, aka Jim Nasby                        decibel@decibel.org
EnterpriseDB      http://enterprisedb.com      512.569.9461 (cell)


Re: random_page_costs - are defaults of 4.0 realistic for SCSI RAID 1

From
Decibel!
Date:
On Tue, Sep 11, 2007 at 02:49:28PM -0400, Jean-David Beyer wrote:
> It is tricky for me to find a big enough file to test. I tried one of the

dd if=/dev/zero of=bigfile bs=8192 count=1000000
--
Decibel!, aka Jim Nasby                        decibel@decibel.org
EnterpriseDB      http://enterprisedb.com      512.569.9461 (cell)


Re: random_page_costs - are defaults of 4.0 realistic for SCSI RAID 1

From
Mark Mielke
Date:
Decibel! wrote:
> On Mon, Sep 10, 2007 at 06:22:06PM -0400, Mark Mielke wrote:
>> In my case, I set effective_cache_size to 25% of the RAM available to
>> the system (256 Mbytes), for a database that was about 100 Mbytes or
>> less. I found performance to increase when reducing random_page_cost
>> from 4.0 to 3.0.
>
> Just for the record, effective_cache_size of 25% is *way* too low in
> most cases, though if you only have 1GB setting it to 500MB probably
> isn't too far off.
>
> Generally, I'll set this to however much memory is in the server, minus
> 1G for the OS, unless there's less than 4G of total memory in which case
> I subtract less.

Agreed. My point was only that there are conflicting database requirements,
and that one setting may not be valid for both. The default should be
whatever is the most useful for the greatest number of people. People who
fall into one of the two extremes should know enough to set the value based
on actual performance measurements.

Cheers,
mark

-- 
Mark Mielke <mark@mielke.cc>

Re: random_page_costs - are defaults of 4.0 realistic for SCSI RAID 1

From
Gregory Stark
Date:
"Decibel!" <decibel@decibel.org> writes:

> On Tue, Sep 11, 2007 at 02:49:28PM -0400, Jean-David Beyer wrote:
>> It is tricky for me to find a big enough file to test. I tried one of the
>
> dd if=/dev/zero of=bigfile bs=8192 count=1000000

On linux another useful trick is:

echo 1 > /proc/sys/vm/drop_caches

Also, it helps to run a "vmstat 1" in another window and watch the bi and bo
columns.

--
  Gregory Stark
  EnterpriseDB          http://www.enterprisedb.com

Re: random_page_costs - are defaults of 4.0 realistic for SCSI RAID 1

From
Decibel!
Date:
On Wed, Sep 12, 2007 at 12:02:46AM +0100, Gregory Stark wrote:
> "Decibel!" <decibel@decibel.org> writes:
>
> > On Tue, Sep 11, 2007 at 02:49:28PM -0400, Jean-David Beyer wrote:
> >> It is tricky for me to find a big enough file to test. I tried one of the
> >
> > dd if=/dev/zero of=bigfile bs=8192 count=1000000
>
> On linux another useful trick is:
>
> echo 1 > /proc/sys/vm/drop_caches

The following C code should have similar effect...


/*
 * $Id: clearmem.c,v 1.1 2003/06/29 20:41:33 decibel Exp $
 *
 * Utility to clear out a chunk of memory and zero it. Useful for flushing
 * disk buffers.  The single argument is the number of megabytes to allocate.
 */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(int argc, char *argv[]) {
    size_t bytes = (argc > 1 ? (size_t) atoi(argv[1]) : 0) * 1024 * 1024;
    char *p = calloc(bytes, 1);
    if (!p) { printf("Error allocating memory.\n"); return 1; }
    memset(p, 0, bytes);   /* touch every page so cached data is actually displaced */
    return 0;
}
--
Decibel!, aka Jim Nasby                        decibel@decibel.org
EnterpriseDB      http://enterprisedb.com      512.569.9461 (cell)


Re: random_page_costs - are defaults of 4.0 realistic for SCSI RAID 1

From
Greg Smith
Date:
On Wed, 12 Sep 2007, Gregory Stark wrote:

> Also, it helps to run a "vmstat 1" in another window and watch the bi and bo
> columns.

Recently on Linux systems I've been using dstat (
http://dag.wieers.com/home-made/dstat/ ) for live monitoring in this sort
of situation.  Once you get the command line parameters right, you can get
data for each of the major disks on your system that keep the columns
human readable (like switching from KB/s to MB/s as appropriate) as
activity goes up and down combined with the standard vmstat data.

I still use vmstat/iostat if I want to archive or parse the data, but if
I'm watching it I always use dstat now.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD

Re: random_page_costs - are defaults of 4.0 realistic for SCSI RAID 1

From
"Joshua D. Drake"
Date:

Greg Smith wrote:
> On Wed, 12 Sep 2007, Gregory Stark wrote:
>
>> Also, it helps to run a "vmstat 1" in another window and watch the bi
>> and bo
>> columns.
>
> Recently on Linux systems I've been using dstat (
> http://dag.wieers.com/home-made/dstat/ ) for live monitoring in this
> sort of situation.  Once you get the command line parameters right, you
> can get data for each of the major disks on your system that keep the
> columns human readable (like switching from KB/s to MB/s as appropriate)
> as activity goes up and down combined with the standard vmstat data.
>
> I still use vmstat/iostat if I want to archive or parse the data, but if
> I'm watching it I always use dstat now.

Thanks for the tip Greg... I hadn't heard of dstat.

Sincerely,

Joshua D. Drake


>
> --
> * Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD
>


--

      === The PostgreSQL Company: Command Prompt, Inc. ===
Sales/Support: +1.503.667.4564   24x7/Emergency: +1.800.492.2240
PostgreSQL solutions since 1997  http://www.commandprompt.com/
            UNIQUE NOT NULL
Donate to the PostgreSQL Project: http://www.postgresql.org/about/donate
PostgreSQL Replication: http://www.commandprompt.com/products/



Re: random_page_costs - are defaults of 4.0 realistic for SCSI RAID 1

From
Gregory Stark
Date:
"Gregory Stark" <stark@enterprisedb.com> writes:

> "Luke Lonergan" <llonergan@greenplum.com> writes:
>
>> Right now the pattern for index scan goes like this:
>>
>> - Find qualifying TID in index
>>   - Seek to TID location in relfile
>>     - Acquire tuple from relfile, return
>>...
>> If we implement AIO and allow for multiple pending I/Os used to prefetch
>> groups of qualifying tuples, basically a form of random readahead
>
> Ah, I see what you mean now. It makes a lot more sense if you think of it for
> bitmap index scans. So, for example, the bitmap index scan could stream tids
> to the executor and the executor would strip out the block numbers and pass
> them to the i/o layer saying "i need this block now but following that I'll
> need these blocks so get them moving now".

Wow, I've done some preliminary testing here on Linux using posix_fadvise and
Solaris using libaio to prefetch blocks and then access them randomly and I
think there's a lot of low hanging fruit here.

The use case where this helps is indeed on a raid array where you're not
maxing out the bandwidth of the array and care about the transaction latency,
perhaps a narrow use case but still, quite common.

Since our random access is synchronous it means we have to wait for one seek,
process that page, then wait for the next seek on another drive which was
sitting idle while we were processing the first page. By prefetching the pages
we'll need next we can get all the members of the array working for us
simultaneously even if they're all doing seeks.

What I've done is write a test program which generates a 1G file, syncs it and
drops the caches (not working yet on Solaris but doesn't seem to affect the
results) and then picks 4096 8k buffers and reads them in random order. The
machines it's running on have a small raid array with 4 drives.
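
(For the curious, the inner loop of such a test might look roughly like the
sketch below. This is a simplified reconstruction under stated assumptions,
not the actual test program; the file name, block selection and window
handling are made up.)

#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define BLCKSZ   8192
#define NREADS   4096
#define PREFETCH 128            /* how many requests to keep outstanding */

int main(void)
{
    long nblocks, order[NREADS];
    char buf[BLCKSZ];
    int  fd = open("bigfile", O_RDONLY);        /* the 1G test file */

    if (fd < 0) { perror("open"); return 1; }
    nblocks = lseek(fd, 0, SEEK_END) / BLCKSZ;

    for (int i = 0; i < NREADS; i++)            /* random access pattern */
        order[i] = random() % nblocks;

    for (int i = 0; i < PREFETCH; i++)          /* prime the prefetch window */
        posix_fadvise(fd, order[i] * (off_t) BLCKSZ, BLCKSZ,
                      POSIX_FADV_WILLNEED);

    for (int i = 0; i < NREADS; i++) {
        if (i + PREFETCH < NREADS)              /* keep the window full */
            posix_fadvise(fd, order[i + PREFETCH] * (off_t) BLCKSZ,
                          BLCKSZ, POSIX_FADV_WILLNEED);
        pread(fd, buf, BLCKSZ, order[i] * (off_t) BLCKSZ);
    }
    close(fd);
    return 0;
}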

Just seeking without any prefetch it takes about 12.5s on Linux and 13.5s on
Solaris. If I prefetch even a single buffer using posix_fadvise or libaio I
see a noticeable improvement, over 25%. At 128 buffers of prefetch both
systems are down to about 2.5-2.7s. That's on the small raid array. On the
boot drive both have a small beneficial effect, but only at very large
prefetch set sizes, which I would chalk up to being able to re-order the
reads even if it can't overlap them.

I want to test how much of this effect evaporates when I compare it to a
bitmap index style scan but that depends on a lot of factors like the exact
pattern of file extensions on the database files. In any case bitmap index
scans get us the reordering effect, but not the overlapping i/o requests
assuming they're spread quite far apart in the data files.

> I think this seems pretty impractical for regular (non-bitmap) index probes
> though. You might be able to do it sometimes but not very effectively and you
> won't know when it would be useful.

How useful this is depends a lot on how invasively we let it infect things
like regular index scans. If we can prefetch right siblings and deeper index
pages as we descend an index tree and future heap pages it could help a lot as
those aren't sorted like bitmap index scans. But even if we only fetch heap
pages all together before processing the heap pages it could be a big help.

Incidentally we do need to try to make use of both as Solaris doesn't have
posix_fadvise as far as I can tell and Linux's libaio doesn't support
non-O_DIRECT files.

Raw data:

Blocks          Linux           Solaris
Prefetched      posix_fadvise   libaio
---------------------------------------
    1           12.473          13.597
    2            9.053           9.830
    4            6.787           7.594
    8            5.303           6.588
   16            4.209           5.120
   32            3.388           4.014
   64            2.869           3.216
  128            2.515           2.710
  256            2.312           2.327
  512            2.168           2.099
 1024            2.139           1.974
 2048            2.242           1.903
 4096            2.222           1.890

--
  Gregory Stark
  EnterpriseDB          http://www.enterprisedb.com

Re: random_page_costs - are defaults of 4.0 realistic for SCSI RAID 1

From
"Kevin Grittner"
Date:
>>> On Mon, Sep 10, 2007 at  2:25 PM, in message <fc45mi$1tj9$1@news.hub.org>,
"Carlo Stonebanks" <stonec.register@sympatico.ca> wrote:

> is the default setting of 4.0 realistic or could it be lower?

Wow, such a simple, innocent question.

As you may have inferred, it can't be answered in isolation.  Make sure that
you have reviewed all of your memory settings, then try adjusting this and
seeing what the results are.  With accurate effective_cache_size and a fairly
generous work_mem setting, we have found that these settings work best for us
with our actual production loads:

(1)  Cache well below database size (for example 6 GB or 12 GB RAM on a box
running a 210 GB database):

#seq_page_cost = 1.0
random_page_cost = 2.0

(2)  On a database which is entirely contained within cache:

seq_page_cost = 0.1
random_page_cost = 0.1

(3)  Where caching is very significant, but not complete, we have to test
to see where performance is best.  One example that significantly beat both
of the above in production on a particular box:

seq_page_cost = 0.3
random_page_cost = 0.5

So, the short answer to your question is that the default might be realistic
in some environments; the best choice will be lower in many environments;
the best choice will be higher in some environments; only testing your
actual applications in your actual environment can tell you which is the
case for you.

My approach is to pick one of the first two, depending on whether the
database will be fully cached, then monitor for performance problems.  When
the queries with unacceptable response time have been identified, I look
for ways to improve them.  One of the things I may try, where a bad plan
seems to have been chosen, is to adjust the random page cost.  If I do
change that in production, then I closely monitor for regression in other
queries.

-Kevin