Thread: Non-linear Performance

Non-linear Performance

From
Curt Sampson
Date:
I'm noticing that performance in creating a particular index, and
also a little bit in a simple query, seems somewhat non-linear,
and I'm wondering if anybody could give me any clues as to what
might be causing this, and how I might fix it, if that's possible.

I've done a COPY in and index build on three data sets (just integer
data) consisting of 10m rows (500 MB table), 100m rows (5 GB table),
and 500m rows (25 GB table). (This is all on a 1.4 GHz P4, 512 MB RAM,
two 7200 RPM IDE drives, one for data and one for log.)

The COPY command to do the import is pretty linear, at about 230
sec., 2,000 sec. and 10,000 sec. for the 10m, 100m and 500m row
tables. Neither disk I/O nor CPU seems to be pushed particularly hard,
though I'm getting lots of small writes to the log files. I tried
using a larger checkpoint size, but it didn't seem to help any.
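
For reference, the load itself is just a plain COPY, roughly along
these lines (the file path here is made up):

    -- bulk load of the integer data from a server-side file
    COPY data_3 FROM '/data/load/data_3.tab';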

Recreating the primary key (on an INT, and of course all values
unique) took 123, 1300 and 6700 seconds, again pretty linear.

The next column is a DATE, which is the same value for the first
3.5M records, the next day for the next 3.5M records, and so on,
incrementing a day for every 3.5M records (in order). This index
build took about 60, 700 and 3500 seconds, respectively, again
linear.

But then I get to the next INT column which in every row is filled
in with a random value between 0 and 99,999. This index takes about
175, 3600, and 28,000 seconds, respectively, to generate.
So it takes about 2x as long per record going from 10m to 100m
records, and about 1.5x as long again per record when going from
100m to 500m records.
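
For reference, the index in question is an ordinary b-tree, built
with something along these lines (names as they appear in the query
plan further down):

    -- b-tree on the random-valued INT column
    CREATE INDEX data_3_value ON data_3 (value);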

Queries using that index seem to do this too, though not quite as
badly.  Using a very simple query such as "SELECT COUNT(*) FROM
table WHERE value = 12345" (where value is the last INT column
above that took ages to index), typical query times (including
connection overhead) for data not in the cache are 0.6 sec., 11
sec. and 72 sec.

This query, as expected, is completely dominated by random IO; the
disk the table is on sits there at 100% usage (i.e., disk requests
outstanding 100% of the time) and not much else is happening at all.

It does seem to do a few more disk transfers than I would
really expect. I get back a count of around 4000-5000, which to me
implies about 5000 reads plus the index reads (which one would
think would not amount to more than one or two hundred pages), yet
110 I/O requests per second times 70 seconds implies about 7000
reads. Is there something I'm missing here?

If the query plan and analysis are any help, here they are:

    Aggregate  (cost=22081.46..22081.46 rows=1 width=0) (actual time=70119.88..70119.88 rows=1 loops=1)
      ->  Index Scan using data_3_value on data_3  (cost=0.00..22067.71 rows=5498 width=0) (actual time=38.70..70090.45 rows=4980 loops=1)
    Total runtime: 70121.74 msec

Anyway, I'm open to any thoughts on this. In particular, I'm open
to suggestions for cheap ways of dealing with this horrible random
I/O load. (Yeah, yeah, I know: disk array, and SCSI while I'm about
it. But I've been asked to get this sort of thing working fast on
systems much cheaper than the current Sun/Sybase/EMC or whatever
it is that they're using.)

cjs
--
Curt Sampson  <cjs@cynic.net>   +81 90 7737 2974   http://www.netbsd.org
    Don't you know, in this new Dark Age, we're all light.  --XTC


Re: Non-linear Performance

From
Tom Lane
Date:
Curt Sampson <cjs@cynic.net> writes:
> I'm noticing that performance in creating a particular index, and
> also a little bit in a simple query, seems somewhat non-linear,

Btree index build is primarily a sort, so cannot have better than
O(n*log(n)) performance for random data.  Not sure why you'd expect
linearity.

Increasing SORT_MEM would help the constant factor, however...

> Queries using that index seem to do this too, though not quite as
> badly.  Using a very simple query such as "SELECT COUNT(*) FROM
> table WHERE value = 12345" (where value is the last INT column
> above that took ages to index), typical query times (including
> connection overhead) for data not in the cache are 0.6 sec., 11
> sec. and 72 sec.

I guess that the smaller datasets would get proportionally more benefit
from kernel disk caching.

> It does seem to do a few more disk transfers than I would
> really expect. I get back a count of around 4000-5000, which to me
> implies about 5000 reads plus the index reads (which one would
> think would not amount to more than one or two hundred pages), yet
> 110 I/O requests per second times 70 seconds implies about 7000
> reads. Is there something I'm missing here?

Can you demonstrate that it actually did 7000 reads, and not 5000+?
That extrapolation technique doesn't look to me like it has the
accuracy to tell the difference.  You might try setting show_query_stats
(note the results go into the postmaster log, not to the client;
perhaps we ought to change that someday).
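
Something along these lines (if your installation doesn't let you SET
it per session, set it in postgresql.conf and restart instead):

    -- collect per-query resource statistics for the slow query;
    -- the statistics appear in the postmaster log, not in psql
    SET show_query_stats = on;
    SELECT COUNT(*) FROM data_3 WHERE value = 12345;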

Also, if you've updated the table at all, there might be some fetches of
dead tuples involved.
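
If it has been updated, a VACUUM (plus an ANALYZE while you're at it)
is the usual first thing to try:

    -- reclaim dead tuples and refresh the planner's statistics
    VACUUM ANALYZE data_3;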

> Anyway, I'm open to any thoughts on this. In particular, I'm open
> to suggestions for cheap ways of dealing with this horrible random
> I/O load.

More RAM, perhaps.

            regards, tom lane

Re: Non-linear Performance

From
"Peter A. Daly"
Date:
Tom Lane wrote:

>
>Btree index build is primarily a sort, so cannot have better than
>O(n*log(n)) performance for random data.  Not sure why you'd expect
>linearity.
>
>Increasing SORT_MEM would help the constant factor, however...
>
What is the most amount of SORT_MEM it makes sense to allocate?  Pretend
the ONLY thing I care about is BTREE index creation time.  (2 gigs of
RAM on my DB machine.)  Disk IO is not a bottleneck.

-Pete



Re: Non-linear Performance

From
"Peter A. Daly"
Date:
Peter A. Daly wrote:

>
> Tom Lane wrote:
>
>>
>> Btree index build is primarily a sort, so cannot have better than
>> O(n*log(n)) performance for random data.  Not sure why you'd expect
>> linearity.
>>
>> Increasing SORT_MEM would help the constant factor, however...
>>
> What is the most amount of SORT_MEM it makes sense to allocate?
> Pretend the ONLY thing I care about is BTREE index creation time.  (2
> gigs of RAM on my DB machine.)  Disk IO is not a bottleneck.

Another thought.  If postgres has one of my CPUs at near 100%, does that
mean I can't get any more performance out of it?  Still, how big can I
make sort_mem?

-Pete



Re: Non-linear Performance

From
Tom Lane
Date:
"Peter A. Daly" <petedaly@ix.netcom.com> writes:
> Tom Lane wrote:
>> Increasing SORT_MEM would help the constant factor, however...
>>
> What is the most amount of SORT_MEM it makes sense to allocate?

I've never experimented with it, but certainly the standard default
(512K = 0.5M) is pretty small for modern machines.  In a 2G machine
I might try settings around 100M-500M to see what works best.  (Note
this is just for a one-off btree creation --- for ordinary queries you
need to allow for multiple sorts going on in parallel, which is one
reason the default sort_mem is not very large.)
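
For a one-off build that might look something like this (table and
index names are just placeholders; sort_mem is given in kilobytes):

    -- give this one session a 256 MB sort area for the index build,
    -- then drop back to the normal setting
    SET sort_mem = 262144;
    CREATE INDEX big_table_value_idx ON big_table (value);
    RESET sort_mem;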

            regards, tom lane

Re: Non-linear Performance

From
"Peter A. Daly"
Date:
>
>
>>What is the most amount of SORT_MEM it makes sense to allocate?
>>
>
>I've never experimented with it, but certainly the standard default
>(512K = 0.5M) is pretty small for modern machines.  In a 2G machine
>I might try settings around 100M-500M to see what works best.  (Note
>this is just for a one-off btree creation --- for ordinary queries you
>need to allow for multiple sorts going on in parallel, which is one
>reason the default sort_mem is not very large.)
>
I will run some benchmarks and let the list know the results.  If this
can speed it up a large amount, I can get another 6 gig of RAM into this
machine which I hope can let me leave the SORT_MEM at a high enough
amount to speed up our huge nightly batch database reload.

It's a dual Xeon 500 MHz machine.

-Pete


Scaling with memory & disk planning (was Re: Non-linear Performance)

From
Doug Fields
Date:
>Another thought.  If postgres has one of my CPUs at near 100%, does that
>mean I can't get any more performance out of it?  Still, how big can I
>make sort_mem?

I regularly have one CPU pegged at 100% with PostgreSQL on my queries
(tables of 4m-ish rows and up). (Dual P3 1ghz 2GB, Linux 2.2, PG 7.2.1)

I believe it cannot multithread individual queries using several processors.

I believe IBM DB2 Enterprise _can_ do that - but it also costs $20,000 per
CPU and damned if I can figure out how to install it even for the trial.

Let me ask a similar question:

If I ran PostgreSQL with a Linux 2.4 kernel and 6GB of RAM with dual P4 Xeons:

a) Could PostgreSQL use all the RAM?
b) The RAM paging would incur a 2-4% slowdown, any other noted speed
reductions?
c) Could PostgreSQL use this "hyperthreading" that apparently is in these
P4 Xeons making dual processors look like four processors (obviously, on
four separate queries)?
d) How much extra performance does having the log or indices on a different
disk buy you, esp. in the instance where you are inserting millions of
records into a table? An indexed table?

I ask "d" above because as I am considering upgrading to such a box, it
will have an 8-disk RAID controller, and I'm wondering if it is a better
idea to:

a) Run everything on one 7-drive RAID 5 partition (8th drive as hot spare)
b) Run logs as a 2-drive mirror and the rest on a 5-drive RAID 5
c) Run logs on a 2-drive mirror, indices on a 2-drive mirror, and the rest
on a 3-drive RAID5?
d) Run logs & indices on a 2-drive mirror and the rest on a 5-drive RAID 5

Thanks,

Doug


Re: Non-linear Performance

From
Doug Fields
Date:
> >> Increasing SORT_MEM would help the constant factor, however...
> >>
> > What is the most amount of SORT_MEM it makes sense to allocate?
>
>I've never experimented with it, but certainly the standard default
>(512K = 0.5M) is pretty small for modern machines.  In a 2G machine
>I might try settings around 100M-500M to see what works best.  (Note

I currently use 1.4 gigs for "shared mem" in my database (out of 2G) - I
couldn't get PostgreSQL to run with more than that (it might be an OS
limit, Linux 2.2).

I also use 64 megs as a SORT_MEM setting, on the theory that 4 sorts at
once will leave just enough RAM for the rest of my system not to swap.

Cheers,

Doug



Re: Scaling with memory & disk planning (was Re: Non-linear Performance)

From
Tom Lane
Date:
Doug Fields <dfields-pg-general@pexicom.com> writes:
> d) How much extra performance does having the log or indices on a different
> disk buy you, esp. in the instance where you are inserting millions of
> records into a table? An indexed table?

Keeping the logs on a separate drive is a big win, I believe, for heavy
update situations.  (For read-only queries, of course the log doesn't
matter.)

Keeping indexes on a separate drive is also traditional database advice,
but I don't have any feeling for how much it matters in Postgres.

> a) Run everything on one 7-drive RAID 5 partition (8th drive as hot spare)
> b) Run logs as a 2-drive mirror and the rest on a 5-drive RAID 5
> c) Run logs on a 2-drive mirror, indices on a 2-drive mirror, and the rest
> on a 3-drive RAID5?
> d) Run logs & indices on a 2-drive mirror and the rest on a 5-drive RAID 5

You could probably get away without mirroring the indices, if you are
willing to incur a little downtime to rebuild them after an index drive
failure.  So another possibility is

2-drive mirror for log, 1 plain old drive for indexes, rest for data.

If your data will fit on 2 drives then you could mirror both, still have
your 8th drive as hot spare, and feel pretty secure.

Note that while it is reasonably painless to configure PG with WAL logs
in a special place (after initdb, move the pg_xlog subdirectory and make
a symlink to its new location), it's not currently easy to separate
indexes from data.  So the most practical approach in the short term is
probably your (b).

            regards, tom lane

Re: Scaling with memory & disk planning

From
Kurt Gunderson
Date:
Bear in mind that I am a newbie to the PostgreSQL world but have
experience in other RDBMSs when I ask this question:

If you are looking for the best performance, why go with a RAID5 as
opposed to a RAID1+0 (mirrored stripes) solution?  Understandably RAID5
is a cheaper solution requiring fewer drives for redundancy but, from my
experience, RAID5 chokes horribly under heavy disk writing.  RAID5
always requires at least two write operations for every block written;
one to the data and one to the redundancy algorithm.

Is this wrong?

(I mean no disrespect)



Actual Marketing happening

From
Joshua Drake
Date:
Hello,

 I noticed recently a bunch of people talking about making sure that
marketing is happening with PostgreSQL. I thought I would drop a line to
let you guys know that we are marketing on Google, as well as SysAdmin
magazine.

 I also know that dbExperts is marketing on Google.

J


Re: Scaling with memory & disk planning (was Re: Non-linear Performance)

From
"Steve Wolfe"
Date:
> Keeping indexes on a separate drive is also traditional database advice,
> but I don't have any feeling for how much it matters in Postgres.

  My gut feeling is that you'd be better off taking two drives and
striping them RAID 0 instead of putting data on one and indexes on
another.  Either way, you have the same potential total throughput, but if
they're in the RAID array, then you can use the total throughput more
easily, and in more situations - you don't have to depend on reading two
separate pieces of data simultaneously to utilize the entire throughput.

steve



Re: Scaling with memory & disk planning

From
terry@greatgulfhomes.com
Date:
Your RAID analysis is a bit wrong.
In striping (disk joining) every byte written requires 1 byte sent to 1
disk.  This gives you ZERO redundancy:  RAID0 is used purely for making a
large partition from smaller disks.

In RAID1 (mirroring) Every 1 byte written requires 1 byte written to EACH of
the 2 mirrored disks, for total disk IO of 2 bytes.

In RAID5, the most efficient solution, every 1 byte written requires LESS
than 1 byte written for the CRC.  Roughly (depending on implementation,
number of disks) every 3 bytes written requires 4 bytes of disk IO.

RAID5 is the fastest from an algorithm standpoint.  There are some gotchas:
RAID5 implemented by hardware is faster than RAID5 implemented by the OS, simply
because the controller on the SCSI card acts like a parallel processor.

RAID5 also wastes the least amount of disk space.


What is the cheapest is a relative thing; what is certain is that RAID 5
requires more disks (at least 3) than mirroring (exactly 2), but RAID5
wastes less space, so the cost analysis begins with a big "it depends...".

Any disk system will choke under heavy load, especially if the disk write
system is inefficient (like IBM's IDE interface).  I think if you did a
test, you would find RAID1 would choke more than RAID5 simply because RAID1
requires MORE disk IO for the same bytes being saved.

Referring to what Tom Lane said, he recommends 7 drive RAID5 for a very good
reason:  The more the drives, the faster the performance.  Here's why:
write 7 bytes on a 7 drive RAID5 and the first byte goes to drive 1, the 2nd byte
to drive 2, etc., and the CRC to the final drive.  For high performance SCSI
systems, whose BUS IO is faster than the drives (and most SCSI IO chains ARE
faster than the drives they are attached to), the drives actually write in
PARALLEL.  I can give you a more detailed example, but suffice it to say that
with RAID5, writing 7 bytes to 7 data drives takes about the same time as
writing 3 or 4 bytes to a single non-RAID drive.  That, my friends, is why
RAID5 (especially when done by hardware) actually improves performance.



Terry Fielder
Network Engineer
Great Gulf Homes / Ashton Woods Homes
terry@greatgulfhomes.com



Re: Scaling with memory & disk planning

From
Jean-Luc Lachance
Date:
I think your understanding of RAID 5 is wrong also.

For a general N disk RAID 5 the process is:
1)Read sector
2)XOR with data to write
3)Read parity sector
4)XOR with result above
5)write data
6)write parity

So you can see, for every logical write, there are two reads and two
writes.

For a 3-disk RAID 5 the process can be shortened:
1)Write data
2)Read other disk
3)XOR with data
4)Write to parity disk.

So, two writes and one read.

JLL



Re: Scaling with memory & disk planning

From
"Steve Wolfe"
Date:
> If you are looking for the best performance, why go with a RAID5 as
> opposed to a RAID1+0 (mirrored stripes) solution?  Understandably RAID5
> is a cheaper solution requiring fewer drives for redundancy but, from my
> experience, RAID5 chokes horribly under heavy disk writing.  RAID5
> always requires at least two write operations for every block written;
> one to the data and one to the redundancy algorithm.
>
> Is this wrong?

  Here's my take on it...

  If you have enough RAM to keep everything buffered/cached, and fsync()
is turned off, then the speed of the disk subsystem becomes vastly less
important - you'll only read the data once (first couple of queries), and
then the disks will sit idle.  The lights on the disks on my DB machine
only flicker once per minute or less.  If that's the case, then I'd
rather use RAID 5 with a hot spare or two, to increase storage capacity
over 1+0, and read speed as well.

  Of course, if you write tremendous amounts of data, and your data set is
larger than you can cache/buffer, that all flies out the window.

steve



Re: Scaling with memory & disk planning

From
terry@greatgulfhomes.com
Date:
My simplification was intended, anyway it still equates to the same, because
in a performance machine (lots of memory) reads are (mostly) pulled from
cache (not disk IO).  So the real cost is disk writes, and 2 = 2.

Terry Fielder
Network Engineer
Great Gulf Homes / Ashton Woods Homes
terry@greatgulfhomes.com



Re: Scaling with memory & disk planning

From
Kurt Gunderson
Date:
This still troubles me.

terry@greatgulfhomes.com wrote:

> In striping (disk joining) every byte written requires 1 byte sent to 1
> disk.  This gives you ZERO redundancy:  RAID0 is used purely for making a
> large partition from smaller disks.


Not necessarily.  RAID0 absolutely shines in 'concurrent' access to
disk.  When a hundred people want a hundred different records from disk,
the chance becomes greater that any needle on any disk in the striped
set will be found near the block where the record exists.  Small gains
for a small shop but when multiplied with hundreds/thousands of
concurrent users across many Gigs of data the benefits add up.

Likewise, because a data block is written across many disks (depending
on stripe size) the task can 'almost' be done concurrently by the controller.


> In RAID1 (mirroring) Every 1 byte written requires 1 byte written to EACH of
> the 2 mirrored disks, for total disk IO of 2 bytes.


I agree and, to go further, on some (most?) RAID disk controllers,
when reading a block of data, the first disk of the mirrored pair (given
the location of the needle on the spindle) to locate the data will
return the block to the controller.

Likewise, when writing to the mirrored pair (and using 'write-through',
never 'write-back'), the controller will pass along the 'data written'
flag to the CPU when the first disk of the pair writes the data.  The
second will sync eventually but the controller need not wait for both.

So in combining these technologies and utilizing RAID1+0, you gain all
the benefits of striping and mirroring.


> In RAID5, the most efficient solution, every 1 byte written requires LESS
> than 1 byte written for the CRC.  Roughly (depending on implementation,
> number of disks) every 3 bytes written requires 4 bytes of disk IO.
>
> RAID5 is the fastest from an algorithm standpoint.  There are some gotchas:
> RAID5 implemented by hardware is faster than RAID5 implemented by the OS, simply
> because the controller on the SCSI card acts like a parallel processor.
>
> RAID5 also wastes the least amount of disk space.


In addition to the additional writes, RAID5 still requires at least the
computation of the CRC based on the amount of data written even if it
can concurrently write that data across many disks.  You can achieve the
same efficiency with RAID0 less the cost of calculating the CRC and gain
the integrity/performance by using RAID1.


> What is the cheapest is a relative thing; what is certain is that RAID 5
> requires more disks (at least 3) than mirroring (exactly 2), but RAID5
> wastes less space, so the cost analysis begins with a big "it depends...".
>
> Any disk system will choke under heavy load, especially if the disk write
> system is inefficient (like IBM's IDE interface).  I think if you did a
> test, you would find RAID1 would choke more than RAID5 simply because RAID1
> requires MORE disk IO for the same bytes being saved.

>

> Referring to what Tom Lane said, he recommends 7 drive RAID5 for a very good
> reason:  The more the drives, the faster the performance.  Here's why:
> write 7 bytes on a 7 drive RAID5 and the first byte goes to drive 1, the 2nd byte
> to drive 2, etc., and the CRC to the final drive.  For high performance SCSI
> systems, whose BUS IO is faster than the drives (and most SCSI IO chains ARE
> faster than the drives they are attached to), the drives actually write in
> PARALLEL.  I can give you a more detailed example, but suffice it to say that
> with RAID5, writing 7 bytes to 7 data drives takes about the same time as
> writing 3 or 4 bytes to a single non-RAID drive.  That, my friends, is why
> RAID5 (especially when done by hardware) actually improves performance.


I fully agree with you that hardware RAID is the way to go.

I would still place log files (or sequential transaction files) on a
RAID1 because movement of the needle on the disk of the mirrored pair is
minimal.



Re: Scaling with memory & disk planning

From
Scott Marlowe
Date:
On Thu, 30 May 2002 terry@greatgulfhomes.com wrote:

I agree with ALMOST everything you say, but have a few minor nits to pick.
Nothing personal, just my own experience on RAID testing and such.

> In RAID5, the most efficient solution, every 1 byte written requires LESS
> than 1 byte written for the CRC.

This isn't true for any RAID 5 implementation I'm familiar with.  The
parity stripe is exactly the same size as the data stripes it shares space
with.  But this isn't a really important point, since most RAID arrays write
8k to 128k at a time.  Note that it's not CRC (Cyclic Redundancy Check)
that gets written, but straight XOR, hence no space savings.

> Roughly (depending on implementation,
> number of disks) every 3 bytes written requires 4 bytes of disk IO.
>
> RAID5 is the fastest from an algorithm standpoint.  There are some gotchas:
> RAID5 implemented by hardware is faster than RAID5 implemented by the OS, simply
> because the controller on the SCSI card acts like a parallel processor.

This is most definitely not always true, even given equal hardware specs
(i.e. number and type of drives / interfaces all the same).

My old AMI Megaraid card with 3 Ultra Wide SCSI ports can generate 64 Megs
of parity data per second.  My Celeron 1.1GHz machine can generate
2584 Megs of parity data per second.  The load on the CPU under VERY
heavy reads and writes is about 0.3%, and the max read throughput
on a RAID array of 4 VERY old (non-ultra, non-wide 7200 RPM) 2 Gig drives is
about 33 Megs a second.

Same setup with 7200 4 Gig Ultra narrow drives on an AMI raid can read at
about 14 Megabytes a second.

The most important part of fast RAID is the drives first, the interface
second, and hardware versus software RAID last.  While it was often true
in the dark past of 33 MHz CPUs and such that hardware RAID was
always faster, it is often much better to take the extra money a RAID
controller would cost you and just buy more drives or cheaper controllers
or memory or CPUs.

Generally speaking, I've found RAID5 with 4 or fewer drives to be about
even with RAID1, while RAID5 with 6 or more drives quickly starts to
outrun a two drive mirror set.  This is especially true under heavy
parallel access.

On a subject no one's mentioned yet, >2 drives in a RAID1 setup.

I've done some testing with >2 drives in a mirror (NOT 1+0 or 0+1, just
RAID1 with >2 drives) under Linux, and found that if you are doing 90%
reads then it's also a good solution, but for most real world database
apps, it doesn't really help a lot.



Re: Scaling with memory & disk planning

From
terry@greatgulfhomes.com
Date:
What is RAID1 with > 2 drives???

Terry Fielder
Network Engineer
Great Gulf Homes / Ashton Woods Homes
terry@greatgulfhomes.com


Re: Scaling with memory & disk planning

From
Scott Marlowe
Date:
On Thu, 30 May 2002 terry@greatgulfhomes.com wrote:

> What is RAID1 with > 2 drives???

Just that.  A mirror set that writes all changes to all drives.

It costs more for writes, but runs faster for reads, so it's not a good
idea to use one in a heavily written environment, but it really flies for
heavy read setups, like data mining or CRM stuff.  Plus, a single drive
failure is no big deal, you just replace it at your leisure, since you're
still fully redundant.

This means you could do crazy things like build 9 disks into a RAID 0+1 by
making three sets of RAID 1 with three drives, and then put all three
together to make the RAID 0.  Very good read speeds under heavy loads, and
fairly good write speeds as well.

You could run a fairly large database on a setup of 9 36 Gig ultra SCSI
drives like this, and get some nice throughput, and have it be VERY
redundant.  I.e. ANY two drives could fail, and you'd still be up, and if
the two drives that failed were in different RAID1s, you'd still be fully
redundant.

Crazy stuff.


Re: Scaling with memory & disk planning

From
Curt Sampson
Date:
Jean-Luc Lachance writes:

> > I think your understanding of RAID 5 is wrong also.
> >
> > For a general N disk RAID 5 the process is:
> > 1)Read sector
> > 2)XOR with data to write
> > 3)Read parity sector
> > 4)XOR with result above
> > 5)write data
> > 6)write parity

Yes, generally. There are a couple of tricks you can do to help get
around this, though.

One, which works very nicely when doing sequential writes, is to attempt
to hold off on the write until you collect an entire stripe's worth
of data. Then you can calculate the parity based on what's in memory,
and write the new blocks across all of the disks without worrying
about what was on them before. 3ware's Escalade IDE RAID controllers
(the 3W-7x50 series, anyway) do this. Their explanation is at
http://www.3ware.com/NewFaq/general_operating_and_troubleshooting.htm#R5_Fusion_Explained

Another tactic is just to buffer entire stripes. Baydel does this
with their disk arrays, which are actually RAID-3, not RAID-5.
(Since they do only full-stripe reads and writes, it doesn't really
make any difference which they use.) You want a fair amount of RAM
in your controller for buffering in this case, but it keeps the
computers "read, modify, write" cycle on one block from turning
into "read, read, modify, write".

Terry Fielder writes:

> My simplification was intended, anyway it still equates to the same,
> because in a performance machine (lots of memory) reads are (mostly)
> pulled from cache (not disk IO). So the real cost is disk writes, and
> 2 = 2.

Well, it really depends on your workload. If you're selecting stuff
almost completely randomly scattered about a large table (like the
25 GB one I'm dealing with right now), it's going to be a bit pricy
to get hold of a machine with enough memory to cache that effectively.

Kurt Gunderson writes:

] Likewise, when writing to the mirrored pair (and using 'write-through',
] never 'write-back'), the controller will pass along the 'data written'
] flag to the CPU when the first disk of the pair writes the data.  The
] second will sync eventually but the controller need not wait for both.

I hope not! I think that controller ought to wait for both to be written,
because otherwise you can have this scenario:

    1. Write of block X scheduled for drives A and B.
    2. Block written to drive A. Still pending on drive B.
    3. Controller returns "block committed to stable storage" to application.
    4. Power failure. Pending write to drive B is never written.

Now, how do you know, when the system comes back up, that you have
a good copy of the block on drive A, but not on drive B?

cjs
--
Curt Sampson  <cjs@cynic.net>   +81 90 7737 2974   http://www.netbsd.org
    Don't you know, in this new Dark Age, we're all light.  --XTC


Re: Scaling with memory & disk planning (was Re: Non-linear Performance)

From
Curt Sampson
Date:
On Thu, 30 May 2002, Steve Wolfe wrote:

> > Keeping indexes on a separate drive is also traditional database advice,
> > but I don't have any feeling for how much it matters in Postgres.
>
>   My gut feeling is that you'd be better off taking two drives and
> striping them RAID 0 instead of putting data on one and indexes on
> another.  Either way, you have the same potential total throughput, but if
> they're in the RAID array, then you can use the total throughput more
> easily, and in more situations - you don't have to depend on reading two
> separate pieces of data simultaneously to utilize the entire throughput.

I'd go with that too. Looking at the stuff I'm playing with right
now, which is heavily random-read oriented: when I get 5000 rows
back out of a table, the system does about 5020 reads. Since it's
highly unlikely that more than a very few of those rows would be
in the same disk block, I expect that the system is only doing
about 20-30 index block reads here, and the rest is on the table.
Splitting this by index vs. table would put 99% of the burden of
this query on the table's drive.

The log-file situation tends to be somewhat different, since you're
also looking at different types of disk access. Log files just
don't get randomly accessed, and so having only one disk arm
available for them is not really a disadvantage. So it's usually
fine on a single disk (or mirrored pair). Giving more disk arms to
a random-access application (such as pulling blocks out of an
(unsorted) table based on what you retrieved from an index) can
provide a lot of help, though, so spreading it across many disks
makes sense.

cjs
--
Curt Sampson  <cjs@cynic.net>   +81 90 7737 2974   http://www.netbsd.org
    Don't you know, in this new Dark Age, we're all light.  --XTC


Re: Non-linear Performance

From
Curt Sampson
Date:
On Thu, 30 May 2002, Doug Fields wrote:

> I currently use 1.4 gigs for "shared mem" in my database (out of 2G) - I
> couldn't get PostgreSQL to run with more than that (it might be an OS
> limit, Linux 2.2).

Does it just fail to allocate? There may be a kernel parameter you
have to tweak.

I'm presuming that the idea with allocating lots of shared memory
is so that postgres can buffer data blocks from the disk. However,
since postgres uses the filesystem, the operating system will also
buffer disk data, using whatever memory is free to do so. Thus,
increasing the shared memory allocated to postgres will just reduce
the amount of memory available to the OS to do block buffering.
What is the advantage, if any, to having postgres do the buffering
in its shared memory rather than letting the OS do it?

One disadvantage I can think of is that when a back end (or several
back ends) allocates a lot of memory for sorting (assuming you let
them do that), you might end up pushing the sort memory out to your
swap disk, whereas if the OS is doing the buffer management, it
can just buffer fewer file blocks while you're doing the sort,
instead.

cjs
--
Curt Sampson  <cjs@cynic.net>   +81 90 7737 2974   http://www.netbsd.org
    Don't you know, in this new Dark Age, we're all light.  --XTC


Re: Scaling with memory & disk planning (was Re: Non-linear Performance)

From
Curt Sampson
Date:
On Thu, 30 May 2002, Doug Fields wrote:

> If I ran PostgreSQL with a Linux 2.4 kernel and 6GB of RAM with dual P4 Xeons:
> a) Could PostgreSQL use all the RAM?

Yes, but no single process could. Since you're on a 32-bit machine
your maximum address space is 4 GB, and the Linux kernel keeps 1
GB of that for itself, so no process (i.e., no individual backend,
in this case) can address more than 3 GB of RAM.

I'm not sure how the Linux buffer caching works, but there is a
possibility that this kind of thing could give you a very nice
large buffer cache.

> c) Could PostgreSQL use this "hyperthreading" that apparently is in these
> P4 Xeons making dual processors look like four processors (obviously, on
> four separate queries)?

Does Linux come up saying there are 4 processors installed in your
system? If so, yes. If not, no.

> d) How much extra performance does having the log or indices on a different
> disk buy you, esp. in the instance where you are inserting millions of
> records into a table? An indexed table?

When inserting, log on a separate disk buys you a lot. (You don't get
the table and index writes moving the disk head away from the end of the
log file.)

Splitting index and table files may help somewhat, but that's harder to
determine. You always want to throw as many disk arms at them as you
can, but it's usually the safer bet to combine it all on one big stripe
set or raid or whatever, unless you have a very consistent workload and
are willing to play around a lot to determine how you can best optimize
this stuff.

cjs
--
Curt Sampson  <cjs@cynic.net>   +81 90 7737 2974   http://www.netbsd.org
    Don't you know, in this new Dark Age, we're all light.  --XTC


sort_mem sizing (Non-linear Performance)

From
Curt Sampson
Date:
On Thu, 30 May 2002, Tom Lane wrote:

> Curt Sampson <cjs@cynic.net> writes:
> > I'm noticing that performance in creating a particular index, and
> > also a little bit in a simple query, seems somewhat non-linear,
>
> Btree index build is primarily a sort, so cannot have better than
> O(n*log(n)) performance for random data.  Not sure why you'd expect
> linearity.

Not exactly "expect." More "wish," really.

> Increasing SORT_MEM would help the constant factor, however...

Thanks for the tip. I tried that, though not on the monster 500m
row table yet. On the 100m row table, I get some interesting results.
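(For anyone wanting to reproduce this, something like the following
should do it; table and index names here are just placeholders, and
sort_mem is in kilobytes:

    SET sort_mem = 131072;                  -- e.g. 128 MB, this backend only
    DROP INDEX test_value_idx;
    CREATE INDEX test_value_idx ON test (value);

The SET affects only the backend you issue it in, so the server-wide
default stays put.)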

First I confirmed that with the default sortmem = 512, I get the
results I previously got, about an hour:

! system usage stats:
!       3715.756060 elapsed 1975.705112 user 218.771892 system sec
!       [1975.717191 user 218.771892 sys total]
!       29/535 [29/538] filesystem blocks in/out
!       124/1949 [124/2281] page faults/reclaims, 0 [0] swaps
!       0 [0] signals rcvd, 0/0 [0/5] messages rcvd/sent
!       98271/335134 [98271/335139] voluntary/involuntary context switches
! postgres usage stats:
!       Shared blocks:     636965 read,          9 written, buffer hit rate = 0.05%
!       Local  blocks:     218962 read,     218962 written, buffer hit rate = 0.00%
!       Direct blocks:          0 read,          0 written

Next I tried to bump the sortmem up to 256 MB, half the physical RAM
of the machine. I found out the hard way that the back-end doing the
indexing will grow to something over three times the size of sortmem,
and proceeded (slowly) to extricate myself from swap hell.

Next try, 128 MB. This time the process was about 400 MB in size
(350 MB resident), and it did indeed shave more than 20% off my
sort time, clocking in at about 48 1/2 minutes.

! system usage stats:
!       2897.031560 elapsed 2396.215835 user 168.270137 system sec
!       [2396.215835 user 168.285445 sys total]
!       2/678 [2/680] filesystem blocks in/out
!       165926/91843 [165926/92159] page faults/reclaims, 0 [0] swaps
!       0 [0] signals rcvd, 0/0 [0/5] messages rcvd/sent
!       166718/225004 [166750/225010] voluntary/involuntary context switches
! postgres usage stats:
!       Shared blocks:     636979 read,          9 written, buffer hit rate = 0.05%
!       Local  blocks:     218962 read,     218962 written, buffer hit rate = 0.00%
!       Direct blocks:          0 read,          0 written

But I decided I couldn't spare that much memory for a database doing
real work, too, and backed down to 32 MB. That got the process down to
115 MB (100 MB resident), but oddly enough sped it up even more, shaving
the sort time by more than 35% from the original. (The new time was
about 40 minutes.)

DEBUG:  QUERY STATISTICS
! system usage stats:
!       2386.940418 elapsed 2155.223379 user 178.008478 system sec
!       [2155.233638 user 178.008478 sys total]
!       0/592 [0/594] filesystem blocks in/out
!       100/25113 [100/25399] page faults/reclaims, 0 [0] swaps
!       0 [0] signals rcvd, 0/0 [0/5] messages rcvd/sent
!       2376/274787 [2376/274793] voluntary/involuntary context switches
! postgres usage stats:
!       Shared blocks:     636969 read,         10 written, buffer hit rate = 0.05%
!       Local  blocks:     218962 read,     218962 written, buffer hit rate = 0.00%
!       Direct blocks:          0 read,          0 written

So what's up with that, do you think?

Anyway, over the weekend I might have time to try rebuilding an
index on the 500m row table a couple of times with different sortmem
sizes to see how well that works.

cjs
--
Curt Sampson  <cjs@cynic.net>   +81 90 7737 2974   http://www.netbsd.org
    Don't you know, in this new Dark Age, we're all light.  --XTC


Re: Non-linear Performance

From
Curt Sampson
Date:
On Thu, 30 May 2002, Tom Lane wrote:

> > Queries using that index seem to do this too, though not quite as
> > badly.  Using a very simple query such as "SELECT COUNT(*) FROM
> > table WHERE value = 12345" (where value is the last INT column
> > above that took ages to index), typical query times (including
> > connection overhead) for data not in the cache are 0.6 sec., 11
> > sec. and 72 sec.
>
> I guess that the smaller datasets would get proportionally more benefit
> from kernel disk caching.

Actually, I re-did the 100m row and 500m row queries from a cold
start of the machine, and I still get the same results: 10 sec. vs
70 sec. (Thus, 7x as long to query only 5x as much data.) So I
don't think caching is an issue here.

> Can you demonstrate that it actually did 7000 reads, and not 5000+?
> That extrapolation technique doesn't look to me like it has the
> accuracy to tell the difference.  You might try setting show_query_stats
> (note the results go into the postmaster log, not to the client;
> perhaps we ought to change that someday).

Ah, I should have looked up and used show_query_stats in the first
place. Thanks for the tip. (BTW, yes, it would be really, really cool
if we could turn this on and off on a per-client basis, and have the
results sent to the client. As it stands, this facility is rather
difficult to use to try to debug things on a production system.)
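(For reference, the numbers below come from simply re-running the
queries with query statistics turned on and pulling the "Shared
blocks: ... read" line out of the postmaster log; roughly:

    -- with show_query_stats enabled in the server's configuration
    SELECT count(*) FROM test WHERE value = 12345;

where "test" and the constant stand in for whichever table and value
I was checking.)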

You're right, it is doing around 5000 reads, not 7000. The results for
all queries are very consistent: it does just slightly (5-20) more
reads than the number of rows returned, which is exactly what one would
expect.

So it seems I somehow go from about 100 reads per second to 70
reads per second. Some of that would no doubt be due to having five
times as much area over which to seek. I guess I'd have to play
with some random-read benchmarks to see if that's a reasonable
amount of performance degradation for this disk alone to be
responsible for.

> Also, if you've updated the table at all, there might be some fetches of
> dead tuples involved.

Hm. I did do a few updates, but not to any of the tuples I'm searching on.

> More RAM, perhaps.

Given the size of the table (25 GB for the table alone; about 60 GB with
all of the indices), I don't think more RAM is going to help all that
much. I'm not expecting to get a lot of repeated queries.

cjs
--
Curt Sampson  <cjs@cynic.net>   +81 90 7737 2974   http://www.netbsd.org
    Don't you know, in this new Dark Age, we're all light.  --XTC


Re: Non-linear Performance

From
Tom Lane
Date:
Curt Sampson <cjs@cynic.net> writes:
> What is the advantage, if any, to having postgres do the buffering
> in its shared memory rather than letting the OS do it?

Not much, if any.  I don't believe in making shared_buffers be more than
(at most) 25% of physical RAM.  In most cases it's just as effective to
keep it smaller.  I would recommend bumping up the default though ;-).
Something in the low thousands (of buffers) is probably a realistic
minimum.

            regards, tom lane

Re: sort_mem sizing (Non-linear Performance)

From
Tom Lane
Date:
Curt Sampson <cjs@cynic.net> writes:
> Next I tried to bump the sortmem up to 256 MB, half the physical RAM
> of the machine. I found out the hard way that the back-end doing the
> indexing will grow to something over three times the size of sortmem,
> and proceeded (slowly) to extricate myself from swap hell.

[ scratches head ... ]  The sort code tries very hard to keep track
of how much it's storing.  I wonder why it's so far off?  Perhaps
failing to allow for palloc overhead?  Remind me again of the exact
datatype you were indexing, please.

> But I decided I couldn't spare that much memory for a database doing
> real work, too, and backed down to 32 MB. That got the process down to
> 115 MB (100 MB resident), but oddly enough sped it up even more, shaving
> the sort time by more than 35% from the original.

Looks like the 128Mb case was swapping a lot (note the page faults
count).  That probably explains the difference.

            regards, tom lane

Re: Non-linear Performance

From
Doug Fields
Date:
> > I currently use 1.4 gigs for "shared mem" in my database (out of 2G) - I
> > couldn't get PostgreSQL to run with more than that (it might be an OS
> > limit, Linux 2.2).
>
>Does it just fail to allocate? There may be a kernel parameter you
>have to tweak.

I'm simply not sure, since I was trying to allocate 1.5gb for it, and
modified the shm-max parameter in the /proc filesystem accordingly.
However, since I got it to work for about 1.3 GB as it turns out (numeric
value 170000), that seemed enough for the moment and I stopped investigating.

>What is the advantage, if any, to having postgres do the buffering
>in its shared memory rather than letting the OS do it?

I monitor the sar and vmstat's regularly and have noticed that PostgreSQL
performs nicely with these settings, and no swap usage occurs. I have this
unfounded assumption that PostgreSQL will keep better track of its shared
memory than the typical LRU disk-block-caching strategy the OS uses. E.g.,
it seems obvious to me that for tables which can't fit entirely in memory,
it makes more sense to keep their indices in the shared memory than the
tables themselves, and so when it needs to rearrange its memory, it would
drop the data tables first. I am no DB programmer (leave that to Tom, he
seems to do a good job :) but since it is a dedicated DB machine, it's so far fine.

>One disadvantage I can think of is that when a back end (or several
>back ends) allocates a lot of memory for sorting (assuming you let
>them do that), you might end up pushing the sort memory out to your
>swap disk, whereas if the OS is doing the buffer management, it
>can just buffer fewer file blocks while you're doing the sort,
>instead.

Yes, this has been noted several times. So far, however, I have not seen
any swap usage. I would be more likely to decrease sort mem first, and
would prefer if I could do all my queries without sorts in the first place!
(wishful thinking)

Cheers,

Doug


Re: Non-linear Performance

From
Tom Lane
Date:
Curt Sampson <cjs@cynic.net> writes:
> On Thu, 30 May 2002, Tom Lane wrote:
>> I guess that the smaller datasets would get proportionally more benefit
>> from kernel disk caching.

> Actually, I re-did the 100m row and 500m row queries from a cold
> start of the machine, and I still get the same results: 10 sec. vs
> 70 sec. (Thus, 7x as long to query only 5x as much data.) So I
> don't think caching is an issue here.

But even from a cold start, there would be cache effects within the
query, viz. fetching the same table block more than once when it is
referenced from different places in the index.  On the smaller table,
the block is more likely to still be in kernel cache when it is next
wanted.

On a pure random-chance basis, you'd not expect that fetching 5k rows
out of 100m would hit the same table block twice --- but I'm wondering
if the data was somewhat clustered.  Do the system usage stats on your
machine reflect the difference between physical reads and reads
satisfied from kernel buffer cache?

Or maybe your idea about extra seek time is correct.

            regards, tom lane

Re: sort_mem sizing (Non-linear Performance)

From
Curt Sampson
Date:
On Fri, 31 May 2002, Tom Lane wrote:

> Curt Sampson <cjs@cynic.net> writes:
> > Next I tried to bump the sortmem up to 256 MB, half the physical RAM
> > of the machine. I found out the hard way that the back-end doing the
> > indexing will grow to something over three times the size of sortmem,
> > and proceeded (slowly) to extricate myself from swap hell.
>
> [ scratches head ... ]  The sort code tries very hard to keep track
> of how much it's storing.  I wonder why it's so far off?  Perhaps
> failing to allow for palloc overhead?  Remind me again of the exact
> datatype you were indexing, please.

It's just an int. The exact table definition is (int, date, int,
int). The last two are both random ints ranging from 0-99999, and
are the slow ones to index.
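(So, roughly this shape; table and column names are just made up for
illustration:

    CREATE TABLE test (
        id     int,
        d      date,
        value  int,        -- random 0-99999; slow to index
        value2 int         -- random 0-99999; likewise
    );

and the indexes in question are plain btrees on those last two columns.)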

The working set of the backend appears to be most of what it's
allocating, too. I'd assumed that it might be allocating two or
three sorts, or something, maybe to merge the sort files.

> Looks like the 128Mb case was swapping a lot (note the page faults
> count).  That probably explains the difference.

Oh, duh. I should read this stuff a bit more closely.

But I wonder why? The working set was well under the RAM size of
the machine.  Perhaps it wasn't touching all the pages in its
working set often enough, and the I/O was driving the backend's
pages out of memory from time to time. It's a known problem with
NetBSD under certain circumstances.

cjs
--
Curt Sampson  <cjs@cynic.net>   +81 90 7737 2974   http://www.netbsd.org
    Don't you know, in this new Dark Age, we're all light.  --XTC


Re: Non-linear Performance

From
Curt Sampson
Date:
On Fri, 31 May 2002, Tom Lane wrote:

> On a pure random-chance basis, you'd not expect that fetching 5k rows
> out of 100m would hit the same table block twice --- but I'm wondering
> if the data was somewhat clustered.

I don't think so. I generated the data myself, and the data are
entirely pseudo-random.  (Unless Perl's PRNG is quite screwed,
which doesn't seem so likely.)

> Do the system usage stats on your machine reflect the difference
> between physical reads and reads satisfied from kernel buffer cache?

Well, not that I've been looking at yet, but it would definitely
be cool if I could figure out a way to do this.

> Or maybe your idea about extra seek time is correct.

If I can get a spare couple of hours, I'll cons up a little benchmark
program to help determine seek time on disks and various parts of
disks, and play a bit.

But on the other hand, it's not like I can do all that much about this
sort of problem, anyway, and I'm kind of doubting that the fault here
lies with postgres.

cjs
--
Curt Sampson  <cjs@cynic.net>   +81 90 7737 2974   http://www.netbsd.org
    Don't you know, in this new Dark Age, we're all light.  --XTC


Re: Non-linear Performance

From
Bruce Momjian
Date:
Peter A. Daly wrote:
> >
> >
> >>What is the most amount of SORT_MEM it makes sense to allocate?
> >>
> >
> >I've never experimented with it, but certainly the standard default
> >(512K = 0.5M) is pretty small for modern machines.  In a 2G machine
> >I might try settings around 100M-500M to see what works best.  (Note
> >this is just for a one-off btree creation --- for ordinary queries you
> >need to allow for multiple sorts going on in parallel, which is one
> >reason the default sort_mem is not very large.)
> >
> I will run some benchmarks and let the list know the results.  If this
> can speed it up a large amount, I can get another 6 gig of RAM into this
> machine which I hope can let me leave the SORT_MEM at a high enough
> amount to speed up our huge nightly batch database reload.
>
> It's a Dual Xeon 500Mhz Machine.

I am curious how you are going to address >4 gig of RAM on a 32-bit
system, especially if you want to address it all from the same process.

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026

Re: Non-linear Performance

From
"Peter A. Daly"
Date:
Bruce Momjian wrote:

>
>I am curious how you are going to address >4 gig of RAM on a 32-bit
>system, especially if you want to address it all from the same process.
>
Frankly, I don't know.  I have been told the machine can hold 8 gig of
ram.  I have heard rumors about how Linux plays some games with how it
allocates memory to use over 4 gigs of memory, but have no direct
knowledge myself of the matter.  Something along the lines of a single
process being able to use up to 4 gig of memory.  Not sure on the
specifics.  Looks like we will not be upgrading the database server for
at least another 60 days (the postgres one at least), and the dual Xeon
(Quad soon?) may have another purpose.  We have a 1.5tb file
server/image rendering machine that is out of space for expansion.  The
Xeon has enough PCI slots to let us add more RAID arrays for a lot
longer, where the current server is out of slots.  This being the case,
I have stopped trying to milk speed out of postgres on that machine.

On the positive side of this, I may get a higher end Quad Xeon after
July for the next Postgres box.  We have our eyes on Quad 700 MHz Xeon,
8gig ram boxes.

-Pete



Re: Non-linear Performance

From
Doug Fields
Date:
I have my Dual P4-Xeon 2.4 ghz 8gb RAM Adaptec SCSI RAID 5 PostgreSQL
server coming in Friday, from my vendor POGO Linux (highly recommended). I
expect most of the 8gb to be used by buffer cache directly by the operating
system. I will let you know how it compares to the dual P3-1ghz 2GB IDE
RAID 1 server we use now.

The server cost just over half the price of a DB2 enterprise single-CPU
license. Thank god for Postgres!

My biggest fear is that the added CPU speed will be sub-linear, and that
the real speed improvement would come if PG would multi-thread individual
queries.

Cheers,

Doug

At 11:42 PM 6/7/2002, Peter A. Daly wrote:
>Bruce Momjian wrote:
>
>>
>>I am curious how you are going to address >4 gig of RAM on a 32-bit
>>system, especially if you want to address it all from the same process.
>Frankly, I don't know.  I have been told the machine can hold 8 gig of
>ram.  I have heard rumors about how Linux plays some games with how it
>allocates memory to use over 4 gigs of memory. but have no direct
>knowledge myself of the matter.  Something alongs the lines of a single
>process being able to use up to 4 gig of memory.  Not sure on the
>specifics.  Looks like we will not be upgrading the database server for at
>least another 60 days (the postgres one at least), and the dual Xeon (Quad
>soon?) may have another purpose.  We have a 1.5tb file server/image
>rendering machine that is out of space for expansion.  The Xeon has enough
>PCI slots to let us add more RAID arrays for a lot longer, where the
>current server is out of slots.  This being the case, I have stopped
>trying to milk speed out of postgres on that machine.
>
>On the positive side of this, I may get a higher end Quad Xeon after July
>for the next Postgres box.  We have our eyes on Quad 700mhx Xeon, 8gig ram
>boxes.
>
>-Pete
>
>
>



Re: Non-linear Performance

From
Neil Conway
Date:
On Sat, 08 Jun 2002 00:28:18 -0400
"Doug Fields" <dfields-pg-general@pexicom.com> wrote:
> My biggest fear is that the added CPU speed will be sub-linear, and that
> the real speed improvement would come if PG would multi-thread individual
> queries.

I'll confess that I can't understand the need for this feature
(even if one ignores the complexity of implementing it).

On most heavily loaded database servers, isn't the number of concurrent
queries far greater than the number of processors? In which case, what
is the point of allowing a single query to be executed by multiple CPUs
concurrently? If your queries are CPU-bound, the other CPUs are going
to be busy executing other queries at the same time (not to mention that
the cost of migrating processes from one CPU to another and back again
would not help).

I can see the benefit when you're executing a small number of
enormously expensive queries (and those queries are CPU-bound);
but that situation is quite rare, IME.

Cheers,

Neil

--
Neil Conway <neilconway@rogers.com>
PGP Key ID: DB3C29FC

Re: Non-linear Performance

From
Doug Fields
Date:
>I can see the benefit when you're executing a small number of
>enormously expensive queries (and those queries are CPU-bound);
>but that situation is quite rare, IME.

Bingo. I have some self-joins (O(N^2) at worst) on VARCHAR fields
which, if you have tables with millions of rows, sometimes take 900
seconds or more of solid CPU - and I have a need to expand from 7-digit to
9-digit row counts - which could mean 10,000 times slower. I can speed that
up by going from a 1 GHz P3 to a 2.4 GHz P4, but not by much.

The best way to solve the above problem is to figure out a more efficient
query, of course, or avoid the need for the query in the first place.
Sometimes easier said than done.

Also, the relatively slow speed for inserts with VARCHAR indices would
conceivably be helped; I often insert data into a temp table and then
INSERT INTO SELECT *... from the temp to the other one to get reasonable
performance. I'd love to be able to turn off MVCC for those kinds of
things; that is, instead of seeing one universal insert of 3 million rows,
I wouldn't mind seeing them dribble in one at a time, thereby saving the
overhead of MVCCing them.
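(The pattern I'm describing is nothing more than the usual staging
trick, with placeholder names:

    CREATE TEMP TABLE staging (id int, name varchar(64), other_stuff text);
    -- ...load the new rows into staging however is convenient...
    INSERT INTO big_table SELECT * FROM staging;

i.e., all the real work happens in the one big INSERT ... SELECT into
the indexed table.)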

Cheers,

Doug


Re: sort_mem sizing (Non-linear Performance)

From
Tom Lane
Date:
Curt Sampson <cjs@cynic.net> writes:
> On Fri, 31 May 2002, Tom Lane wrote:
>> Curt Sampson <cjs@cynic.net> writes:
> Next I tried to bump the sortmem up to 256 MB, half the physical RAM
> of the machine. I found out the hard way that the back-end doing the
> indexing will grow to something over three times the size of sortmem,
> and proceeded (slowly) to extricate myself from swap hell.
>>
>> [ scratches head ... ]  The sort code tries very hard to keep track
>> of how much it's storing.  I wonder why it's so far off?  Perhaps
>> failing to allow for palloc overhead?  Remind me again of the exact
>> datatype you were indexing, please.

> It's just an int.

Hm.  The actual size of an IndexTuple with an integer datum would be
12 bytes, which is how the tuplesort.c code would account for it.
But the palloc allocation to hold it would be 16 data bytes plus the
AllocChunkData header overhead, which could be 8, 12, or 16 bytes
depending on whether you have MEMORY_CONTEXT_CHECKING enabled
(did you configure --enable-cassert?) and on whether MAXALIGN is
4 or 8 bytes on your hardware.  So the palloc overhead could indeed
amount to a factor of nearly 3x.

While looking into this, I came across the following commentary in
tuplesort.c:

 * NOTES about memory consumption calculations:
 *
 * We count space requested for tuples against the SortMem limit.
 * Fixed-size space (primarily the LogicalTapeSet I/O buffers) is not
 * counted, nor do we count the variable-size memtuples and memtupindex
 * arrays.  (Even though those could grow pretty large, they should be
 * small compared to the tuples proper, so this is not unreasonable.)
 *
 * The major deficiency in this approach is that it ignores palloc overhead.
 * The memory space actually allocated for a palloc chunk is always more
 * than the request size, and could be considerably more (as much as 2X
 * larger, in the current aset.c implementation).  So the space used could
 * be considerably more than SortMem says.
 *
 * One way to fix this is to add a memory management function that, given
 * a pointer to a palloc'd chunk, returns the actual space consumed by the
 * chunk.  This would be very easy in the current aset.c module, but I'm
 * hesitant to do it because it might be unpleasant to support in future
 * implementations of memory management.  (For example, a direct
 * implementation of palloc as malloc could not support such a function
 * portably.)
 *
 * A cruder answer is just to apply a fudge factor, say by initializing
 * availMem to only three-quarters of what SortMem indicates.  This is
 * probably the right answer if anyone complains that SortMem is not being
 * obeyed very faithfully.

This note, and the code too, was written some years ago by yours truly;
apparently you're the first to complain about the inaccuracy.

I have changed my mind since then: I now think that adding an
actual-space inquiry function to the memory management API would be a
reasonable thing to do.  We're depending on enough other stuff that
isn't provided by plain malloc() that one more addition won't make
any difference.

Also, the two variable-size arrays mentioned above would add another 8
bytes per tuple, which really isn't negligible compared to 12-byte
tuples.  So that might account for the rest of your 3x growth
observation.  Probably tuplesort needs to include those in its space
calculation.

If we did make these changes then I'd be inclined to tweak the default
SortMem up from 512K to say 1024K; otherwise we'd effectively be
reducing the default sort size because of the extra space being charged
for.

Comments?

BTW, I'm not committing to get this done for 7.3...

            regards, tom lane

Re: sort_mem sizing (Non-linear Performance)

From
Curt Sampson
Date:
On Sun, 9 Jun 2002, Tom Lane wrote:

> ...depending on whether you have MEMORY_CONTEXT_CHECKING enabled
> (did you configure --enable-cassert?)

No, I didn't enable that.

> and on whether MAXALIGN is 4 or 8 bytes on your hardware.

i386, so probably 4 bytes.

> So the palloc overhead could indeed amount to a factor of nearly 3x.

That would seem about right. Possibly it's even more in some way;
even with sortmem set to 32 MB, my backend grows to about 115 MB,
105 MB resident. Normally it's around 3 MB.

> If we did make these changes then I'd be inclined to tweak the default
> SortMem up from 512K to say 1024K; otherwise we'd effectively be
> reducing the default sort size because of the extra space being charged
> for.

I'd be inclined to bump it up to 2-4 MB, actually; machines tend to
be large enough these days that grabbing an extra MB or two when
you need it is not a problem at all.

cjs
--
Curt Sampson  <cjs@cynic.net>   +81 90 7737 2974   http://www.netbsd.org
    Don't you know, in this new Dark Age, we're all light.  --XTC


Re: Non-linear Performance

From
Curt Sampson
Date:
On Fri, 7 Jun 2002, Bruce Momjian wrote:

> I am curious how you are going to address >4 gig of RAM on a 32-bit
> system, especially if you want to address it all from the same process.

Bank switching. :-)

But to be a bit more precise, physical addresses are more than 32
bits. This is not a problem for the kernel; when it maps a page
into an address space, it just hands over a 36-bit (or whatever)
address. It's always going and mapping and unmapping stuff all over
the place anyway, so life is not much different.

For user processes, however, there's a bit of a problem. A single
process can only address 4GB of RAM at any particular moment,
and in fact it's less under Linux (2 GB or 3 GB, depending on how
you built your kernel) because the kernel is using up address
space at the same time. So the OS would have to provide special
system calls for, essentially, bank switching memory in the processes'
address space, and the specific application (postgres in this case)
would have to know how to use them.

But, in fact, this facility already does exist, in a sort of a
kludgy way anyway.  You'd want to try it and see if it actually
works under any specific OS, of course. You ought to be able to
create, say, four or five 1GB SysV shared memory segments, and map them
individually in and out of your address space using the shmat system
call.

cjs
--
Curt Sampson  <cjs@cynic.net>   +81 90 7737 2974   http://www.netbsd.org
    Don't you know, in this new Dark Age, we're all light.  --XTC


Re: Non-linear Performance

From
Martijn van Oosterhout
Date:
On Sat, Jun 08, 2002 at 08:36:50PM -0400, Doug Fields wrote:
> Also, the relatively slow speed for inserts with VARCHAR indices would
> conceivably be helped; I often insert data into a temp table and then
> INSERT INTO SELECT *... from the temp to the other one to get reasonable
> performance. I'd love to be able to turn off MVCC for those kinds of
> things; that is, instead of seeing one universal insert of 3 million rows,
> I wouldn't mind seeing them dribble in one at a time, thereby saving the
> overhead of MVCCing them.

Interesting. I have many indexes on varchar fields and have no such issues.
Of course, inserting them all within a single transaction does speed it up a
lot. But when inserting rows, all MVCC adds is a few more bytes to each row;
there is no speed overhead. I'd be interested to know how MVCC would be
slowing anything down.
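(By "within a single transaction" I just mean the obvious thing, with
illustrative names and values:

    BEGIN;
    INSERT INTO big_table VALUES (1, 'foo');
    INSERT INTO big_table VALUES (2, 'bar');
    -- ...and so on for the whole batch...
    COMMIT;

rather than letting every single INSERT commit, and flush the log, on
its own.)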
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> There are 10 kinds of people in the world, those that can do binary
> arithmetic and those that can't.

Re: Non-linear Performance

From
Doug Fields
Date:
At 08:11 AM 6/10/2002, Martijn van Oosterhout wrote:
>there is no speed overhead. I'd be interested to know how MVCC would be
>slowing anything down.

I'd be interested in being able to turn it off and see if it speeds
anything up. :)

I didn't mean to imply that I _know_ that it slows things down. What I do
know is that I _assume_ it does, and would like to be able to selectively
disable it - in the hope of gaining speed at certain points. I figure the
biggest speed gain will be because PostgreSQL can then skip writing
anything to the transaction log. But then again, I also do not _know_ that.

Cheers,

Doug



Re: Non-linear Performance

From
Bruce Momjian
Date:
Doug Fields wrote:
> At 08:11 AM 6/10/2002, Martijn van Oosterhout wrote:
> >there is no speed overhead. I'd be interested to know how MVCC would be
> >slowing anything down.
>
> I'd be interested in being able to turn it off and see if it speeds
> anything up. :)
>
> I didn't mean to imply that I _know_ that it slows things down. What I do
> know is that I a_ssume_ it does, and would like to be able to selectively
> disable it - in the hope of gaining speed at certain points. I figure the
> biggest speed gain will be because PostgreSQL can then skip writing
> anything to the transaction log. But then again, I also do not _know_ that.

MVCC is an integral part of how PostgreSQL's no-overwriting storage
manager works.  It can't be turned off, and if it could, it wouldn't
provide _any_ speed increase.

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026

Shared Memory Sizing

From
Curt Sampson
Date:
Just going back through old mail, I notice this. So I'm not the only one
with this opinion. I've seen, at least twice in the last week or so,
people make the mistake of devoting about half their memory to postgres
shared memory buffers (the worst thing you can do!). Would someone care
to go around and find all the spots that talk about this and update them
to have more reasonable advice?

cjs
--
Curt Sampson  <cjs@cynic.net>   +81 90 7737 2974   http://www.netbsd.org
    Don't you know, in this new Dark Age, we're all light.  --XTC

On Fri, 31 May 2002, Tom Lane wrote:

> Curt Sampson <cjs@cynic.net> writes:
> > What is the advantage, if any, to having postgres do the buffering
> > in its shared memory rather than letting the OS do it?
>
> Not much, if any.  I don't believe in making shared_buffers be more than
> (at most) 25% of physical RAM.  In most cases it's just as effective to
> keep it smaller.  I would recommend bumping up the default though ;-).
> Something in the low thousands (of buffers) is probably a realistic
> minimum.
>
>             regards, tom lane
>




Re: Shared Memory Sizing

From
Bruce Momjian
Date:
Curt Sampson wrote:
>
> Just going back through old mail, I notice this. So I'm not the only one
> with this opinion. I've seen, at least twice in the last week or so,
> people make the mistake of devoting about half their memory to postgres
> shared memory buffers (the worst thing you can do!). Would someone care
> to go around and find all the spots that talk about this and update them
> to have more reasonable advice?

I advise in my hardware tuning guide that it be increased until paging
starts, then reduce it, and start around 25%.  Is that OK?

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026



Re: Shared Memory Sizing

From
Doug Fields
Date:
>I advise in my hardware tuning guide that it be increased until paging
>starts, then reduce it, and start around 25%.  Is that OK?

This is fine up to a point.

My PostgreSQL server has 8 GB of RAM (running under 2.4.18). I can't allocate
2 GB of shared memory because that basically takes up all the address space of
any individual process, since on i386 a single process can only address 2 or 3
gigs of RAM (the kernel keeps the rest of the 4 GB address space for itself).

So, I've set it to 256megs of shared, and I'll increase it from there as
necessary. The rest, I hope, will be used to keep pages in cache by the
operating system, which will hopefully be able to move them into the shared
memory relatively quickly.

Cheers,

Doug





Re: Shared Memory Sizing

From
Curt Sampson
Date:
On Wed, 26 Jun 2002, Bruce Momjian wrote:

> I advise in my hardware tuning guide that it be increased until paging
> starts, then reduce it, and start around 25%.  Is that OK?

I don't think that's optimal. The general gist is that if the OS
has as much free memory as postgres has shared memory used for
buffering, it's going to be buffering pretty much all the data that
posgres is. So every block that postgres holds is using 8K of memory
that the OS could use to cache a different block.

Now it's a bit more subtle than this in that copying a buffer
between postgres's shared memory and OS memory also has a cost,
and we should try to minimize that. But still, that's a lot cheaper
than moving a buffer between memory and disk.

This sort of thing is one of the main reasons I like the idea of
moving to mmap for data files; it gets rid of both the double
buffering and the copying back-and-forth. (And even if you do end
up copying, you still save the kernel/userland transition, which
is also a little bit expensive.)

cjs
--
Curt Sampson  <cjs@cynic.net>   +81 90 7737 2974   http://www.netbsd.org
    Don't you know, in this new Dark Age, we're all light.  --XTC




Re: Shared Memory Sizing

From
Jan Wieck
Date:
Curt Sampson wrote:
>
> On Wed, 26 Jun 2002, Bruce Momjian wrote:
>
> > I advise in my hardware tuning guide that it be increased until paging
> > starts, then reduce it, and start around 25%.  Is that OK?
>
> I don't think that's optimal. The general gist is that if the OS
> has as much [...]

All right, all right. So far we've seen a couple of responses from you,
all rejecting someone else's advice because of not being optimal or even
being worse.

Can you please enlighten us what the optimum is? And please don't escape
into the mmap solution if you cannot give that answer in diff format.


Jan

--

#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.                                  #
#================================================== JanWieck@Yahoo.com #



Re: Shared Memory Sizing

From
Curt Sampson
Date:
On Thu, 27 Jun 2002, Jan Wieck wrote:

> Can you please enlighten us what the optimum is? And please don't escape
> into the mmap solution if you cannot give that answer in diff format.

Well, first of all, let me note that this is all theoretical work on my
part so far. If someone is actually doing some real testing of this, I'd
be interested in the results. If the results are different from what I
think they should be, something strange is going on.

But you can look back through my posts if you like and tell me if
you disagree with my analysis of why increasing shared memory
buffers is generally going to reduce performance.

The optimum will depend on the number of connections you have and the
type of workload you have. At this point, I'm not even sure about how to
go about determining precisely what would be better and worse; it would
be a lot of work. (Probably a lot more than it's worth, given the prices
of memory these days.)

I'd say, at a rough estimate, go for a number of buffers 2-3 times the
maximum number of connections you allow. Or less if you anticipate
rarely ever having that many connections.

cjs
--
Curt Sampson  <cjs@cynic.net>   +81 90 7737 2974   http://www.netbsd.org
    Don't you know, in this new Dark Age, we're all light.  --XTC




Re: Shared Memory Sizing

From
Jan Wieck
Date:
Curt Sampson wrote:
>
> On Thu, 27 Jun 2002, Jan Wieck wrote:
>
> > Can you please enlighten us what the optimum is? And please don't escape
> > into the mmap solution if you cannot give that answer in diff format.
>
> Well, first of all, let me note that this is all theoretical work on my
> part so far. If someone is actually doing some real testing of this, I'd
> be interested in the results. If the results are different from what I
> think they should be, something strange is going on.

Since none of us actually has done real benchmarks in this area, we are
all just debating out of the blue. So please don't take that personal,
okay?

> The optimum will depend on the number of connections you have and the
> type of workload you have. At this point, I'm not even sure about how to
> go about determining precisely what would be better and worse; it would
> be a lot of work. (Probably a lot more than it's worth, given the prices
> of memory these days.)

Sure, the optimum will depend on the application and its usage profile.
But that's fine tuning, not a rough rule of thumb for general purpose,
and I think we were looking for the latter.

> I'd say, at a rough estimate, go for a number of buffers 2-3 times the
> maximum number of connections you allow. Or less if you anticipate
> rarely ever having that many connections.

Here I disagree. The more shared buffer cache you have, the bigger the
percentage of your database that neither causes read()'s nor memory
copying from the OS buffer cache.

Whatever, we can debate that forever without some numbers. Let's
continue that discussion when we have at least some benchmarking
results.


Jan

--

#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.                                  #
#================================================== JanWieck@Yahoo.com #



Re: Shared Memory Sizing

From
Andrew Sullivan
Date:
On Thu, Jun 27, 2002 at 11:27:33AM -0400, Jan Wieck wrote:

> Here I disagree. The more shared buffer cache you have, the bigger the
> percentage of your database that neither causes read()'s nor memory
> copying from the OS buffer cache.

For what it's worth, truss on Solaris 7 indicates that copying from
OS buffer costs a few milliseconds (between 2 and 8 in our tests,
depending on load), at least in our tests.

A

--
----
Andrew Sullivan                               87 Mowat Avenue
Liberty RMS                           Toronto, Ontario Canada
<andrew@libertyrms.info>                              M6K 3E3
                                         +1 416 646 3304 x110




Re: Shared Memory Sizing

From
"Glen Parker"
Date:
I guess I'll add my 2cents here...  I run a Linux/Postgres server with
1GB of ram, and give postgres 48049 shared buffers.  This lets the
backends top out around 400MB each when they really get going.

The effect I've seen is that more buffers don't hurt.  However, adding
more buffers stops making a significant difference at some low number
that is a small fraction of what I use (and would obviously depend
heavily on usage).  Now assuming that the server is dedicated to
postgres (which mine is), I can't see any reason not to give almost all
available physical memory to postgres.  The whole point of the OS disk
caching is for *shared* caching, so if only one application is using the
majority of the disk cache, what does it hurt to just let that
application have it all as private memory?  If there were no cost
associated with copying pages from OS cache to user land, then I'd say
it makes more sense to give almost no memory to the postgres cache, but
obviously there is a cost; the more often postgres has to ask the OS for
a page, the worse performance will be.  The more memory you give to
postgres, the more often asking the OS for a page will result in a disk
read.  It seems optimal to me to *never* ask the OS for a page unless it
has to be pulled from the disk.

Obviously a lot of people disagree with this...  I'd like to understand
why this approach is considered incorrect when postgres rather owns the
machine?

Glen Parker
glenebob@nwlink.com


>
> Curt Sampson wrote:
> >
> > Just going back through old mail, I notice this. So I'm not the only one
> > with this opinion. I've seen, at least twice in the last week or so,
> > people make the mistake of devoting about half their memory to postgres
> > shared memory buffers (the worst thing you can do!). Would someone care
> > to go around and find all the spots that talk about this and update them
> > to have more reasonable advice?
>
> I advise in my hardware tuning guide that it be increased until paging
> starts, then reduce it, and start around 25%.  Is that OK?
>
> --
>   Bruce Momjian                        |  http://candle.pha.pa.us
>   pgman@candle.pha.pa.us               |  (610) 853-3000
>   +  If your life is a hard drive,     |  830 Blythe Avenue
>   +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
>





Re: Shared Memory Sizing

From
Andrew Sullivan
Date:
On Thu, Jun 27, 2002 at 01:44:56PM -0700, Glen Parker wrote:

> Obviously a lot of people disagree with this...  I'd like to understand
> why this approach is considered incorrect when postgres rather owns the
> machine?

If you starve other things on the machine for memory, you'll cause
swapping.

Consider all the other things you're doing on the machine -- just
little things, like cron and such.  All that takes memory.
Therefore, it's dangerous not to let the OS manage a good chunk of
memory.

There also appears to be a diminishing returns problem: at a certain
point, you're unlikely to need more shared space, and if you do
something else on the machine that could use the memory, you're
throwing it away.  But I don't see that you're wrong in principle.
Just don't get it wrong -- I _have_ caused a 16 gig machine to swap.
It's not fun.

A

--
----
Andrew Sullivan                               87 Mowat Avenue
Liberty RMS                           Toronto, Ontario Canada
<andrew@libertyrms.info>                              M6K 3E3
                                         +1 416 646 3304 x110




Re: Shared Memory Sizing

From
Curt Sampson
Date:
On Thu, 27 Jun 2002, Jan Wieck wrote:

> Since none of us actually has done real benchmarks in this area, we are
> all just debating out of the blue. So please don't take that personal,
> okay?

Well, I have a pretty good knowledge of how Unix operating systems work
internally (I've been a NetBSD developer for about six years now), so
it's not just out of the blue. However, I will always bow to hard data.

> Sure, the optimum will depend on the application and its usage profile.
> But that's fine tuning, not a rough rule of thumb for general purpose,
> and I think we were looking for the latter.

Good. That's about all I can give.

> > I'd say, at a rough estimate, go for a number of buffers 2-3 times the
> > maximum number of connections you allow. Or less if you anticipate
> > rarely ever having that many connections.
>
> Here I disagree. The more shared buffer cache you have, the bigger the
> percentage of your database that neither causes read()'s nor memory
> copying from the OS buffer cache.

Certainly. But overall, you will cache a smaller number of blocks
because you will be buffering them twice. When you copy a block
from the OS buffer to shared memory, the copy still exists in the
OS buffer. So that block is now buffered twice.

For most workloads, in the long run, that will force you to do disk
I/O that you would not have had to do otherwise. A single disk I/O
is far more expensive than hundreds of copies between the OS buffer
cache and postgres' shared memory.

Draw your own conclusions.

cjs
--
Curt Sampson  <cjs@cynic.net>   +81 90 7737 2974   http://www.netbsd.org
    Don't you know, in this new Dark Age, we're all light.  --XTC




Re: Shared Memory Sizing

From
Curt Sampson
Date:
On Thu, 27 Jun 2002, Andrew Sullivan wrote:

> For what it's worth, truss on Solaris 7 indicates that copying from
> OS buffer costs a few milliseconds (between 2 and 8 in our tests,
> depending on load), at least in our tests.

Ouch! Are you sure you mean MILLIseconds, rather than MICROSECONDS?

    8K in 2 ms. = 4 MB/sec.
    8K in 8 usec. = 976 MB/sec.

The latter would be a much more reasonable figure for memory bandwidth...

cjs
--
Curt Sampson  <cjs@cynic.net>   +81 90 7737 2974   http://www.netbsd.org
    Don't you know, in this new Dark Age, we're all light.  --XTC




Re: Shared Memory Sizing

From
Curt Sampson
Date:
On Thu, 27 Jun 2002, Andrew Sullivan wrote:

> Consider all the other things you're doing on the machine -- just
> little things, like cron and such.  All that takes memory.
> Therefore, it's dangerous not to let the OS manage a good chunk of
> memory.

Postgres itself doesn't manage all of its own memory usage. Sorts
and things, for example, are allocated from OS-owned memory, not
postgres' shared memory. So if you give most of the machine memory
over to postgres shared buffers, postgres itself might drive the
machine into swapping, because the shared buffers are tying up memory
that the sort needs but can't use. That's why it's better to lean toward
the minimum shared memory rather than the maximum.

cjs
--
Curt Sampson  <cjs@cynic.net>   +81 90 7737 2974   http://www.netbsd.org
    Don't you know, in this new Dark Age, we're all light.  --XTC




Re: Shared Memory Sizing

From
Andrew Sullivan
Date:
On Fri, Jun 28, 2002 at 01:13:01PM +0900, Curt Sampson wrote:
> On Thu, 27 Jun 2002, Andrew Sullivan wrote:
>
> > For what it's worth, truss on Solaris 7 indicates that copying from
> > OS buffer costs a few milliseconds (between 2 and 8 in our tests,
> > depending on load), at least in our tests.
>
> Ouch! Are you sure you mean MILLIseconds, rather than MICROSECONDS?

Doh.  Had a braino.   Thanks.  Yes, microseconds.

A

--
----
Andrew Sullivan                               87 Mowat Avenue
Liberty RMS                           Toronto, Ontario Canada
<andrew@libertyrms.info>                              M6K 3E3
                                         +1 416 646 3304 x110




Re: Shared Memory Sizing

From
Vivek Khera
Date:
>>>>> "CS" == Curt Sampson <cjs@cynic.net> writes:

CS> For most workloads, in the long run, that will force you to do disk
CS> I/O that you would not have had to do otherwise. A single disk I/O
CS> is far more expensive than hundreds of copies between the OS buffer
CS> cache and postgres' shared memory.

It occurs to me that these two opposing factors could be plotted and
make somewhat of a line each.  The ideal point would be where they
crossed.

Perhaps one day when I've got a few hours I'll copy over my large
dataset to my devel server and try to mimic the load with different
shm sizes...

--
=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=-=
Vivek Khera, Ph.D.                Khera Communications, Inc.
Internet: khera@kciLink.com       Rockville, MD       +1-240-453-8497
AIM: vivekkhera Y!: vivek_khera   http://www.khera.org/~vivek/

Re: Shared Memory Sizing

From
Bruce Momjian
Date:
Curt Sampson wrote:
> Certainly. But overall, you will cache a smaller number of blocks
> because you will be buffering them twice. When you copy a block
> from the OS buffer to shared memory, the copy still exists in the
> OS buffer. So that block is now buffered twice.

You have the block in the kernel buffer when you copy it to the
PostgreSQL buffers, but nothing says you have to keep that block in the
kernel buffers while PostgreSQL has it.  Only when it is written does it
return to the kernel, and if it is only read, it never returns to the
kernel.

> For most workloads, in the long run, that will force you to do disk
> I/O that you would not have had to do otherwise. A single disk I/O
> is far more expensive than hundreds of copies between the OS buffer
> cache and postgres' shared memory.

Yes, if you swap, you went too far.  That has always been the upper
limit.

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026



Re: Shared Memory Sizing

From
Curt Sampson
Date:
On Fri, 28 Jun 2002, Bruce Momjian wrote:

> You have the block in the kernel buffer when you copy it to the
> PostgreSQL buffers, but nothing says you have to keep that block in the
> kernel buffers while PostgreSQL has it.  Only when it is written does it
> return to the kernel, and if it is only read, it never returns to the
> kernel.

Sure. But the OS doesn't know that after you read a block it may
immediately abandon buffering of that block. It's going to use the
same strategy it does for any other block, which is to keep it in
memory for a while, preferring to get rid of older blocks.

If you had a way to tell the OS, "I'm buffering block X, so you
don't need to" that would be one thing. But you don't.

> > For most workloads, in the long run, that will force you to do disk
> > I/O that you would not have had to do otherwise. A single disk I/O
> > is far more expensive than hundreds of copies between the OS buffer
> > cache and postgres' shared memory.
>
> Yes, if you swap, you went too far.  That has always been the upper
> limit.

Not just swap: disk I/O for blocks that would have been buffered
by the OS if you hadn't been using memory that it could use.

cjs
--
Curt Sampson  <cjs@cynic.net>   +81 90 7737 2974   http://www.netbsd.org
    Don't you know, in this new Dark Age, we're all light.  --XTC




Re: Shared Memory Sizing

From
Bruce Momjian
Date:
Curt Sampson wrote:
> On Fri, 28 Jun 2002, Bruce Momjian wrote:
>
> > You have the block in the kernel buffer when you copy it to the
> > PostgreSQL buffers, but nothing says you have to keep that block in the
> > kernel buffers while PostgreSQL has it.  Only when it is written does it
> > return to the kernel, and if it is only read, it never returns to the
> > kernel.
>
> Sure. But the OS doesn't know that after you read a block it may
> immediately abandon buffering of that block. It's going to use the
> same strategy it does for any other block, which is to keep it in
> memory for a while, preferring to get rid of older blocks.
>
> If you had a way to tell the OS, "I'm buffering block X, so you
> don't need to" that would be one thing. But you don't.
>
> > > For most workloads, in the long run, that will force you to do disk
> > > I/O that you would not have had to do otherwise. A single disk I/O
> > > is far more expensive than hundreds of copies between the OS buffer
> > > cache and postgres' shared memory.
> >
> > Yes, if you swap, you went too far.  That has always been the upper
> > limit.
>
> Not just swap: disk I/O for blocks that would have been buffered
> by the OS if you hadn't been using memory that it could use.

Yes, but how does a smaller PostgreSQL buffer and larger kernel buffer
fix this?  It seems it only makes it worse because PostgreSQL is loading
its buffers more often.

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026