Thread: Need help with 8.4 Performance Testing

Need help with 8.4 Performance Testing

From
Josh Berkus
Date:
Database performance geeks,

We have a number of patches pending for 8.4 designed to improve database
performance in a variety of circumstances.  We need as many users as possible
to build test versions of PostgreSQL with these patches, and test how well
they perform, and report back in some detail.

Particularly helpful are users with unusual hardware architectures (16 or more
cores, ARM, Power, SSD, NFS-mounted data) or operating systems (Solaris, OSX,
Windows-64).  Testers need to be familiar with building PostgreSQL from source
and patching it, as well as with basic PostgreSQL tuning (except for the
Wizard Patch), and should have some kind of performance test available,
ideally something based on your own application use.

If you are going to use pgbench to test, *please* read Greg Smith's notes
first:
http://www.westnet.com/~gsmith/gregsmith/content/postgresql/pgbench-scaling.htm

The Wiki (http://wiki.postgresql.org/wiki/CommitFest_2008-11) has a full list
of patches, but below are the ones in particular we could use help with.

You *do* need to read the entire mail threads which I link to below to
understand the patches.  Thanks for your help!

Proposal of PITR performance improvement (Koichi Suzuki):
http://archives.postgresql.org/message-id/a778a7260811270404g49254640x8ed58b12b7c65d0b@mail.gmail.com
http://archives.postgresql.org/message-id/a778a7260810280033n43f70d36x8c437eacf9a5461e@mail.gmail.com

Simple postgresql.conf wizard
http://archives.postgresql.org/message-id/Pine.GSO.4.64.0811012101220.17619@westnet.com
http://archives.postgresql.org/message-id/Pine.GSO.4.64.0811291403040.12885@westnet.com

Improve Performance of Multi-Batch Hash Join for Skewed Data Sets
http://archives.postgresql.org/message-id/6EEA43D22289484890D119821101B1DF2C1683@exchange20.mercury.ad.ubc.ca
http://archives.postgresql.org/message-id/1924d1180811051606w19aaf30du589e8ea10ea5534d@mail.gmail.com
http://archives.postgresql.org/message-id/22901.1227228246@sss.pgh.pa.us

Window Functions
http://archives.postgresql.org/message-id/e08cc0400810270912u49a6ec83vc23984c01f368f76@mail.gmail.com
http://archives.postgresql.org/message-id/492D1246.5070101@enterprisedb.com
http://archives.postgresql.org/message-id/839FB90FF49D4120B7107ED0D7B3E5B6@amd64

parallel restore
(especially need to test on 16+ cores)
http://archives.postgresql.org/message-id/490878AC.1@dunslane.net

B-Tree emulation for GIN
http://archives.postgresql.org/message-id/491B1888.9020903@sigaev.ru
http://archives.postgresql.org/message-id/Pine.LNX.4.64.0811191828050.7862@sn.sai.msu.ru

Also, the following patches currently still have bugs, but when the bugs are
fixed I'll be looking for performance testers, so please either watch the
wiki or watch this space:

-- Block-level CRC checks (Alvaro Herrera)
-- Auto Partitioning Patch (Nikhil Sontakke)
-- posix_fadvise (Gregory Stark)
-- Hash Join-Filter Pruning using Bloom Filters
-- On-disk bitmap indexes

Please report your results, with the patch name in the subject line, on this
mailing list or on -hackers.  Thank you; your help will get a better 8.4
out sooner.

--
Josh Berkus
PostgreSQL
San Francisco

Re: Need help with 8.4 Performance Testing

From
Kenneth Marshall
Date:
Josh,

Since a number of these performance patches use our hash function, would
it make sense to apply the latest patch upgrading the single-function mix()
to the two-function mix()/final()?  The additional changes increase the
performance of the hash function by another 50% or so.  My two cents.

Regards,
Ken

On Sun, Dec 07, 2008 at 11:38:01AM -0800, Josh Berkus wrote:

Re: Need help with 8.4 Performance Testing

From
"Scott Marlowe"
Date:
On Sun, Dec 7, 2008 at 12:38 PM, Josh Berkus <josh@agliodbs.com> wrote:

I've got a pair of 8-core Opteron, 16-drive machines I would like to
test it on.  If nothing else I'll just take queries from the log to
run against an 8.4 install.  It'll have to be late at night, though...

> If you are going to use pgbench to test, *please* read Greg Smith's notes
> first:
> http://www.westnet.com/~gsmith/gregsmith/content/postgresql/pgbench-scaling.htm

When I last used pgbench I wanted to test it with an extremely large
dataset, but it maxes out at -s 4xxx or so, and that's only in the
40 gigabyte range.  Is the limit raised for the pgbench included in
contrib in 8.4?  I'm guessing it's an arbitrary limit.

Re: Need help with 8.4 Performance Testing

From
Derek Lewis
Date:
I'll be glad to test the patches using pgbench on my POWER4 box
running AIX 5.3 and an IA64 that runs HP-UX 11.31.

Derek

On Dec 7, 2008, at 2:38 PM, Josh Berkus <josh@agliodbs.com> wrote:


Re: Need help with 8.4 Performance Testing

From
Greg Smith
Date:
On Sun, 7 Dec 2008, Scott Marlowe wrote:

> When I last used pgbench I wanted to test it with an extremely large
> dataset, but it maxes out at -s 4xxx or so, and that's only in the
> 40Gigabyte range.  Is the limit raised for the pgbench included in
> contrib in 8.4?  I'm guessing it's an arbitrary limit.

There's no artificial limit, just ones that result from things like
integer overflow.  I don't think this has been an issue so far because
pgbench becomes seek limited and stops producing interesting results once
the database exceeds the sum of all available caching, which means you'd
need more than 32GB of RAM in the system running pgbench before this is an
issue.  Which happens to be the largest size system I've ever run it on...

I'd expect this statement around line 1060 of pgbench.c
to overflow first:

     for (i = 0; i < naccounts * scale; i++)

where i is an integer, naccounts = 100,000, and scale is what you input.
That will overflow a signed 32-bit integer at a scale of 2147.  If you had
tests that failed at twice that, I wonder if you were actually executing
against a weird data set for scales of 2148..4294.

It's not completely trivial to fix (I just tried); the whole loop needs to
be restructured to run over scale and the 100,000 accounts separately while
keeping the current progress report intact.  I'll take a shot at fixing it
correctly; this is a bug that should be corrected before 8.4 goes out.  I
guarantee that version will be used on systems with 64GB of RAM where this
matters.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD

Re: Need help with 8.4 Performance Testing

From
"Scott Marlowe"
Date:
On Mon, Dec 8, 2008 at 1:15 PM, Greg Smith <gsmith@gregsmith.com> wrote:
> On Sun, 7 Dec 2008, Scott Marlowe wrote:
>
>> When I last used pgbench I wanted to test it with an extremely large
>> dataset, but it maxes out at -s 4xxx or so, and that's only in the
>> 40Gigabyte range.  Is the limit raised for the pgbench included in
>> contrib in 8.4?  I'm guessing it's an arbitrary limit.
>
> There's no artificial limit, just ones that result from things like integer
> overflow.  I don't think has been an issue so far because pgbench becomes
> seek limited and stops producing interesting results once the database
> exceeds the sum of all available caching, which means you'd need more than
> 32GB of RAM in the system running pgbench before this is an issue.  Which
> happens to be the largest size system I've ever ran it on...

Well, I have 32 GB of RAM and wanted to test it against a database
that was at least twice as big as memory.  I'm not sure why you'd
consider the results uninteresting, though; I'd think knowing how the
db will perform with a very large transactional store that is twice or
more the size of memory would be exactly when it starts getting interesting.

Re: Need help with 8.4 Performance Testing

From
Greg Smith
Date:
On Mon, 8 Dec 2008, Scott Marlowe wrote:

> Well, I have 32 Gig of ram and wanted to test it against a database
> that was at least twice as big as memory.  I'm not sure why you'd
> consider the results uninteresting though, I'd think knowing how the
> db will perform with a very large transactional store that is twice or
> more the size of memory would be when it starts getting interesting.

If you refer back to the picture associated with the link Josh suggested:

http://www.westnet.com/~gsmith/gregsmith/content/postgresql/scaling.png

You'll see that pgbench results hit a big drop once you clear the amount
of memory being used to cache the accounts table.  This curve isn't unique
to what I did; I've seen the same basic shape traced out by multiple other
testers on different hardware, independent of me.  It just expands to the
right based on the amount of RAM available.

All I was trying to suggest was that even if you've got 32GB of RAM, you
may already be into the flatter right-hand section of that curve even with
a 40GB database.  That was a little system with 1GB of RAM + 256MB of disk
cache, and it was already toast at 750MB of database.  Once you've got
a database big enough to reach that point, results degrade to something
related to database seeks/second rather than anything else, and further
increases don't give you that much more info.  This is why I'm not sure the
current limit really matters with 32GB of RAM, but it sure will be
important if you want any sort of useful pgbench results at 64GB.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD

Re: Need help with 8.4 Performance Testing

From
"Merlin Moncure"
Date:
On Mon, Dec 8, 2008 at 5:52 PM, Greg Smith <gsmith@gregsmith.com> wrote:


I wonder if shared_buffers has any effect on how far you can go before
you hit the 'tipping point'.

merlin

Re: Need help with 8.4 Performance Testing

From
Greg Smith
Date:
On Mon, 8 Dec 2008, Merlin Moncure wrote:

> I wonder if shared_buffers has any effect on how far you can go before
> you hit the 'tipping point'.

If your operating system has any reasonable caching itself, not so much at
first.  As long as the index on the account table fits in shared_buffers,
even the basic sort of caching logic an OS uses is perfectly functional
for swapping the individual pages of the account table in and out, which is
the main limiting factor on pgbench performance.

There is a further out tipping point I've theorized about but not really
explored:  the point where even the database indexes stop fitting in
memory usefully.  As you get closer to that, I'd expect that the clock
sweep algorithm used by shared_buffers should make it a bit more likely
that those important blocks would hang around usefully if you put them
there, rather than giving most of the memory to the OS to manage.

Since the data is about 7.5X as large as the indexes, that point is way
further out than the basic bottlenecks.  And if you graph pgbench results
on a scale that usefully shows the results for in-memory TPS scores, you
can barely see that part of the chart as well.  One day I may get to
mapping that out better, and if I do it will be interesting to see if the
balance of shared_buffers to OS cache works the way I expect.  I was
waiting until I finished the pgtune program for that; it's building some
of the guts I wanted, to make it easier to tweak postgresql.conf settings
programmatically in between pgbench runs.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD

Re: Need help with 8.4 Performance Testing

From
Gregory Stark
Date:
Greg Smith <gsmith@gregsmith.com> writes:

> On Mon, 8 Dec 2008, Merlin Moncure wrote:
>
>> I wonder if shared_buffers has any effect on how far you can go before
>> you hit the 'tipping point'.
>
> If your operating system has any reasonable caching itself, not so much at
> first.  As long as the index on the account table fits in shared_buffers, even
> the basic sort of caching logic an OS uses is perfectly functional for swapping
> the individual pages of the account table in and out, the main limiting factor
> on pgbench performance.

I would expect higher shared_buffers to raise the curve before the first
breakpoint but after the first breakpoint make the drop steeper and deeper.
The equilibrium where the curve becomes flatter should be lower.

That is, as long as the database fits entirely in RAM, having more of the
buffers immediately in shared buffers is better. Once there's contention
for the precious cache, stealing some of it for duplicated buffers will
only hurt.

> There is a further out tipping point I've theorized about but not really
> explored:  the point where even the database indexes stop fitting in memory
> usefully.  As you get closer to that, I'd expect that the clock sweep algorithm
> used by shared_buffers should make it a bit more likely that those important
> blocks would hang around usefully if you put them there, rather than giving
> most of the memory to the OS to manage.

Hm, sounds logical. At that point the slow drop-off should become even
shallower and possibly become completely flat. Greater shared_buffers might
start helping again at that point.

--
  Gregory Stark
  EnterpriseDB          http://www.enterprisedb.com
  Ask me about EnterpriseDB's RemoteDBA services!

Re: Need help with 8.4 Performance Testing

From
Jean-David Beyer
Date:

Greg Smith wrote:
| On Mon, 8 Dec 2008, Merlin Moncure wrote:
|
|> I wonder if shared_buffers has any effect on how far you can go before
|> you hit the 'tipping point'.
|
| If your operating system has any reasonable caching itself, not so much at
| first.  As long as the index on the account table fits in shared_buffers,
| even the basic sort of caching logic an OS uses is perfectly functional
| for swapping the individual pages of the account table in and out, the
| main limiting factor on pgbench performance.
|
| There is a further out tipping point I've theorized about but not really
| explored:  the point where even the database indexes stop fitting in
| memory usefully.  As you get closer to that, I'd expect that the clock
| sweep algorithm used by shared_buffers should make it a bit more likely
| that those important blocks would hang around usefully if you put them
| there, rather than giving most of the memory to the OS to manage.

I am by no means an expert at this.

But one thing that can matter is whether you want to improve just the
performance of the dbms, or the performance of the entire system, on which
the dbms runs. Because if you want to improve the whole system, you would
want as much of the caching to take place in the system's buffers so the use
of the memory could be optimized over the entire workload, not just the load
of the dbms itself. I suppose on a dedicated system with only one dbms
running with only one database open (at a time, anyway), this might be moot,
but not otherwise.

Now I agree that it would be good to get the entire index (or at least the
working set of the index) into the memory of the computer. But does it
really matter if it is in the system's cache or the postgres cache? Is it
any more likely to be in postgres's cache than in the system cache if the
system is hurting for memory? I would think the system would be equally
likely to page out "idle" pages no matter where they are, unless they are
locked into memory. I do not know if all operating systems can do this,
and even if they can, I do not know if postgres uses that ability. I doubt
it, since I believe (at least in Linux) a process can do that only if run
as root, which I imagine few (if any) users do.


--
~  .~.  Jean-David Beyer          Registered Linux User 85642.
~  /V\  PGP-Key: 9A2FC99A         Registered Machine   241939.
~ /( )\ Shrewsbury, New Jersey    http://counter.li.org
~ ^^-^^ 07:55:02 up 5 days, 18:13, 4 users, load average: 4.18, 4.17, 4.11

Re: Need help with 8.4 Performance Testing

From
Simon Waters
Date:
On Tuesday 09 December 2008 13:08:14 Jean-David Beyer wrote:
>
> and even if they can, I do not know if postgres uses that ability. I doubt
> it, since I believe (at least in Linux) a process can do that only if run
> as root, which I imagine few (if any) users do.

Disclaimer: I'm not a system programmer...

I believe that on Linux kernels 2.6.8 and before, processes need the POSIX
capability CAP_IPC_LOCK to lock memory at all; from 2.6.9 on, they need
CAP_IPC_LOCK only to lock more than RLIMIT_MEMLOCK.

It is a capability, so a process can run as any user provided it was
started with, or has gained, the capability.

No idea if Postgres uses any of this; other than protecting certain
password operations, there is probably not much point. If key parts of
your database are being paged out, get more RAM; if idle parts of your
database are paged out, you could probably apply that RAM more usefully to
something else.

The Varnish cache design is the place to look for enlightenment on relying on
the kernel paging (using memory mapped files) rather than trying to do it
yourself, but then a proxy server is a lot simpler than a RDBMS. That said,
Varnish is fast at what it does (reverse HTTP proxy) !

Re: Need help with 8.4 Performance Testing

From
"Greg Stark"
Date:
On Sun, Dec 7, 2008 at 7:38 PM, Josh Berkus <josh@agliodbs.com> wrote:
>
> Also, the following patches currently still have bugs, but when the bugs are
> fixed I'll be looking for performance testers, so please either watch the
> wiki or watch this space:
>...
> -- posix_fadvise (Gregory Stark)

Eh? Quite possibly, but none that I'm aware of. The only problem is a
couple of trivial bits of bitrot. I'll post an update now if you
want.





--
greg

Re: Need help with 8.4 Performance Testing

From
Scott Carey
Date:
On Tuesday, December 09, 2008, Jean-David Beyer <jeandavid8@verizon.net> wrote:

> But one thing that can matter is whether you want to improve just the
> performance of the dbms, or the performance of the entire system, on which
> the dbms runs. Because if you want to improve the whole system, you would
> want as much of the caching to take place in the system's buffers so the use
> of the memory could be optimized over the entire workload, not just the load
> of the dbms itself. I suppose on a dedicated system with only one dbms
> running with only one database open (at a time, anyway), this might be moot,
> but not otherwise.

Yes, the OS is in a better position to arbitrate between multiple things.
Of course, we aren't talking about the highest-performance databases if we
are talking about mixed-use systems, though.
Additionally, the OS can never really get it right, with a DB or other
apps.  Any app can behave badly, grabbing too much RAM and accessing it
regularly enough that it is rarely 'idle', giving the OS VM fits as it
tries to figure out whether that memory is important relative to other
processes.

> Now I agree that it would be good to get the entire index (or at least the
> working set of the index) into the memory of the computer. But does it
> really matter if it is in the system's cache, or the postgres cache? Is it
> any more likely to be in postgres's cache than in the system cache if the
> system is hurting for memory? I would think the system would be equally
> likely to page out "idle" pages no matter where they are unless they are
> locked to memory, and I do not know if all operating systems can do this,
> and even if they can, I do not know if postgres uses that ability. I doubt
> it, since I believe (at least in Linux) a process can do that only if run as
> root, which I imagine few (if any) users do.

The problem is when none of them are really 'idle', and the OS has to
decide which pages, all of which have been accessed recently, to evict.
Most OSes will make bad choices if the load mixes random and sequential
access, as they treat all pages equally with respect to freshness versus
eviction.
Another problem is that there IS a difference between being in postgres's
cache and in the OS cache: one is significantly more expensive to retrieve
from than the other.

Accessing buffers in shared_buffers, in process, uses a good chunk less
CPU (and data-copy and shared-buffer eviction overhead) than going through
a syscall to the OS.

And as far as I can tell, even after the 8.4 fadvise patch, all I/O is in
block_size chunks (hopefully I am wrong).  My system is now CPU bound; the
I/O can do sequential reads of more than 1.2GB/sec, but Postgres can't do
a seqscan 30% as fast because it eats up CPU like crazy just reading and
identifying tuples.  It does seqscans ~25% faster from shared_buffers than
from the OS's page cache, though.  Seqscans peak between 250MB/sec and
400MB/sec, from memory or disk, typically showing no more than 35% iostat
utilization of the array if off disk -- so we run a few concurrently where
we can.

In addition to the fadvise patch, postgres needs to merge adjacent I/Os
into larger ones to reduce the overhead.  It only really needs to merge up
to sizes of about 128k or 256k to gain an 8x to 16x drop in syscall
overhead, and additionally could save trips down the shared-buffer
management code paths.  At least, that's my guess; I haven't looked at any
code and could be wrong.


Additionally, the "If your operating system has any reasonable caching
itself" comment earlier in this conversation -- Linux (2.6.18, CentOS 5.2)
does NOT.  I can easily make it spend 100% CPU in system time trying to
figure out what to do with the system cache for an hour.  Just do large
seqscans with memory pressure from work_mem or other forces that the OS
will not deem 'idle'.  Once the requested memory is ~75% of the system
total, it will freak out.  Linux simply will not give up that last 25% or
so of the RAM for anything but page cache, even though the disk subsystem
is very fast and most of the access is sequential, marginalizing the
benefit of the cache.  Depending on how you tune it, it will either spin
system CPU or swap-storm, but the system CPU spin times for the same
workload are a lot shorter than an equivalent swap storm.
Mount the data drive O_DIRECT and the problem vanishes.  I've been told
that this problem may be gone in some of the latest kernels.  I have
seriously considered bumping shared_buffers up a lot and mounting the
thing direct -- but then we lose the useful scheduler and readahead
algorithms.  The other way around (small shared_buffers, let the OS do it)
hurts performance overall quite a bit -- randomly accessed pages get
pushed out to the OS cache more often, and the OS tosses those out when a
big seqscan occurs, resulting in a lot more random access from disk and
more disk-bound periods of time.  Great wonder, this operating system
caching, eh?

In any event, don't hold up these OS page cache things as if they're the best thing in the world for a database; they have serious flaws themselves and typically are difficult or impossible to tune to be ideal for a database.

It's one thing to propose that a database build its own file system (hard, and why bother?) versus have a database manage its own page cache intelligently and access the OS file system as optimally as it can.  In the latter, the DB knows much more about what data is really important than the OS (it could, for example, prioritize cache versus work_mem intelligently, which the OS can get horribly wrong in my experience, and it knows when a huge seqscan occurs so it can make caching those results low priority).  No matter how you do it using the OS cache, you cache twice and copy twice.  O_DIRECT isn't usually an option for other reasons: the OS disk scheduler, readahead, and other benefits of a file system are real and substantial.  If you are caching twice, you might as well have the "closer" copy of that data be the larger, more efficient pool.

As for tipping points and pg_bench -- it doesn't seem to reflect the kind of workload we use postgres for at all, though my workload does a lot of big hashes and seqscans, and I'm curious how much improved those may be due to the hash improvements.  32GB RAM and 3TB data (about 250GB scanned regularly) here.  And yes, we are almost completely CPU bound now except for a few tasks.  Iostat only reports above 65% disk utilization for about 5% of the workload duty-cycle, and is regularly < 20%.  COPY doesn't get anywhere near platter speeds on indexless bulk transfer.  The highest disk usage spikes occur when some of our random-access data/indexes get shoved out of cache.  These aren't too large, but high enough seqscan load will cause postgres and the OS to dump them from cache.  If we put these on some SSDs the disk utilization % would drop a lot further.

I feel confident in saying that in about a year, I could spec out a medium sized budget for hardware ($25k) for almost any postgres setup and make it almost pure CPU bound.
SSDs and hybrid tech such as ZFS L2ARC make this possible with easy access to 10k+ iops, and it will take no more than 12 SATA drives in raid 10 next year (and a good controller or software raid) to get 1GB/sec sequential reads.

Re: Need help with 8.4 Performance Testing

From
"Scott Marlowe"
Date:
On Tue, Dec 9, 2008 at 9:37 AM, Scott Carey <scott@richrelevance.com> wrote:

> As for tipping points and pg_bench -- it doesn't seem to reflect the kind of workload we use postgres for at all,
> though my workload does a lot of big hashes and seqscans, and I'm curious how much improved those may be due to the
> hash improvements.  32GB RAM and 3TB data (about 250GB scanned regularly) here.  And yes, we are almost completely
> CPU bound now except for a few tasks.  Iostat only reports above 65% disk utilization for about 5% of the workload
> duty-cycle, and is regularly < 20%.  COPY doesn't get anywhere near platter speeds on indexless bulk transfer.  The
> highest disk usage spikes occur when some of our random-access data/indexes get shoved out of cache.  These aren't
> too large, but high enough seqscan load will cause postgres and the OS to dump them from cache.  If we put these on
> some SSDs the disk utilization % would drop a lot further.

It definitely reflects our usage pattern, which is very random and
involves tiny bits of data scattered throughout the database.  Our
current database is about 20-25 Gig, which means it's quickly reaching
the point where it will not fit in our 32G of ram, and it's likely to
grow too big for 64Gig before a year or two is out.

> I feel confident in saying that in about a year, I could spec out a medium sized budget for hardware ($25k) for
> almost any postgres setup and make it almost pure CPU bound.
> SSDs and hybrid tech such as ZFS L2ARC make this possible with easy access to 10k+ iops, and it will take no more
> than 12 SATA drives in raid 10 next year (and a good controller or software raid) to get 1GB/sec sequential reads.

Lucky you, having needs that are fulfilled by sequential reads.  :)

I wonder how many hard drives it would take to be CPU bound on random
access patterns?  About 40 to 60?  And probably 15k / SAS drives to
boot.  Cause that's what we're looking at in the next few years where
I work.

Re: Need help with 8.4 Performance Testing

From
"Joshua D. Drake"
Date:
On Tue, 2008-12-09 at 10:21 -0700, Scott Marlowe wrote:
> On Tue, Dec 9, 2008 at 9:37 AM, Scott Carey <scott@richrelevance.com> wrote:

> Lucky you, having needs that are fulfilled by sequential reads.  :)
>
> I wonder how many hard drives it would take to be CPU bound on random
> access patterns?  About 40 to 60?  And probably 15k / SAS drives to
> boot.  Cause that's what we're looking at in the next few years where
> I work.

I was able to achieve only 10-20% IO/Wait even after beating the heck
out of the machine with 50 spindles (of course it does have 16 CPUs):


http://www.commandprompt.com/blogs/joshua_drake/2008/04/is_that_performance_i_smell_ext2_vs_ext3_on_50_spindles_testing_for_postgresql/


>
--
PostgreSQL
   Consulting, Development, Support, Training
   503-667-4564 - http://www.commandprompt.com/
   The PostgreSQL Company, serving since 1997


Re: Need help with 8.4 Performance Testing

From
Scott Carey
Date:
> Lucky you, having needs that are fulfilled by sequential reads.  :)

> I wonder how many hard drives it would take to be CPU bound on random
> access patterns?  About 40 to 60?  And probably 15k / SAS drives to
> boot.  Cause that's what we're looking at in the next few years where
> I work.

About $3000 worth of Intel --- mainstream SSD's = 240GB space (6 in raid 10) today, 2x to 3x that storage area in 1
year.

Random reads are even easier, provided you don't need more than 500GB or so.

And with something like ZFS + L2ARC you can back your data with large slow iops disks and have cache access to data without requiring mirrors on the cache ($3k of ssds for that covers 2x the area, then).

Re: Need help with 8.4 Performance Testing

From
Matthew Wakeling
Date:
On Tue, 9 Dec 2008, Scott Marlowe wrote:
> I wonder how many hard drives it would take to be CPU bound on random
> access patterns?  About 40 to 60?  And probably 15k / SAS drives to
> boot.  Cause that's what we're looking at in the next few years where
> I work.

There's a problem with that thinking. That is, in order to exercise many
spindles, you will need to have just as many (if not more) concurrent
requests. And if you have many concurrent requests, then you can spread
them over multiple CPUs. So it's more a case of "How many hard drives PER
CPU". It also becomes a matter of whether Postgres can scale that well.

Matthew

--
 Those who do not understand Unix are condemned to reinvent it, poorly.
                -- Henry Spencer

Re: Need help with 8.4 Performance Testing

From
"Scott Marlowe"
Date:
On Tue, Dec 9, 2008 at 10:35 AM, Matthew Wakeling <matthew@flymine.org> wrote:
> On Tue, 9 Dec 2008, Scott Marlowe wrote:
>>
>> I wonder how many hard drives it would take to be CPU bound on random
>> access patterns?  About 40 to 60?  And probably 15k / SAS drives to
>> boot.  Cause that's what we're looking at in the next few years where
>> I work.
>
> There's a problem with that thinking. That is, in order to exercise many
> spindles, you will need to have just as many (if not more) concurrent
> requests. And if you have many concurrent requests, then you can spread them
> over multiple CPUs. So it's more a case of "How many hard drives PER CPU".
> It also becomes a matter of whether Postgres can scale that well.

For us, all that is true.  We typically have a dozen or more
concurrent requests running at once.  We'll likely see that increase
linearly with our increase in users over the next year or so.  We
bought the machines with dual quad core opterons knowing the 6,8 and
12 core opterons were due out on the same socket design in the next
year or so and we could upgrade those too if needed.  PostgreSQL seems
to scale well in most tests I've seen to at least 16 cores, and after
that it's anyone's guess.  The Sparc Niagra seems capable of scaling
to 32 threads on 8 cores with pgsql 8.2 quite well.

I worry about the linux kernel scaling that well, and we might have to
look at open solaris or something like the solaris kernel under ubuntu
distro to get better scaling.

Re: Need help with 8.4 Performance Testing

From
Scott Carey
Date:
Let me re-phrase this.

For today, at 200GB or less of required space, and 500GB or less next year.

“Where we’re going, we don’t NEED spindles.”


Seriously, go down to the store and get 6 X25-M’s, they’re as cheap as $550 each and will be sub $500 soon.  These are more than sufficient for all but heavy write workloads (each can withstand ~600+ TB of writes in a lifetime, and SMART will tell you before they go).  6 in a RAID 10 will give you 750MB/sec read, and equivalent MB/sec in random reads.  I’ve tested them.  Random writes are very very fast too, faster than any SAS drive.
Put this in your current system, and you won’t need to upgrade the RAM unless you need items in cache to reduce CPU load or need it for the work_mem space.

Spindles will soon be only for capacity and sequential access performance requirements.  Solid state will be for IOPS, and I would argue that for most Postgres installations, already is (now that the Intel SSD drive, which does random writes and read/write concurrency well, has arrived — more such next gen drives are on the way).


On 12/9/08 9:28 AM, "Scott Carey" <scott@richrelevance.com> wrote:

> Lucky you, having needs that are fulfilled by sequential reads.  :)

> I wonder how many hard drives it would take to be CPU bound on random
> access patterns?  About 40 to 60?  And probably 15k / SAS drives to
> boot.  Cause that's what we're looking at in the next few years where
> I work.

About $3000 worth of Intel --- mainstream SSD's = 240GB space (6 in raid 10) today, 2x to 3x that storage area in 1 year.

Random reads are even easier, provided you don't need more than 500GB or so.

And with something like ZFS + L2ARC you can back your data with large slow iops disks and have cache access to data without requiring mirrors on the cache ($3k of ssds for that covers 2x the area, then).


Re: Need help with 8.4 Performance Testing

From
"Scott Marlowe"
Date:
On Tue, Dec 9, 2008 at 11:01 AM, Scott Carey <scott@richrelevance.com> wrote:
> Let me re-phrase this.
>
> For today, at 200GB or less of required space, and 500GB or less next year.
>
> "Where we're going, we don't NEED spindles."

Those intel SSDs sound compelling.  I've been waiting for SSDs to get
competitive price and performance wise for a while, and when the
intels came out and I read the first benchmarks I immediately began
scheming.  Sadly, that was right after we'd ordered our new 16 drive
servers, and I didn't have time to try something new and hope it would
work.  Now that the servers are up and running, we'll probably look at
adding the SSDs next summer before our high load period begins.

Re: Need help with 8.4 Performance Testing

From
"Joshua D. Drake"
Date:
On Tue, 2008-12-09 at 11:08 -0700, Scott Marlowe wrote:
> On Tue, Dec 9, 2008 at 11:01 AM, Scott Carey <scott@richrelevance.com> wrote:
> > Let me re-phrase this.
> >
> > For today, at 200GB or less of required space, and 500GB or less next year.
> >
> > "Where we're going, we don't NEED spindles."
>
> Those intel SSDs sound compelling.  I've been waiting for SSDs to get
> competitive price and performance wise for a while, and when the
> intels came out and I read the first benchmarks I immediately began
> scheming.  Sadly, that was right after we'd ordered our new 16 drive
> servers, and I didn't have time to try something new and hope it would
> work.  Now that the servers are up and running, we'll probably look at
> adding the SSDs next summer before our high load period begins.
>

The idea of SSDs is interesting. However I think I will wait for all the
other early adopters to figure out the problems before I start
suggesting them to clients.

Hard drives work, they're cheap and fast. I can get 25 spindles, 15k in a
3U with controller and battery backed cache for <$10k.

Joshua D. Drake



--
PostgreSQL
   Consulting, Development, Support, Training
   503-667-4564 - http://www.commandprompt.com/
   The PostgreSQL Company, serving since 1997


Re: Need help with 8.4 Performance Testing

From
"Merlin Moncure"
Date:
On Tue, Dec 9, 2008 at 1:11 PM, Joshua D. Drake <jd@commandprompt.com> wrote:
> Hard drives work, they're cheap and fast. I can get 25 spindles, 15k in a
> 3U with controller and battery backed cache for <$10k.

While I agree with your general sentiments about early adoption, etc
(the intel ssd products are the first flash drives that appear to have
real promise in the enterprise), the numbers tell a different story.
A *single* X25-E will give similar sustained write IOPS as your tray
for far less price and a much better worst case read latency.  All
this without the 25 sets of whizzing ball bearings, painful spin-up
times, fanning, RAID controller firmware, and various other sundry
technologies to make the whole thing work.

The main issue that I see with flash SSD is if the promised wear
lifetimes are believable in high load environments and the mechanism
of failure (slowly degrade into read only) is accurate.

So, at least in relative terms, 15k sas drives are not 'fast'. They
are terribly, awfully, painfully slow.  They are also not cheap in
terms of $/IOPS.  The end is near.

merlin

Re: Need help with 8.4 Performance Testing

From
"Joshua D. Drake"
Date:
On Tue, 2008-12-09 at 15:07 -0500, Merlin Moncure wrote:
> On Tue, Dec 9, 2008 at 1:11 PM, Joshua D. Drake <jd@commandprompt.com> wrote:
> > Hard drives work, they're cheap and fast. I can get 25 spindles, 15k in a
> > 3U with controller and battery backed cache for <$10k.
>
> While I agree with your general sentiments about early adoption, etc
> (the intel ssd products are the first flash drives that appear to have
> real promise in the enterprise), the numbers tell a different story.

Oh I have read about them and I am excited. I am just saying that there
are plenty of people who can take advantage of the unknown without the
worry of the pain that can cause. My client, can't.

>
> The main issue that I see with flash SSD is if the promised wear
> lifetimes are believable in high load environments and the mechanism
> of failure (slowly degrade into read only) is accurate.
>

Right.

> So, at least in relative terms, 15k sas drives are not 'fast'. They
> are terribly, awfully, painfully slow.  They are also not cheap in
> terms of $/IOPS.  The end is near.
>

No doubt about it. I give it 24 months tops.

Joshua D. Drake


> merlin
>
--
PostgreSQL
   Consulting, Development, Support, Training
   503-667-4564 - http://www.commandprompt.com/
   The PostgreSQL Company, serving since 1997


Re: Need help with 8.4 Performance Testing

From
Scott Carey
Date:
Which brings this back around to the point I care the most about:

I/O per second will diminish as the most common database performance limiting factor in Postgres 8.4’s lifetime, and become almost irrelevant in 8.5’s.
Becoming more CPU efficient will become very important, and for some, already is.  The community needs to be proactive on this front.  
This turns a lot of old assumptions on their head, from the database down through the OS and filesystem.  We’re bound to run into many surprises due to this major shift in something that has had its performance characteristics taken for granted for decades.

> On 12/9/08 12:20 PM, "Joshua D. Drake" <jd@commandprompt.com> wrote:

> > So, at least in relative terms, 15k sas drives are not 'fast'. They
> > are terribly, awfully, painfully slow.  They are also not cheap in
> > terms of $/IOPS.  The end is near.
> >

> No doubt about it. I give it 24 months tops.
>
> Joshua D. Drake


Re: Need help with 8.4 Performance Testing

From
Tom Lane
Date:
Scott Carey <scott@richrelevance.com> writes:
> Which brings this back around to the point I care the most about:
> I/O per second will diminish as the most common database performance limiting factor in Postgres 8.4's lifetime, and
> become almost irrelevant in 8.5's.
> Becoming more CPU efficient will become very important, and for some, already is.  The community needs to be
> proactive on this front.
> This turns a lot of old assumptions on their head, from the database down through the OS and filesystem.  We're bound
> to run into many surprises due to this major shift in something that has had its performance characteristics taken
> for granted for decades.

Hmm ... I wonder whether this means that the current work on
parallelizing I/O (the posix_fadvise patch in particular) is a dead
end.  Because what that is basically going to do is expend more CPU
to improve I/O efficiency.  If you believe this thesis then that's
not the road we want to go down.

            regards, tom lane

Re: Need help with 8.4 Performance Testing

From
"Robert Haas"
Date:
> Hmm ... I wonder whether this means that the current work on
> parallelizing I/O (the posix_fadvise patch in particular) is a dead
> end.  Because what that is basically going to do is expend more CPU
> to improve I/O efficiency.  If you believe this thesis then that's
> not the road we want to go down.

I don't believe the thesis.  The gap between disk speeds and memory
speeds may narrow over time, but I doubt it's likely to disappear
altogether any time soon, and certainly not for all users.

Besides which, I believe the CPU overhead of that patch is pretty darn
small when the feature is not enabled.

...Robert

Re: Need help with 8.4 Performance Testing

From
"Joshua D. Drake"
Date:
On Tue, 2008-12-09 at 17:38 -0500, Tom Lane wrote:
> Scott Carey <scott@richrelevance.com> writes:
> > Which brings this back around to the point I care the most about:
> > I/O per second will diminish as the most common database performance limiting factor in Postgres 8.4's lifetime,
andbecome almost irrelevant in 8.5's. 
> > Becoming more CPU efficient will become very important, and for some, already is.  The community needs to be
proactiveon this front. 
> > This turns a lot of old assumptions on their head, from the database down through the OS and filesystem.  We're
boundto run into many surprises due to this major shift in something that has had its performance characteristics taken
forgranted for decades. 
>
> Hmm ... I wonder whether this means that the current work on
> parallelizing I/O (the posix_fadvise patch in particular) is a dead
> end.  Because what that is basically going to do is expend more CPU
> to improve I/O efficiency.  If you believe this thesis then that's
> not the road we want to go down.

The per cpu performance increase against the per I/O system increase
line is going to be vastly different. Anything that reduces overall I/O
is going to help (remember, you can never go too fast).

The idea that somehow I/O per second will diminish as the most common
database performance factor is IMO a pipe dream. Even as good as SSDs
are getting, they still have to go through the bus. Something CPUs are
better at (especially those CPUs that connect to memory directly without
the bus).

In 5 years maybe, in the next two postgresql releases, not likely. Not
to mention all of this is around the idea of a different class of
hardware than 99% of our community will be running.


Sincerely,

Joshua D. Drake


>
>             regards, tom lane
>
--
PostgreSQL
   Consulting, Development, Support, Training
   503-667-4564 - http://www.commandprompt.com/
   The PostgreSQL Company, serving since 1997


Re: Need help with 8.4 Performance Testing

From
Ben Chobot
Date:
On Tue, 9 Dec 2008, Robert Haas wrote:

> I don't believe the thesis.  The gap between disk speeds and memory
> speeds may narrow over time, but I doubt it's likely to disappear
> altogether any time soon, and certainly not for all users.

I think the "not for all users" is the critical part. In 2 years, we may
(or may not) start using SSD instead of traditional drives for new
installs, but we certainly won't be throwing out our existing servers any
time soon just because something (much) better is now available.

Re: Need help with 8.4 Performance Testing

From
justin
Date:
Tom Lane wrote:
Scott Carey <scott@richrelevance.com> writes: 
Which brings this back around to the point I care the most about:
I/O per second will diminish as the most common database performance limiting factor in Postgres 8.4's lifetime, and become almost irrelevant in 8.5's.
Becoming more CPU efficient will become very important, and for some, already is.  The community needs to be proactive on this front.
This turns a lot of old assumptions on their head, from the database down through the OS and filesystem.  We're bound to run into many surprises due to this major shift in something that has had its performance characteristics taken for granted for decades.   
Hmm ... I wonder whether this means that the current work on
parallelizing I/O (the posix_fadvise patch in particular) is a dead
end.  Because what that is basically going to do is expend more CPU
to improve I/O efficiency.  If you believe this thesis then that's
not the road we want to go down.
		regards, tom lane 

What does the CPU/Memory/Bus performance road map look like?

Is the IO performance of storage devices, whatever they may be, going to be on par with the above and cause this problem?

Once IO performance numbers start jumping up, I think DBAs will be tempted to leave more and more data in the production database instead of moving it out, or to start consolidating databases onto fewer servers, again pushing more load onto the IO.

Re: Need help with 8.4 Performance Testing

From
Scott Carey
Date:
Prefetch CPU cost should be rather low in the grand scheme of things, and does help performance even for very fast I/O.  I would not expect a very large CPU use increase from that sort of patch in the grand scheme of things — there is a lot that is more expensive to do on a per block basis.

There are two ways to look at non-I/O bound performance:
* Aggregate performance across many concurrent activities — here you want the least CPU used possible per action, and the least collisions on locks or shared data structures.  Using resources for as short of an interval as possible also helps a lot here.
* Single query performance, where you want to shorten the query time, perhaps at the cost of more average CPU.  Here, something like the fadvise stuff helps — as would any thread parallelism.  Perhaps less efficient in aggregate, but more efficient for a single query.

Overall CPU cost of accessing and reading data.  If this comes from disk, the big gains will be along the whole chain:  Driver to file system cache, file system cache to process, process specific tasks (cache eviction, placement, tracking), examining page tuples, locating tuples within pages, etc.   Anything that currently occurs on a per-block basis that could be done in a larger batch or set of blocks may be a big gain.  Another place that commonly consumes CPU in larger software projects is memory allocation if more advanced allocation techniques are not used.  I have no idea what Postgres uses here however.  I do know that commercial databases have extensive work in this area for performance, as well as reliability (harder to cause a leak, or easier to detect) and ease of use (don’t have to even bother to free in certain contexts).  
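For reference, the prefetch being discussed rests on the POSIX fadvise hint. A minimal Python sketch of the idea (assuming os.posix_fadvise is available, i.e. a Linux-ish platform; this is an illustration of the hint, not the actual Postgres patch) looks like:

```python
import os
import tempfile

BLCKSZ = 8192  # Postgres-style 8k block

# Scratch file standing in for a relation segment: 64 blocks of 8k.
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(os.urandom(64 * BLCKSZ))
    path = f.name

upcoming = [3, 17, 42, 9]  # block numbers a scan expects to need soon

fd = os.open(path, os.O_RDONLY)
# Hint each future block to the kernel so it can issue the reads in
# parallel; posix_fadvise(WILLNEED) is advisory and returns immediately.
if hasattr(os, "posix_fadvise"):  # not present on every platform
    for blk in upcoming:
        os.posix_fadvise(fd, blk * BLCKSZ, BLCKSZ, os.POSIX_FADV_WILLNEED)

# Later, the synchronous reads ideally find the pages already in cache.
pages = [os.pread(fd, BLCKSZ, blk * BLCKSZ) for blk in upcoming]
os.close(fd)
os.unlink(path)
assert all(len(p) == BLCKSZ for p in pages)
```

The win comes from overlapping several in-flight random I/Os instead of paying one full seek latency per block, which is exactly the case where low-latency SSDs make the hint matter less.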

> On 12/9/08 2:58 PM, "Robert Haas" <robertmhaas@gmail.com> wrote:

> I don't believe the thesis.  The gap between disk speeds and memory
> speeds may narrow over time, but I doubt it's likely to disappear
> altogether any time soon, and certainly not for all users.

Well, when select count(1) reads pages slower than my disk, it's 16x+ slower than my RAM.  Until one can demonstrate that the system can even read pages in RAM faster than what disks will do next year, it doesn't matter much that RAM is faster.   It does matter that RAM is faster for sorts, hashes, and other operations, but at the current time it does not for the raw pages themselves, from what I can measure.

This is in fact, central to my point.  Things will be CPU bound, not I/O bound.  It is mentioned that we still have to access things over the bus, and memory is faster, etc.  But Postgres is too CPU bound on page access to take advantage of the fact that memory is faster (for reading data pages).

The biggest change is not just that disks are getting closer to RAM, but that the random I/O penalty is diminishing significantly.  Low latencies makes seek-driven queries that used to consume mostly disk time consume CPU time instead.  High CPU costs for accessing pages makes a fast disk surprisingly close to RAM speed.

> Besides which, I believe the CPU overhead of that patch is pretty darn
> small when the feature is not enabled.

> ...Robert

I doubt it is much CPU, on or off.  It will help with SSD’s when optimizing a single query, it may not help much if a system has enough ‘natural’ parallelism from other concurrent queries.  However there is a clear CPU benefit for getting individual queries out of the way faster, and occupying precious work_mem or other resources for a shorter time.  Occupying resources for a shorter period always translates to some CPU savings on a machine running at its limit with high concurrency.

Re: Need help with 8.4 Performance Testing

From
Gregory Stark
Date:
Scott Carey <scott@richrelevance.com> writes:

> And as far as I can tell, even after the 8.4 fadvise patch, all I/O is in
> block_size chunks. (hopefully I am wrong)
>...
> In addition to the fadvise patch, postgres needs to merge adjacent I/O's
> into larger ones to reduce the overhead. It only really needs to merge up to
> sizes of about 128k or 256k, and gain a 8x to 16x drop in syscall overhead,
> and additionally potentially save code trips down the shared buffer
> management code paths. At least, that's my guess; I haven't looked at any code
> and could be wrong.

There are a lot of assumptions here that I would be interested in seeing
experiments to back up.

FWIW when I was doing testing of posix_fadvise I did a *lot* of experiments
though only with a couple systems. One had a 3-drive array and one with a
15-drive array, both running Linux. I sometimes could speed up the sequential
scan by about 10% but not consistently. It was never more than about 15% shy
of the highest throughput from dd. And incidentally the throughput from dd
didn't seem to depend much at all on the blocksize.

On your system, do "dd bs=8k" and "dd bs=128k" really show an 8x performance
difference?
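For anyone who wants to try that comparison without dd, here is a rough Python equivalent. (Hedged: on a small freshly-written file the page cache dominates, so the ratio mostly reflects per-syscall overhead rather than disk throughput.)

```python
import os
import tempfile
import time

# 64 MiB scratch file to read back at two request sizes.
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"\0" * (64 << 20))
    path = f.name

def timed_read(bs):
    """Read the whole file in bs-sized unbuffered chunks; return (bytes, secs)."""
    t0 = time.perf_counter()
    total = 0
    with open(path, "rb", buffering=0) as f:
        while chunk := f.read(bs):
            total += len(chunk)
    return total, time.perf_counter() - t0

n8, t8 = timed_read(8 * 1024)      # "dd bs=8k"
n128, t128 = timed_read(128 * 1024)  # "dd bs=128k"
assert n8 == n128 == 64 << 20
print(f"8k: {t8:.4f}s  128k: {t128:.4f}s  ratio: {t8 / t128:.1f}x")
os.unlink(path)
```

Whether the ratio is anywhere near 8x on a given box is exactly the empirical question Greg is raising.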

In short, at least from the evidence available, this all seems like it might
be holdover beliefs from the olden days of sysadmining where syscalls were
much slower and OS filesystem caches much dumber.

I'm still interested in looking into it but I'll have to see actual vmstat or
iostat output while it's happening, preferably some oprofile results too. And
how many drives do you actually need to get into this situation. Also, what is
the output of "vacuum verbose" on the table?


> Additionally, the "If your operating system has any reasonable caching
> itself" comment earlier in this conversation --- Linux (2.6.18, Centos 5.2)
> does NOT. I can easily make it spend 100% CPU in system time trying to
> figure out what to do with the system cache for an hour. Just do large
> seqscans with memory pressure from work_mem or other forces that the OS will
> not deem 'idle'. Once the requested memory is ~75% of the system total, it
> will freak out. Linux simply will not give up that last 25% or so of the RAM
> for anything but page cache

This seems like just a misconfigured system. Linux and most Unixen definitely
expect to have a substantial portion of RAM dedicated to disk cache. Keep in
mind all your executable pages count towards this page cache too. You can
adjust this to some extent with the "swappiness" variable in Linux -- but I
doubt you'll be happy with the results regardless.

> The other way around (small shared_buffers, let the OS do it) hurts
> performance overall quite a bit -- randomly accessed pages get pushed out to
> the OS cache more often, and the OS tosses thouse out when a big seqscan
> occurs, resulting in a lot more random access from disk and more disk bound
> periods of time. Great wonder, this operating system caching, eh?

How do you observe this?


--
  Gregory Stark
  EnterpriseDB          http://www.enterprisedb.com
  Get trained by Bruce Momjian - ask me about EnterpriseDB's PostgreSQL training!

Re: Need help with 8.4 Performance Testing

From
Gregory Stark
Date:
Matthew Wakeling <matthew@flymine.org> writes:

> On Tue, 9 Dec 2008, Scott Marlowe wrote:
>> I wonder how many hard drives it would take to be CPU bound on random
>> access patterns?  About 40 to 60?  And probably 15k / SAS drives to
>> boot.  Cause that's what we're looking at in the next few years where
>> I work.
>
> There's a problem with that thinking. That is, in order to exercise many
> spindles, you will need to have just as many (if not more) concurrent requests.
> And if you have many concurrent requests, then you can spread them over
> multiple CPUs. So it's more a case of "How many hard drives PER CPU". It also
> becomes a matter of whether Postgres can scale that well.

Well:

$ units
2445 units, 71 prefixes, 33 nonlinear units
You have: 8192 byte/5ms
You want: MB/s
    * 1.6384
    / 0.61035156

At 1.6MB/s per drive, if Postgres is cpu-bound doing sequential scans at
1GB/s, you'll need about 640 drives to keep one cpu satisfied doing random I/O
-- assuming you have perfect read-ahead and the read-ahead itself doesn't add
cpu overhead. Both of which are false of course, but at least in theory that's
what it'll take.
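The same arithmetic, spelled out (assuming 8k blocks and ~5 ms per random I/O, as in the units session above; the drive count lands a bit under the round "about 640"):

```python
# One spindle doing pure random I/O: one 8k page per ~5 ms access.
page = 8192              # bytes per random read (one Postgres block)
seek = 0.005             # seconds per random I/O
per_drive = page / seek  # random-I/O bandwidth of one drive, bytes/sec
print(per_drive / 1e6)   # -> 1.6384 (MB/s, matching the units(1) output)

# How many such drives to feed one CPU that saturates at 1 GB/s seqscan?
cpu_bound = 1e9
drives = cpu_bound / per_drive
print(round(drives))     # -> 610, i.e. on the order of the ~640 quoted
```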


--
  Gregory Stark
  EnterpriseDB          http://www.enterprisedb.com
  Ask me about EnterpriseDB's On-Demand Production Tuning

Re: Need help with 8.4 Performance Testing

From
Ron Mayer
Date:
Tom Lane wrote:
> Scott Carey <scott@richrelevance.com> writes:
>> Which brings this back around to the point I care the most about:
>> I/O per second will diminish as the most common database performance limiting factor in Postgres 8.4's lifetime, and
becomealmost irrelevant in 8.5's. 
>> Becoming more CPU efficient will become very important, and for some, already is.  The community needs to be
proactiveon this front. 
>> This turns a lot of old assumptions on their head, from the database down through the OS and filesystem.  We're
boundto run into many surprises due to this major shift in something that has had its performance characteristics taken
forgranted for decades. 
>
> Hmm ... I wonder whether this means that the current work on
> parallelizing I/O (the posix_fadvise patch in particular) is a dead
> end.  Because what that is basically going to do is expend more CPU
> to improve I/O efficiency.  If you believe this thesis then that's
> not the road we want to go down.
>

I imagine the larger postgres installations will still benefit from
this patch - because I imagine they will stay on hard disks for
quite some time; simply because the cost of 70TB of disks seems like
it'll be lower than RAM for at least the intermediate term.

I imagine the smaller postgres installations will also still benefit
from this patch - because my postgres installations with the most
painful I/O bottlenecks are small virtual machines without dedicated
drives where I/O (I guess emulated by the virtual machine software)
is very painful.

Perhaps there's a mid-sized system that won't benefit from fadvise()
in the intermediate term -- where the size of the database is about
the same size as a cost-effective flash drive -- but I don't have
any databases in that range now.


Re: Need help with 8.4 Performance Testing

From
Craig James
Date:
justin wrote:
> Tom Lane wrote:
>> Hmm ... I wonder whether this means that the current work on
>> parallelizing I/O (the posix_fadvise patch in particular) is a dead
>> end.  Because what that is basically going to do is expend more CPU
>> to improve I/O efficiency.  If you believe this thesis then that's
>> not the road we want to go down.
>>
>>             regards, tom lane
>>
> What does the CPU/ Memory/Bus performance road map look like?
>
> Is the IO performance for storage device for what ever it be, going to
> be on par with the above to cause this problem?

Flash memory will become just a fourth layer in the memory caching system (on-board CPU, high-speed secondary cache, main memory, and persistent memory).  The idea of external storage will probably disappear altogether -- computers will just have memory, and won't forget anything when you turn them off.  Since most computers are 64 bits these days, all data and programs will just hang out in memory at all times, and be directly addressable by the CPU.

The distinction between disk and memory arose from the fact that disks were large, slow devices relative to "core" memory and had to be connected by long wires, hence the need for I/O subsystems.  As flash memory becomes mainstream, I expect this distinction to disappear.

Craig

Re: Need help with 8.4 Performance Testing

From
"Robert Haas"
Date:
> Well, when select count(1) reads pages slower than my disk, its 16x + slower
> than my RAM.  Until one can demonstrate that the system can even read pages
> in RAM faster than what disks will do next year, it doesn't matter much that
> RAM is faster.   It does matter that RAM is faster for sorts, hashes, and
> other operations, but at the current time it does not for the raw pages
> themselves, from what I can measure.
>
> This is in fact, central to my point.  Things will be CPU bound, not I/O
> bound.  It is mentioned that we still have to access things over the bus,
> and memory is faster, etc.  But Postgres is too CPU bound on page access to
> take advantage of the fact that memory is faster (for reading data pages).

As I understand it, a big part of the reason for the posix_fadvise
patch is that the current system doesn't do a good job leveraging many
spindles in the service of a single query.  So the problem is not that
the CPU overhead is too large in some general sense but that the disk
and CPU operations get serialized, leading to an overall loss of
performance.  On the other hand, there are certainly cases (such as a
database which is entirely in RAM, or all the commonly used parts are
in RAM) where there really isn't very much I/O, and in those cases of
course the CPU cost will dominate.
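The serialization Robert describes can be sketched in a few lines.  This is only a rough illustration of the fadvise idea, not the actual patch: the block numbers are hypothetical, and Python's os.posix_fadvise is Linux/POSIX-only.

```python
import os

BLOCK = 8192  # one Postgres-sized page per random read (illustrative)

def read_blocks(path, block_numbers):
    """Read the given (hypothetical) block numbers, hinting the kernel about
    all of them first so it can keep several spindles busy at once."""
    with open(path, "rb") as f:
        fd = f.fileno()
        # Announce every upcoming read before performing any of them; without
        # this, seek/read/process serializes and only one disk works at a time.
        if hasattr(os, "posix_fadvise"):  # not available on all platforms
            for b in block_numbers:
                os.posix_fadvise(fd, b * BLOCK, BLOCK, os.POSIX_FADV_WILLNEED)
        out = []
        for b in block_numbers:
            f.seek(b * BLOCK)
            out.append(f.read(BLOCK))
        return out
```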

...Robert

Re: Need help with 8.4 Performance Testing

From
Scott Carey
Date:
Just to clarify, I’m not talking about random I/O bound loads today, on hard drives, targeted by the fadvise stuff — these aren’t CPU bound, and they will be helped by it.

For sequential scans, this situation is different, since the OS has sufficient read-ahead prefetching algorithms of its own for sequential reads, and the CPU work and I/O work ends up happening in parallel due to that.

For what it is worth, you can roughly double to triple the iops of an Intel X-25M on pure random reads if you queue up multiple concurrent reads rather than serialize them.  But it is not due to spindles, it is due to the latency of the SATA interface and the ability of the controller chip to issue reads to flash devices on different banks concurrently to some extent.  
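A sketch of that queuing effect (illustrative only, not a benchmark; os.pread and the thread counts are my choices, not anything from the X-25M spec):

```python
import os
from concurrent.futures import ThreadPoolExecutor

READ_SIZE = 4096

def random_reads(path, offsets, concurrency=4):
    """Issue positional reads from several threads at once.  With
    concurrency > 1, multiple requests are outstanding simultaneously,
    which is what lets the controller hit different flash banks (or
    spindles) in parallel instead of paying each latency serially."""
    fd = os.open(path, os.O_RDONLY)  # os.pread needs no shared seek pointer
    try:
        with ThreadPoolExecutor(max_workers=concurrency) as pool:
            return list(pool.map(lambda off: os.pread(fd, READ_SIZE, off), offsets))
    finally:
        os.close(fd)
```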


On 12/9/08 7:06 PM, "Robert Haas" <robertmhaas@gmail.com> wrote:

> Well, when select count(1) reads pages slower than my disk, its 16x + slower
> than my RAM.  Until one can demonstrate that the system can even read pages
> in RAM faster than what disks will do next year, it doesn't matter much that
> RAM is faster.   It does matter that RAM is faster for sorts, hashes, and
> other operations, but at the current time it does not for the raw pages
> themselves, from what I can measure.
>
> This is in fact, central to my point.  Things will be CPU bound, not I/O
> bound.  It is mentioned that we still have to access things over the bus,
> and memory is faster, etc.  But Postgres is too CPU bound on page access to
> take advantage of the fact that memory is faster (for reading data pages).

> As I understand it, a big part of the reason for the posix_fadvise
> patch is that the current system doesn't do a good job leveraging many
> spindles in the service of a single query.  So the problem is not that
> the CPU overhead is too large in some general sense but that the disk
> and CPU operations get serialized, leading to an overall loss of
> performance.  On the other hand, there are certainly cases (such as a
> database which is entirely in RAM, or all the commonly used parts are
> in RAM) where there really isn't very much I/O, and in those cases of
> course the CPU cost will dominate.
>
> ...Robert

Re: Need help with 8.4 Performance Testing

From
Scott Carey
Date:
I did some further tests that alter some of my statements below.

Mainly:

* I can do select count(1) queries at closer to disk speeds than my older tests (on a different machine) indicated.  I can get ~800MB/sec where the disks can do 1200MB/sec and other single-process tasks can go 1100MB/sec.  It is mostly CPU bound during these tests.  Accessing from within shared_buffers rather than from the OS cache is about 20% faster, but that is hard to measure consistently (hard for me to be certain it's in that cache).  Slightly more complicated scan queries (simple group-by aggregates, returning < 1000 rows) slow down to ~1/3 disk speed, depending on various factors that are not worth getting into.  So, postgres is less CPU bound currently than I thought.
* The linux caching 100% system CPU spin issue seems related to something pinning memory on my machine.  Not sure what it is yet; the only thing running on the machine is postgres, and the size of the pinned memory is roughly equal to shared_buffers.
* I have some tests that conflict with prior results, which seem to depend on whether postgres is on or off when I run the disk benchmark.  This may be related to the pinned memory above.  Specifically, larger block size reads reduce overall CPU usage more if there isn't anything else running on the system than if there is — attributable to more time spent in kswapd when postgres is on (and mostly idle).  Further investigation is needed here.

On 12/9/08 8:37 AM, "Scott Carey" <scott@richrelevance.com> wrote:

> ________________________________________
> From: pgsql-performance-owner@postgresql.org [pgsql-performance-owner@postgresql.org] On Behalf Of > Jean-David Beyer [jeandavid8@verizon.net]
> Sent: Tuesday, December 09, 2008 5:08 AM
> To: pgsql-performance@postgresql.org
> Subject: Re: [PERFORM] Need help with 8.4 Performance Testing

> But one thing that can matter is whether you want to improve just the
> performance of the dbms, or the performance of the entire system, on which
> the dbms runs. Because if you want to improve the whole system, you would
> want as much of the caching to take place in the system's buffers so the use
> of the memory could be optimized over the entire workload, not just the load
> of the dbms itself. I suppose on a dedicated system with only one dbms
> running with only one database open (at a time, anyway), this might be moot,
> but not otherwise.

Yes, the OS is in a better position to arbitrate between multiple things.  Of course, we aren't talking about the highest performance databases if we are talking about mixed-use systems.
Additionally, the OS can never really get it right, with a DB or other apps.  Any app can behave badly, grab too much RAM, and access it regularly enough for it to not be 'idle' much, giving the OS VM fits trying to figure out whether it's important relative to other processes.

> Now I agree that it would be good to get the entire index (or at least the
> working set of the index) into the memory of the computer. But does it
> really matter if it is in the system's cache, or the postgres cache? Is it
> any more likely to be in postgres's cache than in the system cache if the
> system is hurting for memory? I would think the system would be equally
> likely to page out "idle" pages no matter where they are unless they are
> locked to memory, and I do not know if all operating systems can do this,
> and even if they can, I do not know if postgres uses that ability. I doubt
> it, since I believe (at least in Linux) a process can do that only if run as
> root, which I imagine few (if any) users do.

The problem is when none of them are really 'idle': when the OS has to decide which pages, all of which have been accessed recently, to evict.  Most OS's will make bad choices if the load is mixed random and sequential access, as they treat all pages equally with respect to freshness versus eviction.
Another problem is that there IS a difference between being in postgres' cache and the OS cache.  One is significantly more expensive to retrieve than the other.

Accessing buffers in shared_buffers, in process, uses a good chunk less CPU (and data copy and shared buffer eviction overhead) than going over the syscall to the OS.

And as far as I can tell, even after the 8.4 fadvise patch, all I/O is in block_size chunks.  (hopefully I am wrong)
My system is now CPU bound; the I/O can do sequential reads of more than 1.2GB/sec but Postgres can't do a seqscan 30% as fast because it eats up CPU like crazy just reading and identifying tuples.  It does seqscans ~25% faster if it's from shared_buffers than from the OS's page cache though.   Seqscans are between 250MB/sec and 400MB/sec peak, from mem or disk, typically showing no more than 35% iostat utilization of the array if off disk -- so we run a few concurrently where we can.

In addition to the fadvise patch, postgres needs to merge adjacent I/O's into larger ones to reduce the overhead.  It only really needs to merge up to sizes of about 128k or 256k, to gain an 8x to 16x drop in syscall overhead, and additionally potentially save trips down the shared buffer management code paths.  At least, that's my guess; I haven't looked at any code and could be wrong.
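That merging idea can be sketched as a planning step (a guess at the approach, not anything from an actual patch): coalesce runs of adjacent page numbers before issuing reads, so a 32-page run costs one syscall instead of 32.

```python
PAGE = 8192
MAX_MERGE_PAGES = 32   # 32 * 8 kB = 256 kB per merged read, per the text

def merge_runs(pages):
    """Collapse sorted page numbers into (start_page, page_count) runs of
    adjacent pages, capped at MAX_MERGE_PAGES pages per run."""
    runs = []
    for p in sorted(pages):
        if runs:
            start, count = runs[-1]
            if p == start + count and count < MAX_MERGE_PAGES:
                runs[-1] = (start, count + 1)  # extend the current run
                continue
        runs.append((p, 1))                    # start a new run
    return runs

# Pages [0,1,2, 5,6, 9] become three reads instead of six:
print(merge_runs([0, 1, 2, 5, 6, 9]))  # [(0, 3), (5, 2), (9, 1)]
```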


Additionally, the "If your operating system has any reasonable caching itself" comment earlier in this conversation ---  Linux (2.6.18, Centos 5.2) does NOT.  I can easily make it spend 100% CPU in system time trying to figure out what to do with the system cache for an hour.  Just do large seqscans with memory pressure from work_mem or other forces that the OS will not deem 'idle'.  Once the requested memory is ~75% of the system total, it will freak out.  Linux simply will not give up that last 25% or so of the RAM for anything but page cache, even though the disk subsystem is very fast and most of the access is sequential, marginalizing the benefit of the cache.  Depending on how you tune it, it will either spin system cpu or swap storm, but the system cpu spin times for the same work load are a lot shorter than an equivalent swap storm.
Mount the data drive with O_DIRECT and the problem vanishes.  I've been told that this problem may be gone in some of the latest kernels.  I have seriously considered bumping shared_buffers up a lot and mounting the thing direct -- but then we lose the useful scheduler and readahead algorithms.  The other way around (small shared_buffers, let the OS do it) hurts performance overall quite a bit -- randomly accessed pages get pushed out to the OS cache more often, and the OS tosses those out when a big seqscan occurs, resulting in a lot more random access from disk and more disk-bound periods of time.  Great wonder, this operating system caching, eh?

In any event, don't hold up these OS page cache things as if they're the best thing in the world for a database, they have serious flaws themselves and typically are difficult or impossible to tune to be ideal for a database.

It's one thing to propose that a database build its own file system (hard, and why bother?) versus have a database manage its own page cache intelligently and access the OS file system as optimally as it can.  In the latter, the DB knows much more about what data is really important than the OS (and could, for example, prioritize cache versus work_mem intelligently, while the OS can get that one horribly wrong in my experience, and knows when a huge seqscan occurs to make caching those results low priority).  No matter how you do it using the OS cache, you cache twice and copy twice.  O_DIRECT isn't usually an option for other reasons; the OS disk scheduler, readahead, and other benefits of a file system are real and substantial.  If you are caching twice, you might as well have the "closer" copy of that data be the larger, more efficient pool.

As for tipping points and pg_bench -- it doesn't seem to reflect the kind of workload we use postgres for at all, though my workload does a lot of big hashes and seqscans, and I'm curious how much improved those may be due to the hash improvements.  32GB RAM and 3TB data (about 250GB scanned regularly) here.  And yes, we are almost completely CPU bound now except for a few tasks.  Iostat only reports above 65% disk utilization for about 5% of the workload duty-cycle, and is regularly < 20%.  COPY doesn't get anywhere near platter speeds on indexless bulk transfer.  The highest disk usage spikes occur when some of our random-access data/indexes get shoved out of cache.  These aren't too large, but high enough seqscan load will cause postgres and the OS to dump them from cache.  If we put these on some SSDs the disk utilization % would drop a lot further.

I feel confident in saying that in about a year, I could spec out a medium-sized hardware budget ($25k) for almost any postgres setup and make it almost purely CPU bound.
SSDs and hybrid tech such as ZFS L2ARC make this possible with easy access to 10k+ iops, and it will take no more than 12 SATA drives in raid 10 next year (and a good controller or software raid) to get 1GB/sec sequential reads.

Re: Need help with 8.4 Performance Testing

From
Matthew Wakeling
Date:
On Tue, 9 Dec 2008, Scott Carey wrote:
> For what it is worth, you can roughly double to triple the iops of an Intel X-25M on pure random reads if you queue up
> multiple concurrent reads rather than serialize them.  But it is not due to spindles, it is due to the latency of the
> SATA interface and the ability of the controller chip to issue reads to flash devices on different banks concurrently
> to some extent.

Spindles, banks. What's the difference? The fact is that you have multiple
"things", and queuing up requests in the controller means that more than
one at a time can be active. The X-25M is basically a RAID controller in
its own right, connected to ten flash devices.

Matthew

--
 Don't worry!  The world can't end today because it's already tomorrow
 in Australia.

Re: Need help with 8.4 Performance Testing

From
Josh Berkus
Date:
> I would expect higher shared_buffers to raise the curve before the first
> breakpoint but after the first breakpoint make the drop steeper and deeper.
> The equilibrium where the curve becomes flatter should be lower.

On SpecJAppserver specifically, I remember seeing a drop when the
database size grew beyond the size of shared_buffers.

--Josh

Re: Need help with 8.4 Performance Testing

From
Josh Berkus
Date:
Greg Stark wrote:
> On Sun, Dec 7, 2008 at 7:38 PM, Josh Berkus <josh@agliodbs.com> wrote:
>> Also, the following patches currently still have bugs, but when the bugs are
>> fixed I'll be looking for performance testers, so please either watch the
>> wiki or watch this space:
>> ...
>> -- posix_fadvise (Gregory Stark)
>
> Eh? Quite possibly but none that I'm aware of. The only problem is a
> couple of trivial bits of bitrot. I'll post an update now if you
> want.

I'm just going off the status on hackers archives.  I didn't actually
try to build it before posting that.

If you have an updated patch, link on CommitFest page?  Thanks.

--Josh

Re: Need help with 8.4 Performance Testing

From
Josh Berkus
Date:
Tom,

> Hmm ... I wonder whether this means that the current work on
> parallelizing I/O (the posix_fadvise patch in particular) is a dead
> end.  Because what that is basically going to do is expend more CPU
> to improve I/O efficiency.  If you believe this thesis then that's
> not the road we want to go down.

Nope.  People who administer small databases keep forgetting that there
is another class of users with multiple terabytes of data.  Those users
aren't getting away from spinning disk anytime in the next 5 years.

Additionally, by making PostgreSQL work better with OS-based FS
optimization, we are well positioned to take advantage of any special
features which Linux, Solaris, BSD etc. add to utilize new hardware like
SSDs.  posix_fadvise is a great example of this.

--Josh

Re: Need help with 8.4 Performance Testing

From
James Mansion
Date:
Scott Marlowe wrote:
> involves tiny bits of data scattered throughout the database.  Our
> current database is about 20-25 Gig, which means it's quickly reaching
> the point where it will not fit in our 32G of ram, and it's likely to
> grow too big for 64Gig before a year or two is out.
>
...
> I wonder how many hard drives it would take to be CPU bound on random
> access patterns?  About 40 to 60?  And probably 15k / SAS drives to
>
Well, it's not a very big database and you're seek bound -- so what's
wrong with the latest generation flash drives?  They're perfect for
what you want to do, it seems, and you can probably get what you need
using the new ARC cache on flash stuff in ZFS.



Re: Need help with 8.4 Performance Testing

From
Greg Smith
Date:
On Tue, 9 Dec 2008, Scott Carey wrote:

> My system is now CPU bound, the I/O can do sequential reads of more than
> 1.2GB/sec but Postgres can't do a seqscan 30% as fast because it eats up
> CPU like crazy just reading and identifying tuples... In addition to the
> fadvise patch, postgres needs to merge adjacent I/O's into larger ones
> to reduce the overhead.

Do you have any profile data to back that up?  I think it's more likely
that bottlenecks are on the tuple processing side of things as you also
suggested.  There's really no sense guessing; one quick session with
something like oprofile would be more informative than any amount of
speculation on what's going on.

> Additionally, the "If your operating system has any reasonable caching
> itself" comment earlier in this conversation --- Linux (2.6.18, Centos
> 5.2) does NOT.  I can easily make it spend 100% CPU in system time
> trying to figure out what to do with the system cache for an hour.

Have you ever looked into how much memory ends up showing up as
"Writeback" in /proc/meminfo when this happens?  The biggest problem with
that kernel out of the box on the kind of workload you're describing is
that it will try and buffer way too much on the write side by default,
which can easily get you into the sort of ugly situations you describe.
I regularly adjust that kernel to lower dirty_ratio in particular
dramatically from the default of 40 to keep that from happening.  I did a
whole blog entry on one of those if you're not familiar with this
particular bit of painful defaults already:
http://notemagnet.blogspot.com/2008/08/linux-write-cache-mystery.html
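For anyone trying this, the adjustment described above looks roughly like the following sysctl fragment.  The values are illustrative only, not recommendations; the right numbers depend on RAM size and workload, per the blog post.

```
# Hypothetical /etc/sysctl.conf fragment -- shrink the kernel's write-back
# buffering so it doesn't queue a huge amount of dirty data before flushing.
vm.dirty_ratio = 10              # default was 40 on 2.6.18-era kernels
vm.dirty_background_ratio = 5
```

You can watch the effect by checking the "Dirty" and "Writeback" lines in /proc/meminfo under load.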

> I feel confident in saying that in about a year, I could spec out a
> medium sized budget for hardware ($25k) for almost any postgres setup
> and make it almost pure CPU bound.

The largest database I manage is running on a Sun X4500, which is right at
that price point.  I've never seen it not be CPU bound.  Even though
people are pulling data that's spread across a few TB of disk, the only
time I ever see it straining to keep up with something there's always a
single CPU pegged.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD

Re: Need help with 8.4 Performance Testing

From
"Robert Haas"
Date:
On Tue, Dec 9, 2008 at 1:11 PM, Joshua D. Drake <jd@commandprompt.com> wrote:
>> Those intel SSDs sound compelling.  I've been waiting for SSDs to get
>> competitive price and performance wise for a while, and when the
>> intels came out and I read the first benchmarks I immediately began
>> scheming.  Sadly, that was right after we'd ordered our new 16 drive
>> servers, and I didn't have time to try something new and hope it would
>> work.  Now that the servers are up and running, we'll probably look at
>> adding the SSDs next summer before our high load period begins.
>
> The idea of SSDs is interesting. However I think I will wait for all the
> other early adopters to figure out the problems before I start
> suggesting them to clients.
>
>> Hard drives work, they're cheap and fast. I can get 25 spindles, 15k in a
> 3U with controller and battery backed cache for <$10k.

This may be a little off-topic, but I'd be interested in hearing more
details about how you (or others) would do this...  manufacturer,
model, configuration?  How many hard drives do you need to get 25
spindles?  And where can you get that many 15K hard drives for under
$10K?  My lack of experience in this area is showing here, but,
seriously, any suggestions appreciated.

...Robert

Re: Need help with 8.4 Performance Testing

From
david@lang.hm
Date:
On Sat, 13 Dec 2008, Robert Haas wrote:

> On Tue, Dec 9, 2008 at 1:11 PM, Joshua D. Drake <jd@commandprompt.com> wrote:
>>> Those intel SSDs sound compelling.  I've been waiting for SSDs to get
>>> competitive price and performance wise for a while, and when the
>>> intels came out and I read the first benchmarks I immediately began
>>> scheming.  Sadly, that was right after we're ordered our new 16 drive
>>> servers, and I didn't have time to try something new and hope it would
>>> work.  Now that the servers are up and running, we'll probably look at
>>> adding the SSDs next summer before our high load period begins.
>>
>> The idea of SSDs is interesting. However I think I will wait for all the
>> other early adopters to figure out the problems before I start
>> suggesting them to clients.
>>
>> Hard drives work, their cheap and fast. I can get 25 spindles, 15k in a
>> 3U with controller and battery backed cache for <$10k.
>
> This may be a little off-topic, but I'd be interested in hearing more
> details about how you (or others) would do this...  manufacturer,
> model, configuration?  How many hard drives do you need to get 25
> spindles?  And where can you get that many 15K hard drives for under
> $10K?  My lack of experience in this area is showing here, but,
> seriously, any suggestions appreciated.

1 spindle == 1 hard drive

David Lang

Re: Need help with 8.4 Performance Testing

From
"Scott Marlowe"
Date:
On Sat, Dec 13, 2008 at 6:22 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Tue, Dec 9, 2008 at 1:11 PM, Joshua D. Drake <jd@commandprompt.com> wrote:
>>> Those intel SSDs sound compelling.  I've been waiting for SSDs to get
>>> competitive price and performance wise for a while, and when the
>>> intels came out and I read the first benchmarks I immediately began
>>> scheming.  Sadly, that was right after we're ordered our new 16 drive
>>> servers, and I didn't have time to try something new and hope it would
>>> work.  Now that the servers are up and running, we'll probably look at
>>> adding the SSDs next summer before our high load period begins.
>>
>> The idea of SSDs is interesting. However I think I will wait for all the
>> other early adopters to figure out the problems before I start
>> suggesting them to clients.
>>
>> Hard drives work, their cheap and fast. I can get 25 spindles, 15k in a
>> 3U with controller and battery backed cache for <$10k.
>
> This may be a little off-topic, but I'd be interested in hearing more
> details about how you (or others) would do this...  manufacturer,
> model, configuration?  How many hard drives do you need to get 25
> spindles?  And where can you get that many 15K hard drives for under
> $10K?  My lack of experience in this area is showing here, but,
> seriously, any suggestions appreciated.

I don't know about fitting in a 3U, but you can get a 24 15k 73G SAS
drive machine with 8 gig ram and a single quad core CPU from
aberdeeninc.com for $10,055.00.

Re: Need help with 8.4 Performance Testing

From
"Joshua D. Drake"
Date:
On Sat, 2008-12-13 at 07:44 -0800, david@lang.hm wrote:
> On Sat, 13 Dec 2008, Robert Haas wrote:

> > This may be a little off-topic, but I'd be interested in hearing more
> > details about how you (or others) would do this...  manufacturer,
> > model, configuration?  How many hard drives do you need to get 25
> > spindles?  And where can you get that many 15K hard drives for under
> > $10K?  My lack of experience in this area is showing here, but,
> > seriously, any suggestions appreciated.
>


http://h71016.www7.hp.com/ctoBases.asp?oi=E9CED&BEID=19701&SBLID=&ProductLineId=450&FamilyId=2570&LowBaseId=15222&LowPrice=$1,899.00&familyviewgroup=757&viewtype=Matrix

Retail cost, 15k, 36GB drives, ~12k. A phone call and the threat of
buying Dell gets it for ~10k.

Joshua D. Drake


--
PostgreSQL
   Consulting, Development, Support, Training
   503-667-4564 - http://www.commandprompt.com/
   The PostgreSQL Company, serving since 1997


Re: Need help with 8.4 Performance Testing

From
"Scott Marlowe"
Date:
On Sat, Dec 13, 2008 at 11:37 AM, Joshua D. Drake <jd@commandprompt.com> wrote:
> On Sat, 2008-12-13 at 07:44 -0800, david@lang.hm wrote:
>> On Sat, 13 Dec 2008, Robert Haas wrote:
>
>> > This may be a little off-topic, but I'd be interested in hearing more
>> > details about how you (or others) would do this...  manufacturer,
>> > model, configuration?  How many hard drives do you need to get 25
>> > spindles?  And where can you get that many 15K hard drives for under
>> > $10K?  My lack of experience in this area is showing here, but,
>> > seriously, any suggestions appreciated.
>>
>
>
http://h71016.www7.hp.com/ctoBases.asp?oi=E9CED&BEID=19701&SBLID=&ProductLineId=450&FamilyId=2570&LowBaseId=15222&LowPrice=$1,899.00&familyviewgroup=757&viewtype=Matrix
>
> Retail cost, 15k, 36GB drives, ~ 12k. A phone call and threat of buying
> dell gets its for ~ 10k.

I prefer to deal with companies that I don't have to horse-trade with
to get a good deal.  You can threaten Dell and get good deals, but if
you're not watching and let your boss handle an order, suddenly you've
got an $800 Perc 5e on your hands.

Re: Need help with 8.4 Performance Testing

From
"Joshua D. Drake"
Date:
On Sat, 2008-12-13 at 12:45 -0700, Scott Marlowe wrote:
> On Sat, Dec 13, 2008 at 11:37 AM, Joshua D. Drake <jd@commandprompt.com> wrote:
> > On Sat, 2008-12-13 at 07:44 -0800, david@lang.hm wrote:
> >> On Sat, 13 Dec 2008, Robert Haas wrote:

> >
http://h71016.www7.hp.com/ctoBases.asp?oi=E9CED&BEID=19701&SBLID=&ProductLineId=450&FamilyId=2570&LowBaseId=15222&LowPrice=$1,899.00&familyviewgroup=757&viewtype=Matrix
> >
> > Retail cost, 15k, 36GB drives, ~ 12k. A phone call and threat of buying
> > dell gets its for ~ 10k.
>
> I prefer to deal with companies that I don't have to horse trade with
> to get a good deal.  You can threaten dell and get good deals, but if

And what company would that be? There is no major server manufacturer
that doesn't do the "Oh, you have a competitive bid... let's just lower
that quote for you."

Note: HP can beat Dell, every time on an apples to apples quote. At
least when I have done it.

Joshua D. Drake



--
PostgreSQL
   Consulting, Development, Support, Training
   503-667-4564 - http://www.commandprompt.com/
   The PostgreSQL Company, serving since 1997


Re: Need help with 8.4 Performance Testing

From
"Scott Marlowe"
Date:
On Sat, Dec 13, 2008 at 12:47 PM, Joshua D. Drake <jd@commandprompt.com> wrote:
> On Sat, 2008-12-13 at 12:45 -0700, Scott Marlowe wrote:
>> On Sat, Dec 13, 2008 at 11:37 AM, Joshua D. Drake <jd@commandprompt.com> wrote:
>> > On Sat, 2008-12-13 at 07:44 -0800, david@lang.hm wrote:
>> >> On Sat, 13 Dec 2008, Robert Haas wrote:
>
>> >
http://h71016.www7.hp.com/ctoBases.asp?oi=E9CED&BEID=19701&SBLID=&ProductLineId=450&FamilyId=2570&LowBaseId=15222&LowPrice=$1,899.00&familyviewgroup=757&viewtype=Matrix
>> >
>> > Retail cost, 15k, 36GB drives, ~ 12k. A phone call and threat of buying
>> > dell gets its for ~ 10k.
>>
>> I prefer to deal with companies that I don't have to horse trade with
>> to get a good deal.  You can threaten dell and get good deals, but if
>
> And what company would that be? There is zero major server manufacturer
> that doesn't do the, "Oh you have a competitive bid... let's just lower
> that quote for you"
>
> Note: HP can beat Dell, every time on an apples to apples quote. At
> least when I have done it.

Aberdeen inc, the one I listed in my previous response.  Their on site
price for a single quad core xeon, 8 gig 800MHZ ram and 24 15k5 SAS
drives is $10,080 or so.  I've never had to tell them I was getting a
better price anywhere else.  They just give me a great quote each
time, they have very fast and efficient customer service, and they
give a 5 year warranty on everything they custom build for you.  I'm a
very satisfied customer.

BTW, they seem to have saved a fair bit of money and passed the
savings along to you and me, by designing their website in one whole
afternoon.  :)  Yeah, it's ugly, but it works well enough to get an
idea what you want and then ask for a quote.

We got our 8 core opteron 32 Gig ram areca 1680i and 16 15k4 Seagates
for $11,000 or so at the end of summer.  When we first plugged them
in, one machine started hanging after 24-48 hours of steady load
testing.  Called them, and they said it sounded like a RAID card and
they'd ship me a new one the next business day.  Let me make it clear: I
didn't tell them I thought it was the RAID card, or that I needed it
the next day; they decided that's what it sounded like, and they
shipped it to me, FOC.  Arrived the next business day, which was a
Monday.  Between Friday and Monday I swapped the RAID controllers
between the two machines, and after 36 hours of testing, the problem
followed the RAID controller and the other machine locked up.  Got the
new controller, put it in and ran a week-long test, no failures.  We
have since had one drive fail (each machine has two hot spares) and
they cross-shipped the replacement and we got it in 3 days.

Few places have made me so happy with a decision to spend $22k on
servers as these guys have.

Re: Need help with 8.4 Performance Testing

From
"Joshua D. Drake"
Date:
On Sat, 2008-12-13 at 12:57 -0700, Scott Marlowe wrote:
> On Sat, Dec 13, 2008 at 12:47 PM, Joshua D. Drake <jd@commandprompt.com> wrote:
> > On Sat, 2008-12-13 at 12:45 -0700, Scott Marlowe wrote:
> >> On Sat, Dec 13, 2008 at 11:37 AM, Joshua D. Drake <jd@commandprompt.com> wrote:
> >> > On Sat, 2008-12-13 at 07:44 -0800, david@lang.hm wrote:
> >> >> On Sat, 13 Dec 2008, Robert Haas wrote:

> > Note: HP can beat Dell, every time on an apples to apples quote. At
> > least when I have done it.
>
> Aberdeen inc, the one I listed in my previous response.

Sorry didn't see it.

>  Their on site
> price for a single quad core xeon, 8 gig 800MHZ ram and 24 15k5 SAS
> drives is $10,080 or so.  I've never had to tell them I was getting a
> better price anywhere else.  They just give me a great quote each
> time, they have very fast and efficient customer service, and they
> give a 5 year warranty on everything they custom build for you.  I'm a
> very satisfied customer.

Well, that would work for CMD, but CMD's customers would say, "Who the
heck is Aberdeen?"


> Few places have made me so happy with a decision to spend $22k on
> servers as these guys have.
>

Well that is definitely a solid testimony. I don't have any experience
with them but I know that the smaller companies always provide better
service so I have no reason to doubt you. The larger the company gets
the harder it is to get through the muck of bureaucracy.

I have to be honest though: unless the customer explicitly states they
don't have a problem with white box, it's going to be HP, Dell or IBM,
occasionally Sun, but once they see how ridiculous the initial quotes
from Sun are they generally don't want to deal with them anymore.

Sincerely,

Joshua D. Drake



--
PostgreSQL
   Consulting, Development, Support, Training
   503-667-4564 - http://www.commandprompt.com/
   The PostgreSQL Company, serving since 1997


Re: Need help with 8.4 Performance Testing

From
Angel Alvarez
Date:
On Saturday, 13 December 2008, Scott Marlowe wrote:
> On Sat, Dec 13, 2008 at 11:37 AM, Joshua D. Drake <jd@commandprompt.com> wrote:
> > On Sat, 2008-12-13 at 07:44 -0800, david@lang.hm wrote:
> >> On Sat, 13 Dec 2008, Robert Haas wrote:
> >
> >> > This may be a little off-topic, but I'd be interested in hearing more
> >> > details about how you (or others) would do this...  manufacturer,
> >> > model, configuration?  How many hard drives do you need to get 25
> >> > spindles?  And where can you get that many 15K hard drives for under
> >> > $10K?  My lack of experience in this area is showing here, but,
> >> > seriously, any suggestions appreciated.
> >>
> >
> >
http://h71016.www7.hp.com/ctoBases.asp?oi=E9CED&BEID=19701&SBLID=&ProductLineId=450&FamilyId=2570&LowBaseId=15222&LowPrice=$1,899.00&familyviewgroup=757&viewtype=Matrix
> >
> > Retail cost, 15k, 36GB drives, ~ $12k.  A phone call and the threat of
> > buying Dell gets it for ~ $10k.
>
> I prefer to deal with companies that I don't have to horse trade with
> to get a good deal.  You can threaten dell and get good deals, but if
> you're not watching, and let your boss handle an order suddenly you've
> got an $800 Perc 5e on your hands.
>
Are Perc 5s bad cards?  Is that because they're SATA?

And what about the Perc 4?  Those are SCSI.



Thanks in advance


--
Don't print this email unless you have to; you'll need the money from the paper to pay for Mr. Gore's talks.
->>-----------------------------------------------
    Clist UAH a.k.a Angel
---------------------------------[www.uah.es]-<<--
"Politicians and diapers have to be changed every so often."

Re: Need help with 8.4 Performance Testing

From
"Scott Marlowe"
Date:
On Sat, Dec 13, 2008 at 1:03 PM, Joshua D. Drake <jd@commandprompt.com> wrote:
> On Sat, 2008-12-13 at 12:57 -0700, Scott Marlowe wrote:
>> On Sat, Dec 13, 2008 at 12:47 PM, Joshua D. Drake <jd@commandprompt.com> wrote:
>> > On Sat, 2008-12-13 at 12:45 -0700, Scott Marlowe wrote:
>> >> On Sat, Dec 13, 2008 at 11:37 AM, Joshua D. Drake <jd@commandprompt.com> wrote:
>> >> > On Sat, 2008-12-13 at 07:44 -0800, david@lang.hm wrote:
>> >> >> On Sat, 13 Dec 2008, Robert Haas wrote:
>
>> > Note: HP can beat Dell, every time on an apples to apples quote. At
>> > least when I have done it.
>>
>> Aberdeen inc, the one I listed in my previous response.
>
> Sorry didn't see it.
>
>>  Their on site
>> price for a single quad core xeon, 8 gig 800MHZ ram and 24 15k5 SAS
>> drives is $10,080 or so.  I've never had to tell them I was getting a
>> better price anywhere else.  They just give me a great quote each
>> time, they have very fast and efficient customer service, and they
>> give a 5 year warranty on everything they custom build for you.  I'm a
>> very satisfied customer.
>
> Well, that would work for CMD, but CMD's customers would say, "Who the
> heck is Aberdeen?"

For me, it's the company I recommend.

Here's the point I make: if there's any problem whatsoever with
Aberdeen equipment when I'm putting it in, I'll come fix it and
interface with the vendor.  If you pick someone else, YOU get to
interface and replace parts; I'm not doing it.  Their time versus my
time.  The same is true of anything else that I recommend.

>> Few places have made me so happy with a decision to spend $22k on
>> servers as these guys have.
>
> Well that is definitely a solid testimony. I don't have any experience
> with them but I know that the smaller companies always provide better
> service so I have no reason to doubt you. The larger the company gets
> the harder it is to get through the muck of bureaucracy.
>
> I have to be honest though: unless the customer explicitly states they
> don't have a problem with white box, it's going to be HP, Dell or IBM,
> occasionally Sun, but once they see how ridiculous the initial quotes
> from Sun are they generally don't want to deal with them anymore.

I've made it quite clear to bosses in the past that if they stick me
with dud hardware (i.e. Dell 2650s with Adaptec RAID controllers, et
al.) then I will not be handling the long, grinding fight with the
vendor.  I'll just sell it on eBay if it won't work, and we'll order
what I said the first time.  After one or two times of that happening,
they don't second-guess me much anymore.  My current boss is much more
progressive, and while he had his doubts, he too is now sold.

Isn't it amazing how many small businesses won't buy from other small
businesses?  They'd much rather give their money to a company they
don't like because "they'll be around a while" (the big company).

Re: Need help with 8.4 Performance Testing

From
"Joshua D. Drake"
Date:
On Sat, 2008-12-13 at 19:16 -0700, Scott Marlowe wrote:

> Isn't it amazing how many small businesses won't buy from other small
> businesses?  They'd much rather give their money to a company they
> don't like because "they'll be around a while" (the big company).
>

True enough!

Joshua D. Drake



Re: Need help with 8.4 Performance Testing

From
Tom Lane
Date:
"Scott Marlowe" <scott.marlowe@gmail.com> writes:
> Isn't it amazing how many small businesses won't buy from other small
> businesses?

It's entertaining that Dell is considered one of the "safe choices"
in this thread.  They were a pretty small business not so long ago
(and remain a lot smaller than IBM or HP) ...

            regards, tom lane