Thread: Postgres configuration for 64 CPUs, 128 GB RAM...


From
"Marc Mamin"
Date:

Postgres configuration for 64 CPUs, 128 GB RAM...

Hello,

We have the opportunity to benchmark our application on a large server. I have to prepare the Postgres configuration and I'd appreciate some comments on it, as I am not experienced with servers of such a scale. Moreover, the configuration should be fail-proof as I won't be able to attend the tests.

Our application (Java + Perl) and Postgres will run on the same server, though application activity is low while Postgres is processing large transactions.

There is a large gap between our current production server (Linux, 4 GB RAM, 4 CPUs) and the benchmark server; one of the targets of this benchmark is to verify the scalability of our application.

And you have no reason to be envious, as the server doesn't belong to us :-)

Thanks for your comments,

Marc Mamin



Postgres version: 8.2.1

Server Specifications:
----------------------

Sun SPARC Enterprise M8000 Server:

http://www.sun.com/servers/highend/m8000/specs.xml

File system:

http://en.wikipedia.org/wiki/ZFS

Planned configuration:
--------------------------------

# we don't expect more than 150 parallel connections,
# but I suspect a leak in our application that leaves some idle connections open

max_connections=2000

ssl = off

#maximum allowed
shared_buffers= 262143

# on our current best production server with 4GB RAM (not dedicated to Postgres), work_mem is set to 600 MB
# this limitation is probably the bottleneck for our application as the files in pgsql_tmp grow up to 15 GB
# during large aggregations (we have a locking mechanism to avoid parallel processing of such transactions)
work_mem = 31457280  # (30 GB)

# index creation time is also an issue for us; the process blocks other large processes too.
# our largest table so far is 13 GB + 11 GB indexes
maintenance_work_mem = 31457280  # (30 GB)

# more than the max number of tables +indexes expected during the benchmark
max_fsm_relations = 100000

max_fsm_pages = 1800000

# don't know if I should modify this.
# seems to be sufficient on our production servers
max_stack_depth = 2MB

# vacuum will be done by hand between each test session
autovacuum = off

# required to analyse the benchmark
log_min_duration_statement = 1000

max_prepared_transactions = 100

# seems to be required to drop schema/roles containing large number of objects
max_locks_per_transaction = 128


# I use the defaults for the bgwriter as I couldn't find recommendations on those

#bgwriter_delay = 200ms                 # 10-10000ms between rounds
#bgwriter_lru_percent = 1.0             # 0-100% of LRU buffers scanned/round
#bgwriter_lru_maxpages = 5              # 0-1000 buffers max written/round
#bgwriter_all_percent = 0.333           # 0-100% of all buffers scanned/round
#bgwriter_all_maxpages = 5              # 0-1000 buffers max written/round

#WAL

fsync = on

#use default
#wal_sync_method

# we are using 32 on our production system
wal_buffers=64

# we haven't done any testing with this parameter yet, but it shouldn't be a relevant
# point as our performance focus is on large transactions
commit_delay = 0

#CHECKPOINT

# xlog will be  on a separate disk
checkpoint_segments=256

checkpoint_timeout = 5min

Re: Postgres configuration for 64 CPUs, 128 GB RAM...

From
Stefan Kaltenbrunner
Date:
Marc Mamin wrote:
>
> Postgres configuration for 64 CPUs, 128 GB RAM...

there are probably not that many installations out there that are that large -
comments below

>
> Hello,
>
> We have the oppotunity to benchmark our application on a large server. I
> have to prepare the Postgres configuration and I'd appreciate some
> comments on it as I am not experienced with servers of such a scale.
> Moreover the configuration should be fail-proof as I won't be able to
> attend the tests.
>
> Our application (java + perl) and Postgres will run on the same server,
> whereas the application activity is low when Postgres has large
> transactions to process.
>
> There is a large gap between our current produtcion server (Linux, 4GB
> RAM, 4 cpus) and the benchmark server; one of the target of this
> benchmark is to verify the scalability of our application.
>

[...]
> Posgres version: 8.2.1

upgrade to 8.2.4

> File system:
>
> _http://en.wikipedia.org/wiki/ZFS_

way more important is what kind of disk-IO subsystem you have attached ...

>
>
>
> Planned configuration:
> --------------------------------
>
> # we don't expect more than 150 parallel connections,
> # but I suspect a leak in our application that let some idle connections
> open
>
> max_connections=2000
>
> ssl = off
>
> #maximum allowed
> shared_buffers= 262143

this is probably on the lower side for a 128GB box

>
> # on our current best production server with 4GB RAM (not dedicated to
> Postgres), work_mem is set to 600 MB
> # this limitation is probably the bottleneck for our application as the
> files in pgsql_tmp grows up to 15 GB
> # during large aggregations (we have a locking mechanismus to avoid
> parallel processing of such transactions)
> work_mem = 31457280  # (30 GB)

this is simply ridiculous - work_mem is PER SORT - so if your query
requires 8 sorts it will feel free to use 8x30GB, and that needs to be
multiplied by the number of concurrent connections.
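To put rough numbers on it (connection and sort counts assumed purely for
illustration): 150 active connections each running a query with 4 sorts at
30GB apiece could ask for 150 x 4 x 30GB = 18TB. A safer sketch is a small
global value plus a per-session override for the single controlled
aggregation job:

work_mem = 32MB                 # postgresql.conf, global default

SET work_mem = '1GB';           -- only in the session running the big aggregation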

>
> # index creation time is also an issue for us; the process is locking
> other large processes too.
> # our largest table so far is 13 GB + 11 GB indexes
> maintenance_work_mem = 31457280  # (30 GB)

this is ridiculous too - testing has shown that there is not much point
in going beyond 1GB or so

>
> # more than the max number of tables +indexes expected during the benchmark
> max_fsm_relations = 100000
>
> max_fsm_pages = 1800000

this is probably way too low for a database the size of yours - watch the
output of VACUUM VERBOSE on a database-wide vacuum for some stats on that.
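For example, a database-wide run looks like this, and the closing lines
report what the FSM actually needs (wording approximate for 8.2; the numbers
are whatever your data produces):

VACUUM VERBOSE;
...
INFO:  free space map contains ... pages in ... relations
DETAIL:  A total of ... page slots are in use (including overhead).
... page slots are required to track all free space.

If the "required" figure is above max_fsm_pages, raise max_fsm_pages beyond it.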

>
> # don't know if I schoud modify this.
> # seems to be sufficient on our production servers
> max_stack_depth = 2MB
>
> # vacuum will be done per hand between each test session
> autovacuum = off
>
>
>
> # required to analyse the benchmark
> log_min_duration_statement = 1000
>
>
> max_prepared_transaction = 100
>
>
> # seems to be required to drop schema/roles containing large number of
> objects
> max_locks_per_transaction = 128
>
>
>
>
> # I use the default for the bgwriter as I couldnt find recommendation on
> those
>
> #bgwriter_delay = 200ms                 # 10-10000ms between rounds
> #bgwriter_lru_percent = 1.0             # 0-100% of LRU buffers
> scanned/round
> #bgwriter_lru_maxpages = 5              # 0-1000 buffers max written/round
> #bgwriter_all_percent = 0.333           # 0-100% of all buffers
> scanned/round
> #bgwriter_all_maxpages = 5              # 0-1000 buffers max written/round
>
>
> #WAL
>
> fsync = on
>
> #use default
> #wal_sync_method
>
> # we are using 32 on our production system
> wal_buffers=64

values up to 512 or so have been reported to help on systems with very
high concurrency


what is missing here is your settings for:

effective_cache_size

and

random_page_cost
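For instance, something in the spirit of (values assumed; effective_cache_size
should roughly reflect what the OS page cache will actually hold):

effective_cache_size = 100GB    # a large share of the 128GB will be OS cache
random_page_cost = 3            # worth experimenting with, given so much cache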



Stefan

Re: Postgres configuration for 64 CPUs, 128 GB RAM...

From
"Jim C. Nasby"
Date:
On Tue, Jul 17, 2007 at 04:10:30PM +0200, Marc Mamin wrote:
> shared_buffers= 262143

You should at least try some runs with this set far, far larger. At
least 10% of memory, but it'd be nice to see what happens with this set
to 50% or higher as well (though don't set it larger than the database
since it'd be a waste).
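For scale: 262143 buffers x 8kB is only about 2GB of the 128GB of RAM.
Assuming the kernel shared memory limits (SHMMAX and friends) are raised to
match, the two test points suggested above would look roughly like:

shared_buffers = 13GB           # ~10% of RAM; 8.2 accepts memory units here
#shared_buffers = 64GB          # ~50% of RAM, for a second run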

How big is the database, anyway?

> # on our current best production server with 4GB RAM (not dedicated to
> Postgres), work_mem is set to 600 MB
> # this limitation is probably the bottleneck for our application as the
> files in pgsql_tmp grows up to 15 GB
> # during large aggregations (we have a locking mechanismus to avoid
> parallel processing of such transactions)

Keep in mind that a good filesystem will be caching most of pgsql_tmp if
it can.

> max_prepared_transaction = 100

Are you using 2PC? If not, there's no reason to touch this (you could
just set it to 0).

> # I use the default for the bgwriter as I couldnt find recommendation on
> those
>
> #bgwriter_delay = 200ms                 # 10-10000ms between rounds
> #bgwriter_lru_percent = 1.0             # 0-100% of LRU buffers
> scanned/round
> #bgwriter_lru_maxpages = 5              # 0-1000 buffers max
> written/round
> #bgwriter_all_percent = 0.333           # 0-100% of all buffers
> scanned/round
> #bgwriter_all_maxpages = 5              # 0-1000 buffers max
> written/round

You'll probably want to increase both maxpages parameters substantially,
assuming that you've got good IO hardware.

> #CHECKPOINT
>
> # xlog will be  on a separate disk
> checkpoint_segments=256
>
> checkpoint_timeout = 5min

The further apart your checkpoints, the better. Might want to look at 10
minutes. I'd also set checkpoint_warning to just a bit below
checkpoint_timeout and watch for warnings to make sure you're not
checkpointing a lot more frequently than you're expecting.
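A sketch of that combination (the exact checkpoint_warning value is just a
suggestion, anything a bit under the timeout works):

checkpoint_segments = 256
checkpoint_timeout = 10min
checkpoint_warning = 9min       # logs whenever checkpoints come faster than this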
--
Jim Nasby                                      decibel@decibel.org
EnterpriseDB      http://enterprisedb.com      512.569.9461 (cell)


Re: Postgres configuration for 64 CPUs, 128 GB RAM...

From
Gregory Stark
Date:
"Marc Mamin" <M.Mamin@intershop.de> writes:

> We have the oppotunity to benchmark our application on a large server. I
> have to prepare the Postgres configuration and I'd appreciate some
> comments on it as I am not experienced with servers of such a scale.
> Moreover the configuration should be fail-proof as I won't be able to
> attend the tests.

I really think that's a recipe for disaster. Even on a regular machine you
need to treat tuning as an on-going feedback process. There's no such thing as
a fail-proof configuration since every application is different.

On an exotic machine like this you're going to run into unique problems that
nobody here can anticipate with certainty.

--
  Gregory Stark
  EnterpriseDB          http://www.enterprisedb.com


Re: Postgres configuration for 64 CPUs, 128 GB RAM...

From
Greg Smith
Date:
On Tue, 17 Jul 2007, Marc Mamin wrote:

> Moreover the configuration should be fail-proof as I won't be able to
> attend the tests.

This is unreasonable.  The idea that you'll get a magic perfect
configuration in one shot suggests a fundamental misunderstanding of how
work like this is done.  If there's any way you could adjust things so
that, say, you were allowed to give at least 4 different tuning setups and
you got a report back with each of the results for them, that would let
you design a much better test set.

> Posgres version: 8.2.1

This has already been mentioned, but it really is critical for your type
of test to run 8.2.4 instead so I wanted to emphasize it.  There is a
major scalability bug in 8.2.1.  I'm going to ignore the other things that
other people have already commented on (all the suggestions Stefan and
Jim already made are good ones you should heed) and try to fill in the
remaining gaps instead.

> # I use the default for the bgwriter as I couldnt find recommendation on
> those

The defaults are so small that it will barely do anything on a server of
your size.  Tuning it properly so that it's effective but doesn't waste a
lot of resources is tricky, which is why you haven't found such
recommendations--they're fairly specific to what you're doing and require
some testing to get right.  If you want to see an example from a big
server, look at

http://www.spec.org/jAppServer2004/results/res2007q3/jAppServer2004-20070606-00065.html#DBDatabase_SW_Config0

That's tuned for a very specific benchmark though.  Here's a fairly
generic set of parameters that would be much more aggressive than the
defaults, while not going so far as to waste too many resources if the
writer is just getting in the way on your server:

bgwriter_delay = 200ms
bgwriter_lru_percent = 3.0
bgwriter_lru_maxpages = 500
bgwriter_all_percent = 1.0
bgwriter_all_maxpages = 250

> #WAL
> fsync = on
> #use default
> #wal_sync_method

I'd expect wal_sync_method=open_datasync would outperform the default, but
you'd really want to test both ways here to be sure.  The fact that the
Sun results I referenced above use the default of fdatasync makes me
hesitate to recommend that change too strongly, as I haven't worked with
this particular piece of hardware. See
http://www.westnet.com/~gsmith/content/postgresql/TuningPGWAL.htm for more
information about this parameter.
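In practice that just means two otherwise identical benchmark runs, e.g.:

#wal_sync_method = fdatasync        # run 1: the platform default mentioned above
wal_sync_method = open_datasync     # run 2: the alternative to compare against it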

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD

Re: Postgres configuration for 64 CPUs, 128 GB RAM...

From
"Strong, David"
Date:
>"Marc Mamin" <M.Mamin@intershop.de> writes:
>
>> We have the oppotunity to benchmark our application on a large
server. I
>> have to prepare the Postgres configuration and I'd appreciate some
>> comments on it as I am not experienced with servers of such a scale.
>> Moreover the configuration should be fail-proof as I won't be able to
>> attend the tests.
>
>I really think that's a recipe for disaster. Even on a regular machine
you
>need to treat tuning as an on-going feedback process. There's no such
thing >as
>a fail-proof configuration since every application is different.
>
>On an exotic machine like this you're going to run into unique problems
>that
>nobody here can anticipate with certainty.
>
>--
>  Gregory Stark
>  EnterpriseDB          http://www.enterprisedb.com
>

Marc,

You're getting a lot of good advice for your project. Let me be another
to reiterate that upgrading to Postgres 8.2.4 will bring added
performance and scalability benefits.

Others have mentioned that you do have to be data driven and
unfortunately that is true. All you can really do is pick a reasonable
starting point and run a test to create a baseline number. Then,
monitor, make small changes and test again. That's the only way you're
going to find the best configuration for your system. This will take
time and effort.

In addition, everything involved in your testing must scale - not just
Postgres. For example, if your driver hardware or driver software does
not scale, you won't be able to generate enough throughput for your
application or Postgres. The same goes for all of your networking
equipment and any other hardware servers/software that might be involved
in the test environment. So, you really have to monitor at all levels
i.e. don't just focus on the database platform.

I took a quick look at the Sun M8000 server link you provided. I don't
know the system specifically so I might be mistaken, but it looks like
it is configured with 4 sockets per CPU board and 4 CPU boards per
system. Each CPU board looks like it has the ability to take 128GB RAM.
In this case, you will have to keep an eye on how Solaris is binding
(affinitizing) processes to CPU cores and/or boards. Any time a process
is bound to a new CPU board it's likely that there will be a number of
cache invalidations to move data the process was working on from the old
board to the new board. In addition, the moved process may still
continue to refer to memory it allocated on the old board. This can be
quite expensive. Typically, the more CPU cores/CPU boards you have, the
more likely this will happen. I'm no Solaris expert so I don't know if
there is a better way of doing this, but you might consider using the
psrset or pbind commands to bind Postgres backend processes to a
specific CPU core or range of cores. If choosing a range of cores, these
should be on the same CPU board. Again, through monitoring, you'll have
to determine how many CPU cores each backend really needs and then
you'll have to determine how best to spread the backends out over each
of the CPU boards.
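As a purely hypothetical illustration of that idea (the processor IDs, the
returned set ID and the pgrep pattern are all placeholders; check the
psrset/pbind man pages on the actual box first):

# create a processor set from cores that sit on one CPU board
psrset -c 0 1 2 3
# suppose it reports "created processor set 1"; bind the postmaster to it,
# so that future backend children inherit the binding
psrset -b 1 `pgrep -f postmaster`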

Good luck.

David

Re: Postgres configuration for 64 CPUs, 128 GB RAM...

From
PFC
Date:
On Tue, 17 Jul 2007, Marc Mamin wrote:

> We have the opportunity to benchmark our application on a large server. I
> have to prepare the Postgres configuration and I'd appreciate some
> comments on it as I am not experienced with servers of such a scale.
> Moreover the configuration should be fail-proof as I won't be able to
> attend the tests.
>
> Our application (java + perl) and Postgres will run on the same server,
> whereas the application activity is low when Postgres has large
> transactions to process.

    Please, can you be more specific about your application :

    - what does it do ?
    - what kind of workload does it generate ?
    [ie: many concurrent small queries (website) ; few huge queries,
reporting, warehousing, all of the above, something else ?]
    - percentage and size of update queries ?
    - how many concurrent threads / connections / clients do you serve on a
busy day ?
    (I don't mean online users on a website, but ACTIVE concurrent database
connections)

    I assume you find your current server is too slow or foresee it will
become too slow soon and want to upgrade, so :

    - what makes the current server's performance inadequate ? is it IO, CPU,
RAM, a mix ? which proportions in the mix ?

    This is very important. If you go to the dealer and ask "I need a better
vehicle", he'll sell you a Porsche. But if you say "I need a better vehcle
to carry two tons of cinderblocks" he'll sell you something else I guess.
Same with database servers. You could need some humongous CPU power, but
you might as well not. Depends.

> There is a large gap between our current produtcion server (Linux, 4GB
> RAM, 4 cpus) and the benchmark server; one of the target of this
> benchmark is to verify the scalability of our application.

    Define scalability. (no this isn't a joke, I mean, you know your
application, how would you like it to "scale" ? How do you think it will
scale ? Why ? What did you do so it would scale well ? etc.)




Re: Postgres configuration for 64 CPUs, 128 GB RAM...

From
Josh Berkus
Date:
Marc,

> Server Specifications:
> ----------------------
>
> Sun SPARC Enterprise M8000 Server:
>
> http://www.sun.com/servers/highend/m8000/specs.xml
>
> File system:
>
> http://en.wikipedia.org/wiki/ZFS

There are some specific tuning parameters you need for ZFS or performance
is going to suck.

http://www.solarisinternals.com/wiki/index.php/ZFS_Best_Practices_Guide
(scroll down to "PostgreSQL")
http://www.sun.com/servers/coolthreads/tnb/applications_postgresql.jsp
http://bugs.opensolaris.org/view_bug.do?bug_id=6437054

You also don't say anything about what kind of workload you're running.


--
--Josh

Josh Berkus
PostgreSQL @ Sun
San Francisco

Re: Postgres configuration for 64 CPUs, 128 GB RAM...

From
"Gavin M. Roy"
Date:
Having done something similar recently, I would recommend that you look at adding connection pooling with pgBouncer in transaction pooling mode between your benchmark app and PgSQL.  In our application we have about 2000 clients funneling down to 30 backends and are able to sustain a large transactions-per-second volume.  This has been the #1 key to success for us in running on monster hardware.
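A minimal pgbouncer.ini sketch of that funnel (database name, port and pool sizes are assumptions based on the numbers above):

[databases]
benchdb = host=127.0.0.1 port=5432 dbname=benchdb

[pgbouncer]
listen_addr = 127.0.0.1
listen_port = 6432
auth_type = md5
auth_file = users.txt
pool_mode = transaction        ; release the server connection at end of transaction
max_client_conn = 2000         ; what the application side may open
default_pool_size = 30         ; actual Postgres backends per database/user pair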

Regards,

Gavin


Re: Postgres configuration for 64 CPUs, 128 GB RAM...

From
"Luke Lonergan"
Date:
Josh,

On 7/20/07 4:26 PM, "Josh Berkus" <josh@agliodbs.com> wrote:

> There are some specific tuning parameters you need for ZFS or performance
> is going to suck.
>
> http://www.solarisinternals.com/wiki/index.php/ZFS_Best_Practices_Guide
> (scroll down to "PostgreSQL")
> http://www.sun.com/servers/coolthreads/tnb/applications_postgresql.jsp
> http://bugs.opensolaris.org/view_bug.do?bug_id=6437054
>
> You also don't say anything about what kind of workload you're running.


I think we're assuming that the workload is OLTP when putting these tuning
guidelines forward.  Note that the ZFS tuning guidance referred to in this
bug article recommends "turning vdev prefetching off" for "random I/O
(databases)".  This is exactly the opposite of what we should do for OLAP
workloads.

Also, the lore that setting recordsize on ZFS is mandatory for good database
performance is similarly not appropriate for OLAP work.

If the workload is OLAP / Data Warehousing, I'd suggest ignoring all of the
tuning information from Sun that refers generically to "database".  The
untuned ZFS performance should be far better in those cases.  Specifically,
these three should be ignored:
- (ignore this) limit ARC memory use
- (ignore this) set recordsize to 8K
- (ignore this) turn off vdev prefetch

- Luke



Re: Postgres configuration for 64 CPUs, 128 GB RAM...

From
"Marc Mamin"
Date:
Hello,

thank you for all your comments and recommendations.

I'm aware that the conditions for this benchmark are not ideal, mostly
due to the lack of time to prepare it. We will also need an additional
benchmark on a less powerful - more realistic - server to better
understand the scalability of our application.


Our application is based on Java and generates dynamic reports from
log file content. Dynamic means here that a report is calculated
from the Postgres data the first time it is requested (it will then be
cached). Java is used to drive the data preparation and to
handle/generate the report requests.

This is much more an OLAP system than an OLTP one, at least as far as our
performance concerns go.




Data preparation:

1) Parsing the log files, with heavy use of Perl (regular expressions),
to generate CSV files. Prepared statements also maintain reference
tables in the DB. Postgres performance is not an issue for this first
step.

2) Loading the CSV files with COPY. As around 70% of the data to load
goes into a single daily table, we don't allow concurrent jobs for this
step. We have between a few and a few hundred files to load into a
single table; they are processed one after the other. A primary key is
always defined; in the case where the required indexes are already built
and the new data is above a given size, we use a "shadow"
table instead (without the indexes), build the indexes after the import
and then replace the live table with the shadow one.
For example, we have a table of 13 GB + 11 GB of indexes (5 of them).
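For reference, a SQL sketch of that shadow-table swap (table, column and
file names are made up; the real job uses its own):

BEGIN;
CREATE TABLE daily_shadow (LIKE daily INCLUDING DEFAULTS);
COPY daily_shadow FROM '/data/load/part_001.csv' WITH CSV;
-- ... one COPY per file ...
ALTER TABLE daily_shadow ADD PRIMARY KEY (log_time, line_id);  -- indexes only after loading
ALTER TABLE daily RENAME TO daily_old;
ALTER TABLE daily_shadow RENAME TO daily;
COMMIT;
DROP TABLE daily_old;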

Performance questions:

    a) Is there an "ideal" size to aim for with our CSV files (100 x 10 MB,
or rather 1 x 1 GB)?
    b) maintenance_work_mem: I'll use around 1 GB, as recommended by
Stefan.

3) Data aggregation. This is the heaviest part for Postgres. On our
current system some queries need over one hour, with phases of around
100% CPU use alternating with periods of heavy I/O load when temporary
results are written to/read from disk (pgsql_tmp). During the
aggregation, other Postgres activity is low (at least it should be), as
this takes place at night. Currently we have a locking mechanism
to avoid having more than one such query running concurrently. This
may be too strict for the benchmark server, but it better reflects our
current hardware capabilities.

Performance: here we should favor a single huge transaction and
assume a low probability of another transaction requiring large
sort space at the same time. Considering this, is it reasonable to set
work_mem to 3 GB? (I guess I should raise this parameter dynamically
before running the aggregation queries.)
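Whatever value turns out to be reasonable, it can be raised for just the
aggregation session instead of globally, e.g. (a sketch, using the 3 GB
figure discussed above):

BEGIN;
SET LOCAL work_mem = '3GB';
-- ... the large aggregation query ...
COMMIT;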

4) Queries (report generation)

We have only a few requests which are not satisfying, and those require
large sort operations. The data are structured in different aggregation
levels (minutes, hours, days) with logical time-based partitions in order
to limit the data size to compute for a given report. Moreover, we can
scale our infrastructure by using separate or dedicated Postgres servers
for different customers. Smaller customers may share the same instance,
each of them having its own schema (the lock mechanism for large
aggregations applies to a whole Postgres instance, not to a single
customer). The benchmark will help us to plan such a distribution.
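For what it's worth, one common way to express such time-based partitions
in 8.x is table inheritance plus CHECK constraints, so the planner can skip
partitions via constraint exclusion (a sketch with made-up names; it may or
may not match the "logical partitions" described here):

CREATE TABLE hits_2007_07 (
    CHECK (log_time >= DATE '2007-07-01' AND log_time < DATE '2007-08-01')
) INHERITS (hits);
-- and in postgresql.conf:
-- constraint_exclusion = on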

During the benchmark, we will probably not have more than 50 non-idle
connections simultaneously. It is a bit too early for us to fine-tune
this part; the benchmark will mainly focus on steps 1 to 3.

During the benchmark, the DB will reach a size of about 400 GB,
simulating 3 different customers, with the data split roughly equally
across 3 schemas.



I will post our configuration(s) later on.



Thanks again for all your valuable input.

Marc Mamin

Re: Postgres configuration for 64 CPUs, 128 GB RAM...

From
Dimitri
Date:
Luke,

The ZFS tuning advice is not coming from general suggestions, but from real
practice...

So,
  - limiting the ARC is a MUST for the moment to keep your database running
comfortably (especially for DWH!)
  - the 8K recordsize is chosen so that exactly one page is read when PG asks
to read one page - don't mix it up with prefetch! When prefetch is detected,
ZFS will read the next blocks without any demand from PG; but otherwise,
why read more pages each time PG asks for only one?...
  - prefetch is of course not needed for OLTP, but it helps on OLAP/DWH,
agreed :) (see the sketch below)
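For concreteness, the three knobs as they are usually set (the pool/filesystem
name and the ARC cap are placeholders; double-check against the ZFS tuning
guide linked earlier before applying anything):

# ZFS property, to be set before loading the data; 8k matches PostgreSQL's page size
zfs set recordsize=8k tank/pgdata

# /etc/system entries, take effect after a reboot:
* cap the ARC (example value of 16 GB - pick one that leaves room for Postgres)
set zfs:zfs_arc_max = 0x400000000
* disable file-level prefetch for random-I/O workloads
set zfs:zfs_prefetch_disable = 1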

Rgds,
-Dimitri



Re: Postgres configuration for 64 CPUs, 128 GB RAM...

From
"Luke Lonergan"
Date:

Hi Dimitri,

Can you post some experimental evidence that these settings matter?

At this point we have several hundred terabytes of PG databases running on ZFS, all of them setting speed records for data warehouses.

We did testing on these settings last year on S10U2, perhaps things have changed since then.

- Luke

Msg is shrt cuz m on ma treo


Re: Postgres configuration for 64 CPUs, 128 GB RAM...

From
Dimitri
Date:
Hi Luke,

On the same Solaris Internals wiki page you may find links to the study
with the db_STRESS benchmark (done on UFS and ZFS with PostgreSQL,
MySQL and Oracle - well, the Oracle results are removed, but at least I may
say it called for the same tuning as PgSQL). Tests were done on
Sol10u3 (you may also find all the other platform details in the report
document)...

Also, while the block size adjustment is more or less self-explanatory (don't
read 32K if you need only 8K - with a huge data volume you'll simply
waste your cache; if you're doing a full scan, let the prefetch
algorithm work for you), the ARC (cache) limitation probably needs more
explanation. Well, I cannot even say there is a real problem with it - its
implementation is just a bit too aggressive :)) If all your running
programs fit into 1GB of RAM you may leave the ARC size at the default
(it leaves 1GB of system RAM free). Otherwise, you should limit the ARC to
keep your workload execution comfortable: the ARC allocates memory very
quickly, and every time your program needs more RAM it enters into
competition with the ARC... In my tests I observed short workload freezes
during such periods and I did not like that too much :)) especially with
high connection numbers :))

Well, we could spend hours discussing this :)  (sorry to be short, I have
very limited mail access for the moment)...

However, ZFS is improving all the time and works better and better
with every Solaris release, so probably all current tuning will be
different or obsolete at the end of this year :))

BTW, I forgot to mention: you'll need Solaris 10u4, or at least 10u3
with all recent patches applied, to run the M8000 at full power.

Best regards!
-Dimitri



Re: Postgres configuration for 64 CPUs, 128 GB RAM...

From
"Luke Lonergan"
Date:
Marc,

You should expect that for the kind of OLAP workload you describe in steps 2
and 3 you will have exactly one CPU working for you in Postgres.

If you want to accelerate the speed of this processing by a factor of 100 or
more on this machine, you should try Greenplum DB which is Postgres 8.2
compatible.  Based on the overall setup you describe, you may have a hybrid
installation with GPDB doing the reporting / OLAP workload and the other
Postgres databases handling the customer workloads.

- Luke

