Thread: Postgres configuration for 64 CPUs, 128 GB RAM...
Postgres configuration for 64 CPUs, 128 GB RAM...
Hello,
We have the opportunity to benchmark our application on a large server. I have to prepare the Postgres configuration and I'd appreciate some comments on it, as I am not experienced with servers of such a scale. Moreover, the configuration should be fail-proof as I won't be able to attend the tests.
Our application (java + perl) and Postgres will run on the same server, whereas the application activity is low when Postgres has large transactions to process.
There is a large gap between our current production server (Linux, 4GB RAM, 4 cpus) and the benchmark server; one of the targets of this benchmark is to verify the scalability of our application.
And you have no reason to be envious, as the server doesn't belong to us :-)
Thanks for your comments,
Marc Mamin
Postgres version: 8.2.1
Server Specifications:
----------------------
Sun SPARC Enterprise M8000 Server:
http://www.sun.com/servers/highend/m8000/specs.xml
File system:
http://en.wikipedia.org/wiki/ZFS
Planned configuration:
--------------------------------
# we don't expect more than 150 parallel connections,
# but I suspect a leak in our application that leaves some idle connections open (see the query sketch below the config)
max_connections=2000
ssl = off
#maximum allowed
shared_buffers= 262143
# on our current best production server with 4GB RAM (not dedicated to Postgres), work_mem is set to 600 MB
# this limitation is probably the bottleneck for our application as the files in pgsql_tmp grow up to 15 GB
# during large aggregations (we have a locking mechanism to avoid parallel processing of such transactions)
work_mem = 31457280 # (30 GB)
# index creation time is also an issue for us; the process is locking other large processes too.
# our largest table so far is 13 GB + 11 GB indexes
maintenance_work_mem = 31457280 # (30 GB)
# more than the max number of tables +indexes expected during the benchmark
max_fsm_relations = 100000
max_fsm_pages = 1800000
# don't know if I should modify this.
# seems to be sufficient on our production servers
max_stack_depth = 2MB
# vacuum will be done by hand between each test session
autovacuum = off
# required to analyse the benchmark
log_min_duration_statement = 1000
max_prepared_transactions = 100
# seems to be required to drop schemas/roles containing a large number of objects
max_locks_per_transaction = 128
# I use the defaults for the bgwriter as I couldn't find recommendations on those
#bgwriter_delay = 200ms # 10-10000ms between rounds
#bgwriter_lru_percent = 1.0 # 0-100% of LRU buffers scanned/round
#bgwriter_lru_maxpages = 5 # 0-1000 buffers max written/round
#bgwriter_all_percent = 0.333 # 0-100% of all buffers scanned/round
#bgwriter_all_maxpages = 5 # 0-1000 buffers max written/round
#WAL
fsync = on
#use default
#wal_sync_method
# we are using 32 on our production system
wal_buffers=64
# we haven't done any testing with this parameter so far, but this shouldn't be a relevant
# point as our performance focus is on large transactions
commit_delay = 0
#CHECKPOINT
# xlog will be on a separate disk
checkpoint_segments=256
checkpoint_timeout = 5min
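
A quick way to spot the suspected idle connections mentioned in the config
comments above, sketched against the stock 8.2 pg_stat_activity view
(nothing application-specific assumed):

  -- list backends that are currently idle, oldest activity first
  SELECT datname, procpid, usename, query_start
  FROM pg_stat_activity
  WHERE current_query = '<IDLE>'
  ORDER BY query_start;

If these pile up into the hundreds, max_connections = 2000 is mostly
headroom for the leak rather than for real concurrency.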
Marc Mamin wrote:
> Postgres configuration for 64 CPUs, 128 GB RAM...

there are probably not that many installations out there that are that
large - comments below

> Hello,
>
> We have the opportunity to benchmark our application on a large server. I
> have to prepare the Postgres configuration and I'd appreciate some
> comments on it, as I am not experienced with servers of such a scale.
> Moreover, the configuration should be fail-proof as I won't be able to
> attend the tests.
>
> Our application (java + perl) and Postgres will run on the same server,
> whereas the application activity is low when Postgres has large
> transactions to process.
>
> There is a large gap between our current production server (Linux, 4GB
> RAM, 4 cpus) and the benchmark server; one of the targets of this
> benchmark is to verify the scalability of our application.
> [...]
> Postgres version: 8.2.1

upgrade to 8.2.4

> File system:
>
> http://en.wikipedia.org/wiki/ZFS

way more important is what kind of disk-IO subsystem you have attached ...

> Planned configuration:
> --------------------------------
>
> # we don't expect more than 150 parallel connections,
> # but I suspect a leak in our application that leaves some idle
> connections open
>
> max_connections=2000
>
> ssl = off
>
> #maximum allowed
> shared_buffers= 262143

this is probably on the lower side for a 128GB box

> # on our current best production server with 4GB RAM (not dedicated to
> Postgres), work_mem is set to 600 MB
> # this limitation is probably the bottleneck for our application as the
> files in pgsql_tmp grow up to 15 GB
> # during large aggregations (we have a locking mechanism to avoid
> parallel processing of such transactions)
> work_mem = 31457280 # (30 GB)

this is simply ridiculous - work_mem is PER SORT - so if your query
requires 8 sorts it will feel free to use 8x30GB, and it needs to be
multiplied by the number of concurrent connections.

> # index creation time is also an issue for us; the process is locking
> other large processes too.
> # our largest table so far is 13 GB + 11 GB indexes
> maintenance_work_mem = 31457280 # (30 GB)

this is ridiculous too - testing has shown that there is not much point
in going beyond 1GB or so

> # more than the max number of tables + indexes expected during the benchmark
> max_fsm_relations = 100000
>
> max_fsm_pages = 1800000

this is probably way too low for a database the size of yours - watch the
output of VACUUM VERBOSE on a database-wide vacuum for some stats on that.

> # don't know if I should modify this.
> # seems to be sufficient on our production servers
> max_stack_depth = 2MB
>
> # vacuum will be done by hand between each test session
> autovacuum = off
>
> # required to analyse the benchmark
> log_min_duration_statement = 1000
>
> max_prepared_transactions = 100
>
> # seems to be required to drop schemas/roles containing a large number of
> objects
> max_locks_per_transaction = 128
>
> # I use the defaults for the bgwriter as I couldn't find recommendations
> on those
>
> #bgwriter_delay = 200ms           # 10-10000ms between rounds
> #bgwriter_lru_percent = 1.0       # 0-100% of LRU buffers scanned/round
> #bgwriter_lru_maxpages = 5        # 0-1000 buffers max written/round
> #bgwriter_all_percent = 0.333     # 0-100% of all buffers scanned/round
> #bgwriter_all_maxpages = 5        # 0-1000 buffers max written/round
>
> #WAL
>
> fsync = on
>
> #use default
> #wal_sync_method
>
> # we are using 32 on our production system
> wal_buffers=64

values up to 512 or so have been reported to help on systems with very
high concurrency

what is missing here is your settings for: effective_cache_size and
random_page_cost

Stefan
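
To put a number behind Stefan's max_fsm_pages remark, the usual check is a
database-wide VACUUM VERBOSE. A hedged sketch of what to look for (the
exact wording of the summary varies between releases; the figures below
are only placeholders):

  -- run as superuser on the whole database and read the closing summary
  VACUUM VERBOSE;
  -- 8.2 ends with a report roughly like:
  --   INFO:  free space map contains ... pages in ... relations
  --   DETAIL: A total of ... page slots are in use (including overhead).
  --           ... page slots are required to track all free space.
  --           Current limits are: 1800000 page slots, 100000 relations, ...
  -- raise max_fsm_pages if the "required" figure approaches the limit.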
On Tue, Jul 17, 2007 at 04:10:30PM +0200, Marc Mamin wrote:
> shared_buffers= 262143

You should at least try some runs with this set far, far larger. At least
10% of memory, but it'd be nice to see what happens with this set to 50%
or higher as well (though don't set it larger than the database since it'd
be a waste). How big is the database, anyway?

> # on our current best production server with 4GB RAM (not dedicated to
> Postgres), work_mem is set to 600 MB
> # this limitation is probably the bottleneck for our application as the
> files in pgsql_tmp grow up to 15 GB
> # during large aggregations (we have a locking mechanism to avoid
> parallel processing of such transactions)

Keep in mind that a good filesystem will be caching most of pgsql_tmp if
it can.

> max_prepared_transactions = 100

Are you using 2PC? If not, there's no reason to touch this (you could just
set it to 0).

> # I use the defaults for the bgwriter as I couldn't find recommendations
> on those
>
> #bgwriter_delay = 200ms           # 10-10000ms between rounds
> #bgwriter_lru_percent = 1.0       # 0-100% of LRU buffers scanned/round
> #bgwriter_lru_maxpages = 5        # 0-1000 buffers max written/round
> #bgwriter_all_percent = 0.333     # 0-100% of all buffers scanned/round
> #bgwriter_all_maxpages = 5        # 0-1000 buffers max written/round

You'll probably want to increase both maxpages parameters substantially,
assuming that you've got good IO hardware.

> #CHECKPOINT
>
> # xlog will be on a separate disk
> checkpoint_segments=256
>
> checkpoint_timeout = 5min

The further apart your checkpoints, the better. Might want to look at 10
minutes. I'd also set checkpoint_warning to just a bit below
checkpoint_timeout and watch for warnings to make sure you're not
checkpointing a lot more frequently than you're expecting.
--
Jim Nasby                                      decibel@decibel.org
EnterpriseDB    http://enterprisedb.com        512.569.9461 (cell)
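
For anyone unsure what Jim means by 2PC: max_prepared_transactions only
matters when the application explicitly issues two-phase commits, e.g.
through an XA transaction manager. A minimal sketch (the table is made
up):

  BEGIN;
  UPDATE accounts SET balance = balance - 100 WHERE id = 42;
  PREPARE TRANSACTION 'xfer-42';   -- survives a crash, keeps its locks
  -- later, possibly from another session:
  COMMIT PREPARED 'xfer-42';       -- or ROLLBACK PREPARED 'xfer-42';

If the java/perl code never does this, the setting can stay at 0 as Jim
suggests.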
"Marc Mamin" <M.Mamin@intershop.de> writes: > We have the oppotunity to benchmark our application on a large server. I > have to prepare the Postgres configuration and I'd appreciate some > comments on it as I am not experienced with servers of such a scale. > Moreover the configuration should be fail-proof as I won't be able to > attend the tests. I really think that's a recipe for disaster. Even on a regular machine you need to treat tuning as an on-going feedback process. There's no such thing as a fail-proof configuration since every application is different. On an exotic machine like this you're going to run into unique problems that nobody here can anticipate with certainty. -- Gregory Stark EnterpriseDB http://www.enterprisedb.com
On Tue, 17 Jul 2007, Marc Mamin wrote:

> Moreover, the configuration should be fail-proof as I won't be able to
> attend the tests.

This is unreasonable. The idea that you'll get a magic perfect
configuration in one shot suggests a fundamental misunderstanding of how
work like this is done. If there's any way you could adjust things so
that, say, you were allowed to give at least 4 different tuning setups and
you got a report back with the results for each of them, that would let
you design a much better test set.

> Postgres version: 8.2.1

This has already been mentioned, but it really is critical for your type
of test to run 8.2.4 instead, so I wanted to emphasize it. There is a
major scalability bug in 8.2.1.

I'm going to ignore the other things that other people have already
commented on (all the suggestions Stefan and Jim already made are good
ones you should heed) and try to fill in the remaining gaps instead.

> # I use the defaults for the bgwriter as I couldn't find recommendations
> on those

The defaults are so small that it will barely do anything on a server of
your size. Tuning it properly so that it's effective but doesn't waste a
lot of resources is tricky, which is why you haven't found such
recommendations--they're fairly specific to what you're doing and require
some testing to get right. If you want to see an example from a big
server, look at
http://www.spec.org/jAppServer2004/results/res2007q3/jAppServer2004-20070606-00065.html#DBDatabase_SW_Config0

That's tuned for a very specific benchmark though. Here's a fairly generic
set of parameters that would be much more aggressive than the defaults,
while not going so far as to waste too many resources if the writer is
just getting in the way on your server:

bgwriter_delay = 200ms
bgwriter_lru_percent = 3.0
bgwriter_lru_maxpages = 500
bgwriter_all_percent = 1.0
bgwriter_all_maxpages = 250

> #WAL
> fsync = on
> #use default
> #wal_sync_method

I'd expect wal_sync_method=open_datasync would outperform the default, but
you'd really want to test both ways here to be sure. The fact that the Sun
results I referenced above use the default of fdatasync makes me hesitate
to recommend that change too strongly, as I haven't worked with this
particular piece of hardware. See
http://www.westnet.com/~gsmith/content/postgresql/TuningPGWAL.htm for more
information about this parameter.

--
* Greg Smith gsmith@gregsmith.com http://www.gregsmith.com Baltimore, MD
>"Marc Mamin" <M.Mamin@intershop.de> writes: > >> We have the oppotunity to benchmark our application on a large server. I >> have to prepare the Postgres configuration and I'd appreciate some >> comments on it as I am not experienced with servers of such a scale. >> Moreover the configuration should be fail-proof as I won't be able to >> attend the tests. > >I really think that's a recipe for disaster. Even on a regular machine you >need to treat tuning as an on-going feedback process. There's no such thing >as >a fail-proof configuration since every application is different. > >On an exotic machine like this you're going to run into unique problems >that >nobody here can anticipate with certainty. > >-- > Gregory Stark > EnterpriseDB http://www.enterprisedb.com > Marc, You're getting a lot of good advice for your project. Let me be another to reiterate that upgrading to Postgres 8.2.4 will bring added performance and scalability benefits. Others have mentioned that you do have to be data driven and unfortunately that is true. All you can really do is pick a reasonable starting point and run a test to create a baseline number. Then, monitor, make small changes and test again. That's the only way you're going to find the best configuration for your system. This will take time and effort. In addition, everything involved in your testing must scale - not just Postgres. For example, if your driver hardware or driver software does not scale, you won't be able to generate enough throughput for your application or Postgres. The same goes for your all of your networking equipment and any other hardware servers/software that might be involved in the test environment. So, you really have to monitor at all levels i.e. don't just focus on the database platform. I took a quick look at the Sun M8000 server link you provided. I don't know the system specifically so I might be mistaken, but it looks like it is configured with 4 sockets per CPU board and 4 CPU boards per system. Each CPU board looks like it has the ability to take 128GB RAM. In this case, you will have to keep an eye on how Solaris is binding (affinitizing) processes to CPU cores and/or boards. Any time a process is bound to a new CPU board it's likely that there will be a number of cache invalidations to move data the process was working on from the old board to the new board. In addition, the moved process may still continue to refer to memory it allocated on the old board. This can be quite expensive. Typically, the more CPU cores/CPU boards you have, the more likely this will happen. I'm no Solaris expert so I don't know if there is a better way of doing this, but you might consider using the psrset or pbind commands to bind Postgres backend processes to a specific CPU core or range of cores. If choosing a range of cores, these should be on the same CPU board. Again, through monitoring, you'll have to determine how many CPU cores each backend really needs and then you'll have to determine how best to spread the backends out over each of the CPU boards. Good luck. David
On Tue, 17 Jul 2007, Marc Mamin wrote:

> We have the opportunity to benchmark our application on a large server. I
> have to prepare the Postgres configuration and I'd appreciate some
> comments on it, as I am not experienced with servers of such a scale.
> Moreover, the configuration should be fail-proof as I won't be able to
> attend the tests.
>
> Our application (java + perl) and Postgres will run on the same server,
> whereas the application activity is low when Postgres has large
> transactions to process.

Please, can you be more specific about your application:

- what does it do ?
- what kind of workload does it generate ? [ie: many concurrent small
  queries (website); few huge queries, reporting, warehousing, all of the
  above, something else ?]
- percentage and size of update queries ?
- how many concurrent threads / connections / clients do you serve on a
  busy day ? (I don't mean online users on a website, but ACTIVE
  concurrent database connections)

I assume you find your current server is too slow or foresee it will
become too slow soon and want to upgrade, so:

- what makes the current server's performance inadequate ? is it IO, CPU,
  RAM, a mix ? which proportions in the mix ?

This is very important. If you go to the dealer and ask "I need a better
vehicle", he'll sell you a Porsche. But if you say "I need a better
vehicle to carry two tons of cinderblocks" he'll sell you something else I
guess. Same with database servers. You could need some humongous CPU
power, but you might as well not. Depends.

> There is a large gap between our current production server (Linux, 4GB
> RAM, 4 cpus) and the benchmark server; one of the targets of this
> benchmark is to verify the scalability of our application.

Define scalability. (no this isn't a joke, I mean, you know your
application, how would you like it to "scale" ? How do you think it will
scale ? Why ? What did you do so it would scale well ? etc.)
Marc,

> Server Specifications:
> ----------------------
>
> Sun SPARC Enterprise M8000 Server:
>
> http://www.sun.com/servers/highend/m8000/specs.xml
>
> File system:
>
> http://en.wikipedia.org/wiki/ZFS

There are some specific tuning parameters you need for ZFS or performance
is going to suck.

http://www.solarisinternals.com/wiki/index.php/ZFS_Best_Practices_Guide
(scroll down to "PostgreSQL")
http://www.sun.com/servers/coolthreads/tnb/applications_postgresql.jsp
http://bugs.opensolaris.org/view_bug.do?bug_id=6437054

You also don't say anything about what kind of workload you're running.

--
--Josh

Josh Berkus
PostgreSQL @ Sun
San Francisco
Josh,

On 7/20/07 4:26 PM, "Josh Berkus" <josh@agliodbs.com> wrote:

> There are some specific tuning parameters you need for ZFS or performance
> is going to suck.
>
> http://www.solarisinternals.com/wiki/index.php/ZFS_Best_Practices_Guide
> (scroll down to "PostgreSQL")
> http://www.sun.com/servers/coolthreads/tnb/applications_postgresql.jsp
> http://bugs.opensolaris.org/view_bug.do?bug_id=6437054
>
> You also don't say anything about what kind of workload you're running.

I think we're assuming that the workload is OLTP when putting these tuning
guidelines forward. Note that the ZFS tuning guidance referred to in this
bug article recommends "turning vdev prefetching off" for "random I/O
(databases)". This is exactly the opposite of what we should do for OLAP
workloads.

Also, the lore that setting recordsize on ZFS is mandatory for good
database performance is similarly not appropriate for OLAP work.

If the workload is OLAP / Data Warehousing, I'd suggest ignoring all of
the tuning information from Sun that refers generically to "database". The
untuned ZFS performance should be far better in those cases. Specifically,
these three should be ignored:
- (ignore this) limit ARC memory use
- (ignore this) set recordsize to 8K
- (ignore this) turn off vdev prefetch

- Luke
Hello,

Thank you for all your comments and recommendations.

I'm aware that the conditions for this benchmark are not ideal, mostly due
to the lack of time to prepare it. We will also need an additional
benchmark on a less powerful - more realistic - server to better
understand the scalability of our application.

Our application is based on java and is generating dynamic reports from
log file content. Dynamic means here that a report will be calculated from
the Postgres data the first time it is requested (it will then be cached).
Java is used to drive the data preparation and to handle/generate the
report requests.

This is much more an OLAP system than an OLTP one, at least as far as our
performance concern goes.

Data preparation:

1) Parsing the log files with a heavy use of perl (regular expressions) to
generate csv files. Prepared statements also maintain reference tables in
the DB. Postgres performance is not an issue for this first step.

2) Loading the csv files with COPY. As around 70% of the data to load come
in a single daily table, we don't allow concurrent jobs for this step. We
have between a few and a few hundred files to load into a single table;
they are processed one after the other. A primary key is always defined;
for the case when the required indexes are already built and when the new
data are above a given size, we use a "shadow" table instead (without the
indexes), build the indexes after the import and then replace the live
table with the shadow one. For example, we have a table of 13 GB + 11 GB
indexes (5 pieces).

Performance:

a) is there an "ideal" size to consider for our csv files (100 x 10 MB or
better 1 x 1 GB)?
b) maintenance_work_mem: I'll use around 1 GB as recommended by Stefan

3) Data aggregation. This is the heaviest part for Postgres. On our
current system some queries need over one hour, with phases of around 100%
cpu use alternating with times of heavy i/o load when temporary results
are written to/read from disk (pgsql_tmp). During the aggregation, other
Postgres activity is low (at least it should be) as this should take place
at night. Currently we have a locking mechanism to avoid having more than
one such query running concurrently. This may be too strict for the
benchmark server but better reflects our current hardware capabilities.

Performance: here we should favor a single huge transaction and consider a
low probability of having another transaction requiring large sort space.
Considering this, is it reasonable to define work_mem as 3 GB (I guess I
should raise this parameter dynamically before running the aggregation
queries)?

4) Queries (report generation)

We have only a few requests which are not satisfying while requiring large
sort operations. The data are structured in different aggregation levels
(minutes, hours, days) with logical time-based partitions in order to
limit the data size to compute for a given report. Moreover, we can scale
our infrastructure by using different or dedicated Postgres servers for
different customers. Smaller customers may share a same instance, each of
them having its own schema (the lock mechanism for large aggregations
applies to a whole Postgres instance, not to a single customer). The
benchmark will help us to plan such distribution.

During the benchmark, we will probably not have more than 50 non-idle
connections simultaneously. It is a bit too early for us to fine-tune this
part.

The benchmark will mainly focus on steps 1 to 3.

During the benchmark, the DB will reach a size of about 400 GB, simulating
3 different customers, with the data quite equally split across 3 schemas.

I will post our configuration(s) later on.

Thanks again for all your valuable input.

Marc Mamin
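
A rough sketch of the shadow-table load from step 2, just to make the
sequence concrete - table, column and file names are placeholders, not our
real schema, and the COPY list would normally be generated by the loader:

  SET maintenance_work_mem = '1GB';       -- per Stefan's recommendation
  BEGIN;
  CREATE TABLE daily_facts_shadow (LIKE daily_facts);   -- no indexes yet
  COPY daily_facts_shadow FROM '/data/csv/part_0001.csv' WITH CSV;
  -- ... one COPY per csv file ...
  ALTER TABLE daily_facts_shadow ADD PRIMARY KEY (id);
  CREATE INDEX daily_facts_shadow_ts_idx ON daily_facts_shadow (log_time);
  ALTER TABLE daily_facts RENAME TO daily_facts_old;
  ALTER TABLE daily_facts_shadow RENAME TO daily_facts;
  COMMIT;
  DROP TABLE daily_facts_old;

For step 3 the same per-session idea applies: SET work_mem = '3GB' (or
whatever the server accepts) just before the aggregation statement and
RESET it afterwards, so the large value never has to live in
postgresql.conf.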
Luke,

ZFS tuning is not coming from general suggestion ideas, but from real
practice...

So,
- limit ARC is the MUST for the moment to keep your database running
  comfortable (specially DWH!)
- 8K blocksize is chosen to read exactly one page when PG ask to read one
  page - don't mix it with prefetch! when prefetch is detected, ZFS will
  read next blocks without any demand from PG; but otherwise why you need
  to read more pages each time PG asking only one?...
- prefetch of course not needed for OLTP, but helps on OLAP/DWH, agree :)

Rgds,
-Dimitri

On 7/22/07, Luke Lonergan <llonergan@greenplum.com> wrote:
> Josh,
>
> On 7/20/07 4:26 PM, "Josh Berkus" <josh@agliodbs.com> wrote:
>
> > There are some specific tuning parameters you need for ZFS or
> > performance is going to suck.
> >
> > http://www.solarisinternals.com/wiki/index.php/ZFS_Best_Practices_Guide
> > (scroll down to "PostgreSQL")
> > http://www.sun.com/servers/coolthreads/tnb/applications_postgresql.jsp
> > http://bugs.opensolaris.org/view_bug.do?bug_id=6437054
> >
> > You also don't say anything about what kind of workload you're running.
>
> I think we're assuming that the workload is OLTP when putting these tuning
> guidelines forward. Note that the ZFS tuning guidance referred to in this
> bug article recommends "turning vdev prefetching off" for "random I/O
> (databases)". This is exactly the opposite of what we should do for OLAP
> workloads.
>
> Also, the lore that setting recordsize on ZFS is mandatory for good
> database performance is similarly not appropriate for OLAP work.
>
> If the workload is OLAP / Data Warehousing, I'd suggest ignoring all of
> the tuning information from Sun that refers generically to "database".
> The untuned ZFS performance should be far better in those cases.
> Specifically, these three should be ignored:
> - (ignore this) limit ARC memory use
> - (ignore this) set recordsize to 8K
> - (ignore this) turn off vdev prefetch
>
> - Luke
Hi Dimitri,
Can you post some experimental evidence that these settings matter?
At this point we have several hundred terabytes of PG databases running on ZFS, all of them setting speed records for data warehouses.
We did testing on these settings last year on S10U2, perhaps things have changed since then.
- Luke
Msg is shrt cuz m on ma treo
-----Original Message-----
From: Dimitri [mailto:dimitrik.fr@gmail.com]
Sent: Monday, July 30, 2007 05:26 PM Eastern Standard Time
To: Luke Lonergan
Cc: Josh Berkus; pgsql-performance@postgresql.org; Marc Mamin
Subject: Re: [PERFORM] Postgres configuration for 64 CPUs, 128 GB RAM...
Luke,
ZFS tuning is not coming from general suggestion ideas, but from real
practice...
So,
- limit ARC is the MUST for the moment to keep your database running
comfortable (specially DWH!)
- 8K blocksize is chosen to read exactly one page when PG ask to
read one page - don't mix it with prefetch! when prefetch is detected,
ZFS will read next blocks without any demand from PG; but otherwise
why you need to read more pages each time PG asking only one?...
- prefetch of course not needed for OLTP, but helps on OLAP/DWH, agree :)
Rgds,
-Dimitri
On 7/22/07, Luke Lonergan <llonergan@greenplum.com> wrote:
> Josh,
>
> On 7/20/07 4:26 PM, "Josh Berkus" <josh@agliodbs.com> wrote:
>
> > There are some specific tuning parameters you need for ZFS or performance
> > is going to suck.
> >
> > http://www.solarisinternals.com/wiki/index.php/ZFS_Best_Practices_Guide
> > (scroll down to "PostgreSQL")
> > http://www.sun.com/servers/coolthreads/tnb/applications_postgresql.jsp
> > http://bugs.opensolaris.org/view_bug.do?bug_id=6437054
> >
> > You also don't say anything about what kind of workload you're running.
>
>
> I think we're assuming that the workload is OLTP when putting these tuning
> guidelines forward. Note that the ZFS tuning guidance referred to in this
> bug article recommend "turning vdev prefetching off" for "random I/O
> (databases)". This is exactly the opposite of what we should do for OLAP
> workloads.
>
> Also, the lore that setting recordsize on ZFS is mandatory for good database
> performance is similarly not appropriate for OLAP work.
>
> If the workload is OLAP / Data Warehousing, I'd suggest ignoring all of the
> tuning information from Sun that refers generically to "database". The
> untuned ZFS performance should be far better in those cases. Specifically,
> these three should be ignored:
> - (ignore this) limit ARC memory use
> - (ignore this) set recordsize to 8K
> - (ignore this) turn off vdev prefetch
>
> - Luke
>
>
>
Hi Luke,

On the same page of the Solaris internals wiki you may find links to the
study with the db_STRESS benchmark (done on UFS and ZFS with PostgreSQL,
MySQL and Oracle - well, the Oracle results are removed, but at least I
may say it got the same tuning as PgSQL). Tests were done on Sol10u3 (you
may find all other platform details in the report document)...

Also, while the block size adjustment is more or less transparent (don't
read 32K if you need only 8K - with a huge data volume you'll simply waste
your cache; in case you're doing a full scan, leave the prefetch algorithm
to work for you), the ARC (cache) limitation probably needs more light.
Well, I cannot even say there is any problem with it - it just has a
rather aggressive implementation :)) If all your running programs fit into
1GB of RAM you may leave the ARC size at its default (it leaves 1GB of
system RAM free). Otherwise, you should limit ARC to keep your workload
execution comfortable: ARC allocates memory very quickly, and every time
your program needs more RAM it enters into concurrency with ARC... In my
tests I observed short workload freezes during such periods and I did not
like it too much :)) specially with high connection numbers :))

well, we may spend hours discussing this :) (sorry to be short, I have
very limited mail access for the moment)... However, ZFS is improving all
the time and works better and better with every Solaris release, so
probably all current tuning will be different or obsolete at the end of
this year :))

BTW, forgot to mention, you'll need Solaris 10u4, or at least 10u3 with
all recent patches applied, to run the M8000 at full power.

Best regards!
-Dimitri

On 7/30/07, Luke Lonergan <LLonergan@greenplum.com> wrote:
> Hi Dimitri,
>
> Can you post some experimental evidence that these settings matter?
>
> At this point we have several hundred terabytes of PG databases running
> on ZFS, all of them setting speed records for data warehouses.
>
> We did testing on these settings last year on S10U2, perhaps things have
> changed since then.
>
> - Luke
> [...]
Marc,

You should expect that for the kind of OLAP workload you describe in steps
2 and 3 you will have exactly one CPU working for you in Postgres.

If you want to accelerate the speed of this processing by a factor of 100
or more on this machine, you should try Greenplum DB, which is Postgres
8.2 compatible. Based on the overall setup you describe, you may have a
hybrid installation with GPDB doing the reporting / OLAP workload and the
other Postgres databases handling the customer workloads.

- Luke

On 7/24/07 7:38 AM, "Marc Mamin" <M.Mamin@intershop.de> wrote:
> Hello,
>
> thank you for all your comments and recommendations.
> [...]