Thread: Full statement logging problematic on larger machines?
Greetings. We're having trouble with full statement logging since we moved from an 8-core server with 16 GB of memory to a machine with double that spec, and I am wondering if this *should* be working, or if there is a point on larger machines where logging and the seek scheduling of background writes - or something along those lines; it's only a theory - no longer work together?

The box in question is a Dell PowerEdge R900 with 16 cores and 64 GB of RAM (16 GB of shared buffers allocated), and a not-so-great

root@db04:~# lspci | grep RAID
19:00.0 RAID bus controller: LSI Logic / Symbios Logic MegaRAID SAS 1078 (rev 04)

controller with 8 10k rpm disks in RAID 1+0 (one big filesystem), running Ubuntu Hardy with kernel version

root@db04:~# uname -a
Linux db04 2.6.24-22-server #1 SMP Mon Nov 24 20:06:28 UTC 2008 x86_64 GNU/Linux

Logging to the disk array actually stops working much earlier: at off-peak times we have around 3k transactions per second, and if we set log_statement = 'all', the server gets bogged down immediately. Load, context switches, and above all mean query duration shoot up; the application slows to a crawl and becomes unusable.

So the idea came up to log to /dev/shm, the default RAM disk on Linux, whose maximum size is half the available memory.

This works much better, but once we are at about 80% of peak load - which is around 8,000 transactions per second currently - the server goes into a tailspin in the manner described above and we have to switch off full logging.

This is a problem because we can't do proper query analysis any more.

How are others faring with full logging on bigger boxes?

Regards,

Frank
On Wed, Mar 11, 2009 at 1:27 PM, Frank Joerdens <frank@joerdens.de> wrote:
> Greetings. We're having trouble with full logging since we moved from an 8-core server with 16 GB memory to a machine with double that spec and I am wondering if this *should* be working or if there is a point on larger machines where logging and scheduling seeks of background writes - or something along those lines; it might be a theory - doesn't work together any more?
>
> The box in question is a Dell PowerEdge R900 with 16 cores and 64 GB of RAM (16 GB of shared buffers allocated), and a not-so-great
>
> root@db04:~# lspci|grep RAID
> 19:00.0 RAID bus controller: LSI Logic / Symbios Logic MegaRAID SAS 1078 (rev 04)
>
> controller with 8 10k rpm disks in RAID 1+0 (one big filesystem), running Ubuntu Hardy with kernel version
>
> root@db04:~# uname -a
> Linux db04 2.6.24-22-server #1 SMP Mon Nov 24 20:06:28 UTC 2008 x86_64 GNU/Linux
>
> Logging to the disk array actually stops working much earlier; at off-peak time we have around 3k transactions per second and if we set log_statement = all, the server gets bogged down immediately: Load, context switches, and above all mean query duration shoot up; the application slows to a crawl and becomes unusable.
>
> So the idea came up to log to /dev/shm which is a default ram disk on Linux with half the available memory as a maximum size.
>
> This works much better but once we are at about 80% of peak load - which is around 8000 transactions per second currently - the server goes into a tailspin in the manner described above and we have to switch off full logging.
>
> This is a problem because we can't do proper query analysis any more.
>
> How are others faring with full logging on bigger boxes?

We have 16 disks in our machine: 2 in a mirror for OS / logs, 2 in a mirror for pg_xlog, and 12 in a RAID-10 for our main data store. By having the logs go to a mirror set that's used for little else, we get much better throughput than having them on the same array as our large transactional store.

We're not handling anywhere near as many transactions per second as you, though. But full query logging costs us almost nothing in terms of performance and load.

Would logging only the slower / slowest running queries help you?
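A rough sketch of what that separation looks like in postgresql.conf terms; the mount points below are invented for illustration, and relocating pg_xlog on 8.2 is normally done by symlinking the directory onto the other array rather than through any configuration setting:

    # server log goes to the OS/log mirror instead of the data array
    redirect_stderr = on
    log_directory = '/mnt/oslogs/postgresql'    # hypothetical mount on the log mirror

    # pg_xlog lives on its own mirror via a symlink, e.g.:
    #   mv $PGDATA/pg_xlog /mnt/xlog/pg_xlog
    #   ln -s /mnt/xlog/pg_xlog $PGDATA/pg_xlog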
Frank Joerdens <frank@joerdens.de> writes:
> Greetings. We're having trouble with full logging since we moved from an 8-core server with 16 GB memory to a machine with double that spec and I am wondering if this *should* be working or if there is a point on larger machines where logging and scheduling seeks of background writes - or something along those lines; it might be a theory - doesn't work together any more?

You didn't tell us anything interesting about *how* you are logging, so it's hard to comment on this. Are you just writing to stderr? syslog? Using PG's built-in log capture process? It would be good to show all of your log-related postgresql.conf settings.

			regards, tom lane
On Wed, Mar 11, 2009 at 8:46 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Frank Joerdens <frank@joerdens.de> writes:
>> Greetings. We're having trouble with full logging since we moved from an 8-core server with 16 GB memory to a machine with double that spec and I am wondering if this *should* be working or if there is a point on larger machines where logging and scheduling seeks of background writes - or something along those lines; it might be a theory - doesn't work together any more?
>
> You didn't tell us anything interesting about *how* you are logging, so it's hard to comment on this. Are you just writing to stderr? syslog? using PG's built-in log capture process? It would be good to show all of your log-related postgresql.conf settings.

Here's the complete postgresql.conf (I've whittled it down as much as I could so it's quite compact):

frank@db04:~$ cat /etc/postgresql/8.2/main/postgresql.conf
data_directory = '/var/lib/postgresql/8.2/main'
hba_file = '/etc/postgresql/8.2/main/pg_hba.conf'
ident_file = '/etc/postgresql/8.2/main/pg_ident.conf'
listen_addresses = 'localhost,172.16.222.62'
port = 5432
max_connections = 1000
shared_buffers = 16GB
work_mem = 200MB
maintenance_work_mem = 1GB
max_fsm_pages = 50000
wal_buffers = 8MB
checkpoint_segments = 16
autovacuum = on
stats_start_collector = on
stats_row_level = on
effective_cache_size = 4GB
default_statistics_target = 10
constraint_exclusion = off
checkpoint_warning = 1h
escape_string_warning = off
log_duration = off
log_min_duration_statement = 1000
log_statement = 'ddl'
log_line_prefix = '%m %p %h %u '
archive_command = '/usr/bin/walmgr.py /var/lib/postgresql/walshipping/master.ini xarchive %p %f'
redirect_stderr = on
log_directory = '/dev/shm/'
log_rotation_age = 0
log_rotation_size = 0

The above is what we're doing right now: only logging queries that run for over a second, and that is no problem. So the answer to Scott's question in his reply to my posting is: yes, logging only the slower queries does work.

Yesterday I changed log_duration = on and log_statement = 'all' at off-peak time and left it on for 4 hours while traffic was picking up. Eventually I had to stop it because the server got bogged down.

Regards,

Frank
On Wed, Mar 11, 2009 at 8:27 PM, Frank Joerdens <frank@joerdens.de> wrote:
> This works much better but once we are at about 80% of peak load - which is around 8000 transactions per second currently - the server goes into a tailspin in the manner described above and we have to switch off full logging.

First, don't use log_duration = on + log_statement = 'all' to log all the queries; use log_min_duration_statement = 0, it's less verbose.

I don't know if the logging integrated into PostgreSQL can bufferize its output. Andrew? If not, you should try syslog instead and see if asynchronous logging with syslog helps (you need to prefix the path with a dash to enable asynchronous logging). You can also try to send the logs over the network via UDP (and also TCP if you have an enhanced syslog-like daemon).

Another option is to log the duration of every query but not the text. We used to have this sort of configuration to gather comprehensive statistics and slowest queries on highly loaded servers (it's not perfect, but it can be an acceptable compromise):

log_duration = on
log_min_duration_statement = 100

--
Guillaume
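A minimal sketch of the approaches Guillaume mentions; the syslog facility name and file path are assumptions, and the leading dash applies to a classic sysklogd-style /etc/syslog.conf:

    # postgresql.conf: full logging, less verbose than log_duration + log_statement = 'all'
    log_min_duration_statement = 0

    # or the compromise: durations for every query, full text only above 100 ms
    log_duration = on
    log_min_duration_statement = 100

    # routing the log through syslog instead of the built-in collector
    log_destination = 'syslog'
    syslog_facility = 'LOCAL0'

    # /etc/syslog.conf: the leading dash disables the per-line sync (asynchronous writes)
    local0.*    -/var/log/postgresql.log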
Guillaume Smet <guillaume.smet@gmail.com> writes:
> I don't know if the logging integrated into PostgreSQL can bufferize its output. Andrew?

It uses fwrite(), and normally sets its output into line-buffered mode. For a high-throughput case like this it seems like using fully buffered mode might be an acceptable tradeoff. You could try changing _IOLBF to _IOFBF near the head of postmaster/syslogger.c and see if that helps. (If it does, we could think about exposing some way of tuning this without modifying the code, but for now you'll have to do that.)

			regards, tom lane
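For anyone wanting to try this, a sketch of the kind of one-line change Tom describes; it assumes the buffering mode is picked by a macro near the top of src/backend/postmaster/syslogger.c and later handed to setvbuf() - the exact names and surrounding code may differ between versions:

    /* syslogger.c (sketch): switch the collector's log file from line- to full buffering */
    #ifdef WIN32
    #define LBF_MODE    _IONBF          /* Windows has no line-buffered mode */
    #else
    #define LBF_MODE    _IOFBF          /* was _IOLBF */
    #endif

    /* ... later, the mode is applied to the collector's output file ... */
    setvbuf(syslogFile, NULL, LBF_MODE, 0);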
On Thu, Mar 12, 2009 at 2:05 AM, Andrew Dunstan <andrew@dunslane.net> wrote:
> It is buffered at the individual log message level, so that we make sure we don't multiplex messages. No more than that.

OK. So if the OP can afford multiplexed queries by using a log analyzer that supports them, it might be a good idea to try syslog with full buffering.

--
Guillaume
On Thu, Mar 12, 2009 at 1:45 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
[...]
> You could try changing _IOLBF to _IOFBF near the head of postmaster/syslogger.c and see if that helps.

I just put the patched .deb on staging and we'll give it a whirl there for basic sanity checking - we currently have no way to even approximate the load that we have on live for testing. If all goes well I expect we'll put it on live early next week.

I'll let you know how it goes.

Regards,

Frank
On Thursday 12 March 2009 14:38:56 Frank Joerdens wrote:
> I just put the patched .deb on staging and we'll give it a whirl there for basic sanity checking - we currently have no way to even approximate the load that we have on live for testing.

Is it a capacity problem or a tool suite problem? If the latter, you could try out tsung:

http://archives.postgresql.org/pgsql-admin/2008-12/msg00032.php
http://tsung.erlang-projects.org/

Regards,

--
dim
On Wed, Mar 11, 2009 at 11:42 PM, Frank Joerdens <frank@joerdens.de> wrote:
>
> effective_cache_size = 4GB

Only 4GB with 64GB of RAM?

About logging, we have 3 partitions:
- data
- index
- everything else, including logging.

Usually, we log to a remote syslog (a dedicated log server for the whole server farm).

For profiling (pgfouine), we have a crontab that changes the postgresql logging configuration for just a few minutes and logs 'all' on the "everything but postgresql" partition. At around 2,000 queries/second/server, no problem.

--
Laurent Laborde
Sysadmin at JFG-Networks / Over-blog
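A hypothetical sketch of the kind of cron-driven toggle Laurent describes; the paths, the include-file trick, and the pgfouine invocation are assumptions for illustration, not his actual setup:

    # crontab: full logging for five minutes each night, then analyze with pgfouine
    # 0 4 * * *  /usr/local/bin/pg-full-logging.sh

    #!/bin/sh
    # pg-full-logging.sh - assumes postgresql.conf contains: include 'logging_override.conf'
    OVERRIDE=/etc/postgresql/8.2/main/logging_override.conf

    echo "log_min_duration_statement = 0" > $OVERRIDE
    pg_ctlcluster 8.2 main reload          # pick up the new setting without a restart
    sleep 300
    echo "log_min_duration_statement = 1000" > $OVERRIDE
    pg_ctlcluster 8.2 main reload

    # then run pgfouine over the captured window, e.g.:
    # pgfouine.php -file /path/to/captured.log > /var/www/slow-query-report.html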
For profiling, you can also use epqa.
http://epqa.sourceforge.net/
On Fri, Mar 13, 2009 at 9:28 AM, sathiya psql <sathiya.psql@gmail.com> wrote:
> for profiling, you can also use the epqa.
>
> http://epqa.sourceforge.net/

Or PGSI: http://bucardo.org/pgsi/

But it requires a syslog date format we don't use here, so I wasn't able to test it :/

--
F4FQM
Kerunix Flan
Laurent Laborde
On Thu, Mar 12, 2009 at 9:32 PM, Laurent Laborde <kerdezixe@gmail.com> wrote:
> On Wed, Mar 11, 2009 at 11:42 PM, Frank Joerdens <frank@joerdens.de> wrote:
>>
>> effective_cache_size = 4GB
>
> Only 4GB with 64GB of ram ?

I'd been overly cautious lately with config changes, as it's been difficult to argue for downtime and the associated service risk. I've changed it to 48 GB now, since it doesn't require a restart - which I'd missed. Thanks for spotting!

Regards,

Frank
On Thu, Mar 12, 2009 at 1:38 PM, Frank Joerdens <frank@joerdens.de> wrote:
> On Thu, Mar 12, 2009 at 1:45 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> [...]
>> You could try changing _IOLBF to _IOFBF near the head of postmaster/syslogger.c and see if that helps.

The patched server is now running on live, and we'll be watching it over the weekend with

log_duration = off
log_min_duration_statement = 1000
log_statement = 'ddl'

and then run a full logging test early next week if there are no problems with the above settings.

Can you explain again what the extent of multiplexed messages I'll have to expect is? What exactly is the substance of the tradeoff? Occasionally the server will write the same line twice? I don't really follow why ...

And the next problem is that, unfortunately, the entire comparison is now obfuscated and complicated by a release we did on Monday, which has had a strange effect: quite extreme load average spikes occurring frequently that do not seem to impact query speed - not much anyway, or if they do, then in a random, intermittent manner that hasn't become apparent (yet). CPU usage is actually marginally down, locks are significantly down, and all other relevant metrics are basically unchanged, like context switches and memory usage profile.

Now, it *seems* that the extra load is caused by idle (sic!) backends (*not* idle in transaction even) consuming significant CPU when you look at htop. I don't have a theory as to that right now. We use pgbouncer as a connection pooler. What could make idle backends load the server substantially?

Regards,

Frank
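For anyone wanting to check the same thing from the database side, a quick look at pg_stat_activity shows how many pooled connections are sitting idle (on 8.2 the query text column is current_query, and idle backends show up as '<IDLE>'); the queries are just an illustration, not something from the thread:

    -- how many backends are in each state
    SELECT current_query, count(*)
      FROM pg_stat_activity
     GROUP BY current_query
     ORDER BY count(*) DESC;

    -- just the idle ones held open by the pooler
    SELECT count(*) FROM pg_stat_activity WHERE current_query = '<IDLE>';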
On Fri, Mar 20, 2009 at 8:21 PM, Andrew Dunstan <andrew@dunslane.net> wrote:
> Frank Joerdens wrote:
>> On Thu, Mar 12, 2009 at 1:38 PM, Frank Joerdens <frank@joerdens.de> wrote:
>>> On Thu, Mar 12, 2009 at 1:45 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> [...]
>>>> You could try changing _IOLBF to _IOFBF near the head of postmaster/syslogger.c and see if that helps.
[...]
>> Can you explain again what the extent of multiplexed messages I'll have to expect is? What exactly is the substance of the tradeoff? Occasionally the server will write the same line twice? Don't really follow why ...
[...]
> I don't believe changing this will result in any multiplexing. The multiplexing problem was solved in 8.3 by the introduction of the chunking protocol between backends and the syslogger, and changing the output buffering of the syslogger should not cause a multiplexing problem, since it's a single process.

Hum, we're still on 8.2 - the last attempt to upgrade before Christmas was unsuccessful; we had to roll back due to not fully understood performance issues. We think we nailed the root cause(s) though, and will make another, better planned effort to upgrade before March is out.

Oh well, maybe this all means we shouldn't try to get this running on 8.2 and should just tackle the issue again after the upgrade ...

Cheers,

Frank
On Fri, Mar 20, 2009 at 8:47 PM, Frank Joerdens <frank@joerdens.de> wrote:
> On Thu, Mar 12, 2009 at 1:38 PM, Frank Joerdens <frank@joerdens.de> wrote:
>> On Thu, Mar 12, 2009 at 1:45 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> [...]
>>> You could try changing _IOLBF to _IOFBF near the head of postmaster/syslogger.c and see if that helps.
>
> The patched server is now running on live, and we'll be watching it over the weekend with
>
> log_duration = off
> log_min_duration_statement = 1000
> log_statement = 'ddl'
>
> and then run a full logging test early next week if there are no problems with the above settings.

Reporting back on this eventually (hitherto, all our experimenting appeared inconclusive): the patched 8.2 server did not appear to make any difference; it still didn't work, and performance was affected in the same way as before.

However, in the meantime we managed to go to 8.3, and now it does work *if*

synchronous_commit = off

And now I am wondering what that means exactly. Does it necessarily follow that it's I/O contention on the disk subsystem, because the delayed flushing to WAL - which is what asynchronous commit does - gives the server just the window to insert the log line into the disk controller's write cache, whereas with synchronous commit the transaction commit's write and the log line write would be simultaneous?

Does it follow that if I now put pg_xlog on a separate spindle and/or controller, it should work? Somehow I think not, as the disk array isn't even near maxed out, according to vmstat. Or is the disk cache just too small then?

Frank
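For completeness, the 8.3 setting that made full logging workable here; it can also be enabled per session or per transaction rather than globally, so critical transactions can keep synchronous commits (a sketch, not Frank's actual configuration):

    # postgresql.conf (8.3): report commit success before the WAL record is flushed to disk
    synchronous_commit = off
    wal_writer_delay = 200ms      # how often the background WAL writer flushes asynchronously

    -- or selectively, from a session:
    SET synchronous_commit = off;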