Thread: Full statement logging problematic on larger machines?

From:
Frank Joerdens
Date:

Greetings. We've been having trouble with full statement logging since
we moved from an 8-core server with 16 GB of memory to a machine with
double that spec, and I am wondering whether this *should* work, or
whether there is a point on larger machines beyond which logging and
the scheduling of background writes (or something along those lines;
it's only a theory) no longer work together?

The box in question is a Dell PowerEdge R900 with 16 cores and 64 GB
of RAM (16 GB of shared buffers allocated), and a not-so-great

root@db04:~# lspci|grep RAID
19:00.0 RAID bus controller: LSI Logic / Symbios Logic MegaRAID SAS
1078 (rev 04)

controller with 8 10k rpm disks in RAID 1+0 (one big filesystem),
running Ubuntu Hardy with kernel version

root@db04:~# uname -a
Linux db04 2.6.24-22-server #1 SMP Mon Nov 24 20:06:28 UTC 2008 x86_64 GNU/Linux

Logging to the disk array actually stops working well below peak load:
at off-peak times we see around 3k transactions per second, and if we
set log_statement = 'all', the server gets bogged down immediately.
Load, context switches, and above all mean query duration shoot up; the
application slows to a crawl and becomes unusable.

So the idea came up to log to /dev/shm, the RAM disk that Linux mounts
by default, whose maximum size is half the available memory.

This works much better but once we are at about 80% of peak load -
which is around 8000 transactions per second currently - the server goes
into a tailspin in the manner described above and we have to switch off full
logging.

This is a problem because we can't do proper query analysis any more.

How are others faring with full logging on bigger boxes?

Regards,

Frank

From:
Scott Marlowe
Date:

On Wed, Mar 11, 2009 at 1:27 PM, Frank Joerdens <> wrote:
> Greetings. We're having trouble with full logging since we moved from
> an 8-core server with 16 GB memory to a machine with double that
> spec and I am wondering if this *should* be working or if there is a
> point on larger machines where logging and scheduling seeks of
> background writes - or something along those lines; it might be a
> theory - doesn't work together any more?
>
> The box in question is a Dell PowerEdge R900 with 16 cores and 64 GB
> of RAM (16 GB of shared buffers allocated), and a not-so-great
>
> root@db04:~# lspci|grep RAID
> 19:00.0 RAID bus controller: LSI Logic / Symbios Logic MegaRAID SAS
> 1078 (rev 04)
>
> controller with 8 10k rpm disks in RAID 1+0 (one big filesystem),
> running Ubuntu Hardy with kernel version
>
> root@db04:~# uname -a
> Linux db04 2.6.24-22-server #1 SMP Mon Nov 24 20:06:28 UTC 2008 x86_64 GNU/Linux
>
> Logging to the disk array actually stops working much earlier; at
> off-peak time we have around 3k transactions per second and if we set
> log_statement = all, the server gets bogged down immediately: Load,
> context switches, and above all mean query duration shoot up; the
> application slows to a crawl and becomes unusable.
>
> So the idea came up to log to /dev/shm which is a default ram disk on
> Linux with half the available memory as a maximum size.
>
> This works much better but once we are at about 80% of peak load -
> which is around 8000 transactions per second currently - the server goes
> into a tailspin in the manner described above and we have to switch off full
> logging.
>
> This is a problem because we can't do proper query analysis any more.
>
> How are others faring with full logging on bigger boxes?

We have 16 disks in our machine, 2 in a mirror for OS / logs, 2 in a
mirror for pg_xlog, and 12 in a RAID-10 for our main data store.  By
having the logs go to a mirror set that's used for little else, we get
much better throughput than having it be on the same array as our
large transactional store.  We're not handling anywhere near as many
transactions per second as you, though.  But full query logging costs
us almost nothing in terms of performance and load.

Would logging only the slower / slowest running queries help you?

From:
Tom Lane
Date:

Frank Joerdens <> writes:
> Greetings. We're having trouble with full logging since we moved from
> an 8-core server with 16 GB memory to a machine with double that
> spec and I am wondering if this *should* be working or if there is a
> point on larger machines where logging and scheduling seeks of
> background writes - or something along those lines; it might be a
> theory - doesn't work together any more?

You didn't tell us anything interesting about *how* you are logging,
so it's hard to comment on this.  Are you just writing to stderr?
syslog?  using PG's built-in log capture process?  It would be good
to show all of your log-related postgresql.conf settings.

            regards, tom lane

From:
Frank Joerdens
Date:

On Wed, Mar 11, 2009 at 8:46 PM, Tom Lane <> wrote:
> Frank Joerdens <> writes:
>> Greetings. We're having trouble with full logging since we moved from
>> an 8-core server with 16 GB memory to a machine with double that
>> spec and I am wondering if this *should* be working or if there is a
>> point on larger machines where logging and scheduling seeks of
>> background writes - or something along those lines; it might be a
>> theory - doesn't work together any more?
>
> You didn't tell us anything interesting about *how* you are logging,
> so it's hard to comment on this.  Are you just writing to stderr?
> syslog?  using PG's built-in log capture process?  It would be good
> to show all of your log-related postgresql.conf settings.

Here's the complete postgresql.conf (I've whittled it down as much as
I could so it's quite compact):

frank@db04:~$ cat /etc/postgresql/8.2/main/postgresql.conf
data_directory = '/var/lib/postgresql/8.2/main'
hba_file = '/etc/postgresql/8.2/main/pg_hba.conf'
ident_file = '/etc/postgresql/8.2/main/pg_ident.conf'

listen_addresses                = 'localhost,172.16.222.62'
port                            = 5432

max_connections                 = 1000
shared_buffers                  = 16GB
work_mem                        = 200MB
maintenance_work_mem            = 1GB
max_fsm_pages                   = 50000
wal_buffers                     = 8MB
checkpoint_segments             = 16


autovacuum                      = on
stats_start_collector           = on
stats_row_level                 = on


effective_cache_size            = 4GB
default_statistics_target       = 10
constraint_exclusion            = off
checkpoint_warning              = 1h
escape_string_warning           = off

log_duration                    = off
log_min_duration_statement      = 1000
log_statement                   = 'ddl'
log_line_prefix                 = '%m %p %h %u '

archive_command = '/usr/bin/walmgr.py
/var/lib/postgresql/walshipping/master.ini xarchive %p %f'

redirect_stderr                 = on
log_directory                   = '/dev/shm/'
log_rotation_age                = 0
log_rotation_size               = 0

The above is what we're doing right now, only logging queries that run
for over a second, and that is no problem; so the answer to Scott's
question in his reply to my posting is: Yes, logging only the slower
queries does work.

Yesterday I set log_duration = on and log_statement = 'all' at an
off-peak time and left them on for 4 hours while traffic was picking up.
Eventually I had to stop because the server got bogged down.

Regards,

Frank

From:
Guillaume Smet
Date:

On Wed, Mar 11, 2009 at 8:27 PM, Frank Joerdens <> wrote:
> This works much better but once we are at about 80% of peak load -
> which is around 8000 transactions per second currently - the server goes
> into a tailspin in the manner described above and we have to switch off full
> logging.

First, don't use log_duration = on plus log_statement = 'all' to log
all the queries; use log_min_duration_statement = 0 instead, as it's
less verbose.

I don't know if the logging integrated into PostgreSQL can buffer
its output. Andrew? If not, you should try syslog instead and see
whether asynchronous logging with syslog helps (you need to prefix the
log file path with a dash to enable asynchronous logging). You can also
try sending the logs over the network via UDP (or via TCP if you have
an enhanced syslog replacement).
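
To spell out the syslog route (the facility and log file path below are
illustrative assumptions, not settings taken from this thread): in
postgresql.conf you switch log_destination, and in the syslogd
configuration a leading dash on the file path turns off the sync after
every message:

```
# postgresql.conf
log_destination            = 'syslog'
syslog_facility            = 'LOCAL0'
log_min_duration_statement = 0    # log every statement with its duration

# /etc/syslog.conf (classic sysklogd syntax); the leading '-' makes
# writes to this file asynchronous
local0.*        -/var/log/postgresql.log
```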

Another option is to log the duration of every query but not the text.
We used to have this sort of configuration to gather comprehensive
statistics and find the slowest queries on highly loaded servers (it's
not perfect, but it can be an acceptable compromise):
log_duration = on
log_min_duration_statement = 100

--
Guillaume

From:
Tom Lane
Date:

Guillaume Smet <> writes:
> I don't know if the logging integrated into PostgreSQL can bufferize
> its output. Andrew?

It uses fwrite(), and normally sets its output into line-buffered mode.
For a high-throughput case like this it seems like using fully buffered
mode might be an acceptable tradeoff.  You could try changing _IOLBF
to _IOFBF near the head of postmaster/syslogger.c and see if that helps.
(If it does, we could think about exposing some way of tuning this
without modifying the code, but for now you'll have to do that.)

            regards, tom lane

From:
Guillaume Smet
Date:

On Thu, Mar 12, 2009 at 2:05 AM, Andrew Dunstan <> wrote:
> It is buffered at the individual log message level, so that we make sure we
> don't multiplex messages. No more than that.

OK. So if the OP can tolerate multiplexed messages, by using a log
analyzer that supports them, it might be a good idea to try syslog with
full buffering.

--
Guillaume

From:
Frank Joerdens
Date:

On Thu, Mar 12, 2009 at 1:45 AM, Tom Lane <> wrote:
[...]
> You could try changing _IOLBF
> to _IOFBF near the head of postmaster/syslogger.c and see if that helps.

I just put the patched .deb on staging and we'll give it a whirl there
for basic sanity checking - we currently have no way to even
approximate the load that we have on live for testing.

If all goes well I expect we'll put it on live early next week. I'll
let you know how it goes.

Regards,

Frank

From:
Dimitri Fontaine
Date:

On Thursday 12 March 2009 14:38:56 Frank Joerdens wrote:
> I just put the patched .deb on staging and we'll give it a whirl there
> for basic sanity checking - we currently have no way to even
> approximate the load that we have on live for testing.

Is it a capacity problem or a tool suite problem?
If the latter, you could try out tsung:
  http://archives.postgresql.org/pgsql-admin/2008-12/msg00032.php
  http://tsung.erlang-projects.org/

Regards,
--
dim


From:
Laurent Laborde
Date:

On Wed, Mar 11, 2009 at 11:42 PM, Frank Joerdens <> wrote:
>
> effective_cache_size            = 4GB

Only 4GB with 64GB of ram ?

About logging, we have 3 partitions:
- data
- index
- everything else, including logging.

Usually, we log to a remote syslog (a dedicated log server for the
whole server farm).

For profiling (pgfouine), we have a crontab that changes the PostgreSQL
logging configuration for just a few minutes and logs 'all' on the
"everything else" partition.

Around 2000 queries/second/server, no problem.


--
Laurent Laborde
Sysadmin at JFG-Networks / Over-blog

From:
sathiya psql
Date:

for profiling, you can also use the epqa.

http://epqa.sourceforge.net/

On Fri, Mar 13, 2009 at 3:02 AM, Laurent Laborde <> wrote:
> On Wed, Mar 11, 2009 at 11:42 PM, Frank Joerdens <> wrote:
> >
> > effective_cache_size            = 4GB
>
> Only 4GB with 64GB of ram ?
>
> About logging, we have 3 partition :
> - data
> - index
> - everything else, including logging.
>
> Usually, we log on a remote syslog (a dedicated log server for the
> whole server farm).
>
> For profiling (pgfouine), we have a crontab that change the postgresql
> logging configuration for just a few mn.
> and log "all" on the "everything but postgresql" partition.
>
> around 2000 query/seconds/servers, no problem.
>
>
> --
> Laurent Laborde
> Sysadmin at JFG-Networks / Over-blog

--
Sent via pgsql-performance mailing list ()
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-performance

From:
Laurent Laborde
Date:

On Fri, Mar 13, 2009 at 9:28 AM, sathiya psql <> wrote:
> for profiling, you can also use the epqa.
>
> http://epqa.sourceforge.net/

or PGSI: http://bucardo.org/pgsi/
But it requires a syslog date format we don't use here, so I wasn't
able to test it :/

--
F4FQM
Kerunix Flan
Laurent Laborde

From:
Frank Joerdens
Date:

On Thu, Mar 12, 2009 at 9:32 PM, Laurent Laborde <> wrote:
> On Wed, Mar 11, 2009 at 11:42 PM, Frank Joerdens <> wrote:
>>
>> effective_cache_size            = 4GB
>
> Only 4GB with 64GB of ram ?

I'd been overly cautious lately with config changes, as it's been
difficult to argue for downtime and the associated service risk. I've
changed it to 48 GB now, since (as I'd missed) that doesn't require a
restart. Thanks for spotting it!

Regards,

Frank

From:
Frank Joerdens
Date:

On Thu, Mar 12, 2009 at 1:38 PM, Frank Joerdens <> wrote:
> On Thu, Mar 12, 2009 at 1:45 AM, Tom Lane <> wrote:
> [...]
>> You could try changing _IOLBF
>> to _IOFBF near the head of postmaster/syslogger.c and see if that helps.

The patched server is now running on live, and we'll be watching it
over the weekend with

log_duration = off
log_min_duration_statement = 1000
log_statement = 'ddl'

and then run a full logging test early next week if there are no
problems with the above settings.

Can you explain again what extent of message multiplexing I should
expect? What exactly is the substance of the tradeoff? Will the server
occasionally write the same line twice? I don't really follow why ...

And the next problem is that the entire comparison is now obfuscated
and complicated by a release we did on Monday, which has had a strange
effect: quite extreme load average spikes occur frequently but do not
seem to impact query speed - not much anyway, or if they do then in a
random intermittent manner that hasn't become apparent (yet). CPU usage
is actually marginally down, locks are significantly down, and all
other relevant metrics, like context switches and the memory usage
profile, are basically unchanged. Now, it *seems* that the extra load
is caused by idle (sic!) backends (*not* even idle in transaction)
consuming significant CPU when you look at htop. I don't have a theory
about that right now. We use pgbouncer as a connection pooler. What
could make idle backends load the server substantially?

Regards,

Frank

From:
Frank Joerdens
Date:

On Fri, Mar 20, 2009 at 8:21 PM, Andrew Dunstan <> wrote:
>
>
> Frank Joerdens wrote:
>>
>> On Thu, Mar 12, 2009 at 1:38 PM, Frank Joerdens <> wrote:
>>
>>>
>>> On Thu, Mar 12, 2009 at 1:45 AM, Tom Lane <> wrote:
>>> [...]
>>>
>>>>
>>>> You could try changing _IOLBF
>>>> to _IOFBF near the head of postmaster/syslogger.c and see if that helps.
[...]
>> Can you explain again what the extent of multiplexed messages I'll
>> have to expect is? What exactly is the substance of the tradeoff?
>> Occasionally the server will write the same line twice? Don't really
>> follow why ...
[...]
> I don't believe changing this will result in any multiplexing. The
> multiplexing problem was solved in 8.3 by the introduction of the chunking
> protocol between backends and the syslogger, and changing the output
> buffering of the syslogger should not cause a multiplexing problem, since
> it's a single process.

Hum, we're still on 8.2 - the last attempt to upgrade, before
Christmas, was unsuccessful; we had to roll back due to
not-fully-understood performance issues. We think we've nailed the root
cause(s) though, and will make another, better-planned effort to
upgrade before March is out.

Oh well, maybe this all means we shouldn't try to get this running on
8.2 and just tackle the issue again after the upgrade ...

Cheers,

Frank

From:
Frank Joerdens
Date:

On Fri, Mar 20, 2009 at 8:47 PM, Frank Joerdens <> wrote:
> On Thu, Mar 12, 2009 at 1:38 PM, Frank Joerdens <> wrote:
>> On Thu, Mar 12, 2009 at 1:45 AM, Tom Lane <> wrote:
>> [...]
>>> You could try changing _IOLBF
>>> to _IOFBF near the head of postmaster/syslogger.c and see if that helps.
>
> The patched server is now running on live, and we'll be watching it
> over the weekend with
>
> log_duration = off
> log_min_duration_statement = 1000
> log_statement = 'ddl'
>
> and then run a full logging test early next week if there are no
> problems with the above settings.

Reporting back on this at last (hitherto, all our experimenting had
appeared inconclusive): the patched 8.2 server did not appear to make
any difference; it still didn't work, and performance was affected in
the same way as before.

However in the meantime we managed to go to 8.3 and now it does work *if*

synchronous_commit = off
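
Spelled out in postgresql.conf terms (the wal_writer_delay line is just
the 8.3 default, shown for context):

```
synchronous_commit = off    # COMMIT returns before its WAL record is
                            # flushed to disk; a crash can lose the most
                            # recent commits but cannot corrupt the data
#wal_writer_delay = 200ms   # how often the WAL writer flushes WAL in
                            # the background
```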

And now I am wondering what that means exactly. Does it necessarily
follow that the problem is I/O contention on the disk subsystem?
Delayed flushing of the WAL - which is what asynchronous commit does -
would give the server just the window it needs to insert the log line
into the disk controller's write cache, whereas with synchronous commit
the transaction commit's WAL write and the log line write would be
effectively simultaneous. Does it follow that if I now put pg_xlog on a
separate spindle and/or controller, full logging should work?

Somehow I think not, as the disk array isn't even near maxed out,
according to vmstat. Or is the disk cache just too small then?

Frank