PostgreSQL 8.3beta1 on Solaris testing case study - Mailing list pgsql-performance

From Jignesh K. Shah
Subject PostgreSQL 8.3beta1 on Solaris testing case study
Date
Msg-id 4720A288.8090605@sun.com
Whole thread Raw
List pgsql-performance
I thought I will update this to the Performance alias too about our
testing with PG8.3beta1 on Solaris.

Regards,
Jignesh

__Background_:_
We were using PostgreSQL 8.3beta1 testing on our latest Sun SPARC
Enterprise T5220 Server using Solaris 10 8/07. Generally for performance
benefits in Solaris  we put file systems on forcedirectio we bypass the
filesystem cache and go direct to disks.

__Problem_:_
What we were observing that there were lots of reads happening  about
4MB/sec on the file system holding $PGDATA and the database tables
during an OLTP Benchmark run. Initially we thought that our bufferpools
were not big enough. But thanks to 64-bit builds  we could use bigger
bufferpools. However even with extraordinary bufferpool sizes we still
saw lots of reads going to the disks.

__DTrace to the Rescue_:_

I modified iosnoop.d to just snoop on reads. The modified rsnoop.d is as
follows:
 $ cat rsnoop.d
#!/usr/sbin/dtrace -s
syscall::read:entry
/execname=="postgres"/
{
        printf("pid %d reading  %s\n", pid, fds[arg0].fi_pathname);
}

Based on it I found that most postgresql  processes were doing lots of
reads from pg_clog directory.
CLOG or commit logs keep track of transactions in flight. Writes of CLOG
comes from recording of transaction commits( or when it aborts) or when
an XLOG is generated. However though I am not clear on reads yet, it
seems every process constantly reads it to get some status. CLOG data is
not cached in any PostgreSQL shared memory segments and hence becomes
the bottleneck as it has to constantly go to the filesystem to get the
read data.


__Workaround for the high reads on CLOG on Solaris_ :
_Start with the cluster $PGDATA on regular UFS (which is buffered and
logging is enabled). Always create a new tablespace for your database on
forcedirectio mounted file system which bypasses the file system cache.
This allows all PostgreSQL CLOG files to be cached in UFS greatly
reducing stress on the underlying storage. For writes to the best of my
knowledge, PostgreSQL will still do fsync to force the writes the CLOGs
onto the disks so it is consistent. But the reads are spared from going
to the disks and returned from the cache.

__Result_:_
With rightly sized bufferpool now all database data can be in PostgreSQL
cache and hence reads are spared from the tablespaces. As for PGDATA
data, UFS will do the caching of CLOG files, etc and hence sparring
reads from going to the disks again. In the end what we achieve is a
right sized bufferpool where there are no reads required during a high
OLTP environment and the disks are just busy doing the writes of updates
and inserts.



pgsql-performance by date:

Previous
From: "Jignesh K. Shah"
Date:
Subject: 8.3beta1 testing on Solaris
Next
From: Tom Lane
Date:
Subject: Re: [HACKERS] 8.3beta1 testing on Solaris