Thread: PostgreSQL on Solaris 8 and ufs
We've recently moved our pgsql installation and DBs to a Solaris 8 machine with a striped and mirrored ufs filesystem that houses the DB data. We are now seeing terrible performance, and the bottleneck is no doubt disk I/O.

We've tried modifying a few tunables related to ufs, but it doesn't seem to be helping.

Is there anything we should be looking at that is specifically related to ufs filesystems on Solaris 8, or possibly something in general that would improve performance?

Thanks.

-- Brandon
Brandon Metcalf wrote:

> We've recently moved our pgsql installation and DBs to a Solaris 8
> machine with a striped and mirrored ufs filesystem that houses the DB
> data. We are now seeing terrible performance, and the bottleneck is no
> doubt disk I/O.
>
> We've tried modifying a few tunables related to ufs, but it doesn't
> seem to be helping.
>
> Is there anything we should be looking at that is specifically related
> to ufs filesystems on Solaris 8, or possibly something in general that
> would improve performance?

Well, Solaris 8 is a bit old now, so I don't remember all the details. But, if memory serves, Solaris 8 still has some "high water" and "low water" tunables related to the amount of IO that can be outstanding to a single file. Try setting

  set ufs:ufs_WRITES=0

in /etc/system and rebooting, which basically says "any amount of disk IO can be outstanding". There's a tunables doc on docs.sun.com that explains this option.

Also, logging UFS might help with some of the metadata requirements of UFS as well. So, use "mount -o logging" or add the relevant entry in /etc/vfstab.

Of course, the best thing is Solaris 9 or 10, which would be much better for this sort of thing.

Hope this helps.

-- Alan
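A minimal sketch of the two changes Alan describes; the metadevice d10 and the /db mount point are hypothetical placeholders, so substitute your actual layout:

  # /etc/system -- allow any amount of outstanding disk IO per file;
  # takes effect after a reboot
  set ufs:ufs_WRITES=0

  # /etc/vfstab -- mount the DB filesystem with UFS logging
  # (device names and mount point are examples only)
  /dev/md/dsk/d10  /dev/md/rdsk/d10  /db  ufs  2  yes  logging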
On Tue, 2005-03-22 at 14:44 -0600, Brandon Metcalf wrote:

> We've recently moved our pgsql installation and DBs to a Solaris 8
> machine with a striped and mirrored ufs filesystem that houses the DB
> data. We are now seeing terrible performance, and the bottleneck is no
> doubt disk I/O.
>
> We've tried modifying a few tunables related to ufs, but it doesn't
> seem to be helping.
>
> Is there anything we should be looking at that is specifically related
> to ufs filesystems on Solaris 8, or possibly something in general that
> would improve performance?
>
> Thanks.

What are you using to create your raid? You say it is "no doubt disk I/O" - does iostat confirm this? A lot of performance issues are related to the size of the stripe you chose for the striped portion of the array, the actual array configuration, etc. I am assuming you have looked at system variables such as autoup and the like? What tweaks have you done?

Also, are your pg_xlog and data directories separated onto separate volumes? Doing so will help immensely.

What are you using to measure performance?

Sven
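A quick way to run the iostat check Sven suggests:

  # extended device statistics at 5-second intervals; sustained high %b
  # and deep wait/actv queues on the DB disks indicate an I/O bottleneck
  iostat -xn 5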
s == stange@rentec.com writes:

 s> Try setting
 s>
 s>   set ufs:ufs_WRITES=0
 s>
 s> in /etc/system and rebooting, which basically says "any amount of disk
 s> IO can be outstanding". There's a tunables doc on docs.sun.com that
 s> explains this option.

 s> Also, logging UFS might help with some of the metadata requirements of
 s> UFS as well. So, use "mount -o logging" or add the relevant entry in
 s> /etc/vfstab.

OK. I'll try these out. We do have ufs_WRITES enabled with the following parameters:

  set ncsize = 257024
  set autoup = 90
  set bufhwm = 15000
  set tune_t_fsflushr = 15
  set ufs:ufs_HW = 16777216
  set ufs:ufs_LW = 8388608

-- Brandon
s == sven@dmv.com writes:

 s> What are you using to create your raid?

Hm. I didn't set this up. I'll have to check.

 s> You say it is "no doubt disk I/O" - does iostat confirm this? A lot
 s> of performance issues are related to the size of the stripe you chose
 s> for the striped portion of the array, the actual array configuration,
 s> etc. I am assuming you have looked at system variables such as autoup
 s> and the like? What tweaks have you done?

I've mainly been using Glance, which shows a lot of queued requests for the disks in question. Here's currently what we have in /etc/system related to ufs:

  set ncsize = 257024
  set autoup = 90
  set bufhwm = 15000
  set tune_t_fsflushr = 15
  set ufs:ufs_HW = 16777216
  set ufs:ufs_LW = 8388608

 s> Also, are your pg_xlog and data directories separated onto separate
 s> volumes? Doing so will help immensely.

No, they are on the same volume.

 s> What are you using to measure performance?

Nothing too scientific other than the fact that since we have moved the DB, we consistently see a large number of postmaster processes (close to 100) where before we did not.

-- Brandon
Brandon Metcalf wrote:

> We've recently moved our pgsql installation and DBs to a Solaris 8
> machine with a striped and mirrored ufs filesystem that houses the DB
> data. We are now seeing terrible performance, and the bottleneck is no
> doubt disk I/O.
>
> We've tried modifying a few tunables related to ufs, but it doesn't
> seem to be helping.
>
> Is there anything we should be looking at that is specifically related
> to ufs filesystems on Solaris 8, or possibly something in general that
> would improve performance?
>
> Thanks.

I found that mounting the filesystem that contains the PGDATA directory (probably only the pg_xlog directory, in fact) without logging improved things a great deal (assuming you are using logging, that is...).

In addition, changing the postgresql.conf parameter wal_sync_method from the default of open_datasync to fdatasync improved things a bit more. However, I seem to recall a posting suggesting the opposite! ...so feel free to experiment and let us know!

Mark

P.S.: original tests on Solaris 8, http://archives.postgresql.org/pgsql-performance/2003-12/msg00165.php
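The change Mark mentions is a one-line edit; fdatasync is what worked best in his tests, but the fastest method varies by platform, so benchmark before committing to it:

  # postgresql.conf -- how WAL writes are flushed to disk; the Solaris
  # default at the time was open_datasync
  wal_sync_method = fdatasync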
On Tue, Mar 22, 2005 at 03:23:18PM -0600, Brandon Metcalf wrote:

> s> What are you using to measure performance?
>
> Nothing too scientific other than the fact that since we have moved
> the DB, we consistently see a large number of postmaster processes
> (close to 100) where before we did not.

What did you move from? The Solaris ps (not in ucb, which is the BSD-style ps) shows the parent process name, so everything shows up as "postmaster" rather than "postgres". There's always one back end per connection.

If you are in fact using more connections, by the way, I can tell you that Solaris 8, in my experience, is _very bad_ at managing context switches. So you may not be merely I/O bound (although your other reports seem to indicate that you are).

A

--
Andrew Sullivan | ajs@crankycanuck.ca
The whole tendency of modern prose is away from concreteness.
    --George Orwell
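A sketch of the distinction Andrew draws (the count includes the master process itself, and the ucb variant requires the BSD compatibility package):

  # SysV ps labels every backend with the parent's name, "postmaster"
  ps -ef | grep postmaster | wc -l

  # the BSD-style ps shows each backend's own status line instead
  /usr/ucb/ps auxww | grep postgres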
a == ajs@crankycanuck.ca writes:

 a> On Tue, Mar 22, 2005 at 03:23:18PM -0600, Brandon Metcalf wrote:
 a> > s> What are you using to measure performance?
 a> >
 a> > Nothing too scientific other than the fact that since we have moved
 a> > the DB, we consistently see a large number of postmaster processes
 a> > (close to 100) where before we did not.

 a> What did you move from? The Solaris ps (not in ucb, which is the
 a> BSD-style ps) shows the parent process name, so everything shows up
 a> as "postmaster" rather than "postgres". There's always one back end
 a> per connection.

 a> If you are in fact using more connections, by the way, I can tell you
 a> that Solaris 8, in my experience, is _very bad_ at managing context
 a> switches. So you may not be merely I/O bound (although your other
 a> reports seem to indicate that you are).

We moved from an HP-UX 10.20 box where the pgsql installation and data were on a vxfs filesystem.

And we're definitely seeing more connections at a time, which indicates that each process is taking longer to complete.

-- Brandon
On the context switching issue, we've found that this setting in /etc/system helps:

  set rechoose_interval=30

This sets the minimum time that a process is eligible to be switched to another CPU (the default is 3). You can monitor context switching with the cs column in vmstat.

We've found that high context switching seems to be more a symptom than a cause of problems -- for example, we had an issue with column statistics and some really bad queries, and the CPUs started context switching like crazy (20,000 - 50,000 or more in a 5-second period, normally < 5,000 per 5-second period under heavy load).

Brandon Metcalf wrote:

> a == ajs@crankycanuck.ca writes:
>
>  a> On Tue, Mar 22, 2005 at 03:23:18PM -0600, Brandon Metcalf wrote:
>  a> > s> What are you using to measure performance?
>  a> >
>  a> > Nothing too scientific other than the fact that since we have moved
>  a> > the DB, we consistently see a large number of postmaster processes
>  a> > (close to 100) where before we did not.
>
>  a> What did you move from? The Solaris ps (not in ucb, which is the
>  a> BSD-style ps) shows the parent process name, so everything shows up
>  a> as "postmaster" rather than "postgres". There's always one back end
>  a> per connection.
>
>  a> If you are in fact using more connections, by the way, I can tell you
>  a> that Solaris 8, in my experience, is _very bad_ at managing context
>  a> switches. So you may not be merely I/O bound (although your other
>  a> reports seem to indicate that you are).
>
> We moved from an HP-UX 10.20 box where the pgsql installation and data
> were on a vxfs filesystem.
>
> And we're definitely seeing more connections at a time, which indicates
> that each process is taking longer to complete.
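To watch the numbers Tom quotes:

  # report every 5 seconds and watch the cs column; by Tom's figures,
  # under 5,000 per interval is normal, 20,000+ signals trouble
  vmstat 5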
On Wed, Mar 23, 2005 at 11:16:29AM -0600, Brandon Metcalf wrote:

> We moved from an HP-UX 10.20 box where the pgsql installation and data
> were on a vxfs filesystem.

My best guess, then, is that ufs tuning really is your issue. We always used vxfs for our Sun database servers (which was a nightmare all on its own, BTW, so I don't actually recommend this), so I don't have any real ufs tuning advice.

The Packer Solaris database book (Packer, Allan N., _Configuring & Tuning Databases on the Solaris Platform_. Palo Alto: Sun Microsystems Press, 2002. ISBN 0-13-083417-3) does suggest mounting the filesystems with forcedirectio; I dimly recall using this for the wal partition on one test box, and STR that it helped.

Also, you want to make sure you use the right fsync method; if it's still set to "fsync" in the config file, you'll want to change that. I remember finding that fsync was something like 3 times slower than everything else. I don't have any more Solaris boxes to check, but I believe we were using open_datasync as our method. You'll want to run some tests.

You also should enable priority paging, but expect that this will give you really strange po numbers from vmstat and friends. Priority paging, I found, makes things look like you're swapping when you aren't.

Procmem is useful, but if you really want the goods on what's going on, you need the SE toolkit. Just be careful using it as root -- in some cases it'll modify kernel parameters behind the scenes. In my case, I didn't have superuser access, so there wasn't a danger; but I've heard sysadmins complain about this.

A

--
Andrew Sullivan | ajs@crankycanuck.ca
This work was visionary and imaginative, and goes to show that visionary and imaginative work need not end up well.
    --Dennis Ritchie
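A sketch of the forcedirectio idea for a WAL-only filesystem; device names and mount point here are hypothetical, and (as Josh notes below) it should not be used on the main data filesystem, which still benefits from the OS cache:

  # /etc/vfstab -- dedicated pg_xlog partition, bypassing the
  # filesystem cache for WAL writes
  /dev/md/dsk/d20  /dev/md/rdsk/d20  /pgxlog  ufs  2  yes  forcedirectio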
On Wed, Mar 23, 2005 at 09:32:07AM -0800, Tom Arthurs wrote:

> found that high context switching seems to be more a symptom,

Yes, that's a good point. It usually _is_ a symptom; but it might be a symptom that you've got an expensive query, and Solaris's foot-gun approach to handling such cases is a little painful.

(We didn't give up on Solaris because of cs problems, BTW; but I have to say that AIX seems to be a little less prone to self-DOS on this front than Solaris was. If you really want to hear me rant, ask me some time about ecache and Sun support.)

A

--
Andrew Sullivan | ajs@crankycanuck.ca
The plural of anecdote is not data.
    --Roger Brinner
Andrew,

> The Packer Solaris database book (Packer, Allan N., _Configuring &
> Tuning Databases on the Solaris Platform_. Palo Alto: Sun
> Microsystems Press, 2002. ISBN 0-13-083417-3) does suggest mounting the
> filesystems with forcedirectio; I dimly recall using this for the wal
> partition on one test box, and STR that it helped.

This is a good idea for a WAL partition, but is NOT a good idea for the database. You may want to look into setting segmap_percent to 70% or something. On Solaris 10 at least, the OS by default only uses 10% of memory for disk buffers.

--
--Josh

Josh Berkus
Aglio Database Solutions
San Francisco
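A sketch of that change; segmap_percent goes in /etc/system and takes effect at the next reboot, and 70 is Josh's suggestion rather than a universal value:

  # /etc/system -- allow up to 70% of memory for filesystem buffers
  # (the default is 10%)
  set segmap_percent=70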
Hi,

I am thinking about how to continuously monitor the performance of a PostgreSQL 8 database. I am interested in two things: (1) the growth of tables with TOAST and indexes; and (2) the response time breakdown for a query.

In Chapters 23 and 24 of the big manual, I found enough materials to teach me how to do the 1st job. But I have difficulty with the 2nd one. I found a script for Oracle (http://www.ixora.com.au/scripts/waits.htm). Do we have something that can do the same job for PostgreSQL 8?

Thanks.

-Jack
Jack,

> I am thinking about how to continuously monitor the performance of a
> PostgreSQL 8 database. I am interested in two things: (1) the growth of
> tables with TOAST and indexes;

This is queryable from the system tables, if you don't mind an approximation.

> and (2) the response time breakdown for a
> query.

The what? You mean EXPLAIN ANALYZE?

> In Chapters 23 and 24 of the big manual, I found enough materials to
> teach me how to do the 1st job. But I have difficulty with the 2nd one.
> I found a script for Oracle
> (http://www.ixora.com.au/scripts/waits.htm).

Life's too short for reading Oracle docs. Can you just explain, in step-by-step detail, what you want?

--
Josh Berkus
Aglio Database Solutions
San Francisco
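For the first item, a sketch of the kind of system-table query Josh means; the database name is a placeholder, and relpages is only refreshed by VACUUM/ANALYZE, hence the approximation:

  # approximate on-disk size of tables ('r'), indexes ('i'), and TOAST
  # tables ('t'), in kilobytes assuming the default 8 KB block size
  psql -d mydb -c "SELECT relname, relkind, relpages * 8 AS approx_kb
                     FROM pg_class
                    WHERE relkind IN ('r', 'i', 't')
                    ORDER BY relpages DESC
                    LIMIT 20;"

Run periodically (e.g., from cron after a nightly VACUUM ANALYZE), the deltas give you the growth trend.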
Josh,

The description of the Oracle script is:

  This script can be used to focus tuning attention on the most important
  issues. It reports a breakdown of total foreground response time into
  four major categories: CPU usage, disk I/O, resource waits, and routine
  latencies. These categories are broken down further into sub-categories,
  and the component wait events are shown. The 8.1.6 version of the script
  uses the ratio_to_report analytic function to calculate percentages. The
  8.1.5 version can be used if percentages are not required. The 8.1.5
  version of the script should work on Oracle8 also, but has not yet been
  tested.

The print out of this script is:

SQL> @response_time_breakdown

MAJOR    MINOR         WAIT_EVENT                                SECONDS    PCT
-------- ------------- ---------------------------------------- -------- ------
CPU time parsing       n/a                                           497   .57%
         reloads       n/a                                            13   .01%
         execution     n/a                                         52618 59.99%

disk I/O normal I/O    db file sequential read                     21980 25.06%
         full scans    db file scattered read                       9192 10.48%
         direct I/O    direct path read                             1484  1.69%
                       direct path write                             354   .40%
         other I/O     log file sequential read                        9   .01%
                       db file parallel read                           0   .00%
                       control file sequential read                    0   .00%
                       control file parallel write                     0   .00%

waits    DBWn writes   rdbms ipc reply                               143   .16%
                       free buffer waits                              36   .04%
                       checkpoint completed                           31   .04%
         LGWR writes   log file switch completion                    698   .80%
         other locks   latch free                                    496   .57%
                       sort segment request                          108   .12%

latency  commits       log file sync                                   6   .01%
         network       SQL*Net break/reset to client                  18   .02%
                       SQL*Net more data to client                     8   .01%
                       SQL*Net message to client                       7   .01%
                       SQL*Net more data from client                   3   .00%
         file ops      file open                                       4   .01%
                       file identify                                   1   .00%
         misc          instance state change                           0   .00%

The script is pretty long:

-------------------------------------------------------------------------------
--
-- Script:     response_time_breakdown.sql
-- Purpose:    to report the components of foreground response time in % terms
-- For:        8.0 to 8.1.5
--
-- Copyright:  (c) Ixora Pty Ltd
-- Author:     Steve Adams
--
-------------------------------------------------------------------------------
@save_sqlplus_settings

column major format a8
column wait_event format a40 trunc
column seconds format 9999999
column pct justify right
break on major skip 1 on minor

select
  substr(n_major, 3) major,
  substr(n_minor, 3) minor,
  wait_event,
  round(time/100) seconds
from (
  select /*+ ordered use_hash(b) */
    '1 CPU time' n_major,
    decode(t.ksusdnam,
      'redo size',      '2 reloads',
      'parse time cpu', '1 parsing',
                        '3 execution'
    ) n_minor,
    'n/a' wait_event,
    decode(t.ksusdnam,
      'redo size',      nvl(r.time, 0),
      'parse time cpu', t.ksusgstv - nvl(b.time, 0),
                        t.ksusgstv - nvl(b.time, 0) - nvl(r.time, 0)
    ) time
  from
    sys.x_$ksusgsta t,
    (
      select /*+ ordered use_nl(s) */  -- star query: few rows from d and b
        s.ksusestn,                    -- statistic#
        sum(s.ksusestv) time           -- time used by backgrounds
      from
        sys.x_$ksusd d,                -- statname
        sys.x_$ksuse b,                -- session
        sys.x_$ksbdp p,                -- background process
        sys.x_$ksusesta s              -- sesstat
      where
        d.ksusdnam in ('parse time cpu', 'CPU used when call started') and
        b.ksspaown = p.ksbdppro and
        s.ksusestn = d.indx and
        s.indx = b.indx
      group by
        s.ksusestn
    ) b,
    (
      select /*+ no_merge */
        ksusgstv *                      -- parse cpu time *
        kglstrld /                      -- SQL AREA reloads /
        (1 + kglstget - kglstght) time  -- SQL AREA misses
      from
        sys.x_$kglst k,
        sys.x_$ksusgsta g
      where
        k.indx = 0 and
        g.ksusdnam = 'parse time cpu'
    ) r
  where
    t.ksusdnam in (
      'redo size',       -- arbitrary: to get a row to replace
      'parse time cpu',  -- with the 'reload cpu time'
      'CPU used when call started') and
    b.ksusestn (+) = t.indx
  union all
  select
    decode(n_minor,
      '1 normal I/O',    '2 disk I/O',
      '2 full scans',    '2 disk I/O',
      '3 direct I/O',    '2 disk I/O',
      '4 BFILE reads',   '2 disk I/O',
      '5 other I/O',     '2 disk I/O',
      '1 DBWn writes',   '3 waits',
      '2 LGWR writes',   '3 waits',
      '3 ARCn writes',   '3 waits',
      '4 enqueue locks', '3 waits',
      '5 PCM locks',     '3 waits',
      '6 other locks',   '3 waits',
      '1 commits',       '4 latency',
      '2 network',       '4 latency',
      '3 file ops',      '4 latency',
      '4 process ctl',   '4 latency',
      '5 global locks',  '4 latency',
      '6 misc',          '4 latency'
    ) n_major,
    n_minor,
    wait_event,
    time
  from (
    select /*+ ordered use_hash(b) use_nl(d) */
      decode(d.kslednam,
        -- disk I/O
        'db file sequential read',                 '1 normal I/O',
        'db file scattered read',                  '2 full scans',
        'BFILE read',                              '4 BFILE reads',
        'KOLF: Register LFI read',                 '4 BFILE reads',
        'log file sequential read',                '5 other I/O',
        'log file single write',                   '5 other I/O',
        -- resource waits
        'checkpoint completed',                    '1 DBWn writes',
        'free buffer waits',                       '1 DBWn writes',
        'write complete waits',                    '1 DBWn writes',
        'local write wait',                        '1 DBWn writes',
        'log file switch (checkpoint incomplete)', '1 DBWn writes',
        'rdbms ipc reply',                         '1 DBWn writes',
        'log file switch (archiving needed)',      '3 ARCn writes',
        'enqueue',                                 '4 enqueue locks',
        'buffer busy due to global cache',         '5 PCM locks',
        'global cache cr request',                 '5 PCM locks',
        'global cache lock cleanup',               '5 PCM locks',
        'global cache lock null to s',             '5 PCM locks',
        'global cache lock null to x',             '5 PCM locks',
        'global cache lock s to x',                '5 PCM locks',
        'lock element cleanup',                    '5 PCM locks',
        'checkpoint range buffer not saved',       '6 other locks',
        'dupl. cluster key',                       '6 other locks',
        'PX Deq Credit: free buffer',              '6 other locks',
        'PX Deq Credit: need buffer',              '6 other locks',
        'PX Deq Credit: send blkd',                '6 other locks',
        'PX qref latch',                           '6 other locks',
        'Wait for credit - free buffer',           '6 other locks',
        'Wait for credit - need buffer to send',   '6 other locks',
        'Wait for credit - send blocked',          '6 other locks',
        'global cache freelist wait',              '6 other locks',
        'global cache lock busy',                  '6 other locks',
        'index block split',                       '6 other locks',
        'lock element waits',                      '6 other locks',
        'parallel query qref latch',               '6 other locks',
        'pipe put',                                '6 other locks',
        'rdbms ipc message block',                 '6 other locks',
        'row cache lock',                          '6 other locks',
        'sort segment request',                    '6 other locks',
        'transaction',                             '6 other locks',
        'unbound tx',                              '6 other locks',
        -- routine waits
        'log file sync',                           '1 commits',
        'name-service call wait',                  '2 network',
        'Test if message present',                 '4 process ctl',
        'process startup',                         '4 process ctl',
        'read SCN lock',                           '5 global locks',
        decode(substr(d.kslednam, 1, instr(d.kslednam, ' ')),
          -- disk I/O
          'direct ',   '3 direct I/O',
          'control ',  '5 other I/O',
          'db ',       '5 other I/O',
          -- resource waits
          'log ',      '2 LGWR writes',
          'buffer ',   '6 other locks',
          'free ',     '6 other locks',
          'latch ',    '6 other locks',
          'library ',  '6 other locks',
          'undo ',     '6 other locks',
          -- routine waits
          'SQL*Net ',  '2 network',
          'BFILE ',    '3 file ops',
          'KOLF: ',    '3 file ops',
          'file ',     '3 file ops',
          'KXFQ: ',    '4 process ctl',
          'KXFX: ',    '4 process ctl',
          'PX ',       '4 process ctl',
          'Wait ',     '4 process ctl',
          'inactive ', '4 process ctl',
          'multiple ', '4 process ctl',
          'parallel ', '4 process ctl',
          'DFS ',      '5 global locks',
          'batched ',  '5 global locks',
          'on-going ', '5 global locks',
          'global ',   '5 global locks',
          'wait ',     '5 global locks',
          'writes ',   '5 global locks',
                       '6 misc'
        )
      ) n_minor,
      d.kslednam wait_event,            -- event name
      i.kslestim - nvl(b.time, 0) time  -- non-background time
    from
      sys.x_$kslei i,                   -- system events
      (
        select /*+ ordered use_hash(e) */  -- no fixed index on e
          e.kslesenm,                      -- event number
          sum(e.kslestim) time             -- time waited by backgrounds
        from
          sys.x_$ksuse s,                  -- sessions
          sys.x_$ksbdp b,                  -- backgrounds
          sys.x_$ksles e                   -- session events
        where
          s.ksspaown = b.ksbdppro and      -- background session
          e.kslessid = s.indx
        group by
          e.kslesenm
        having
          sum(e.kslestim) > 0
      ) b,
      sys.x_$ksled d
    where
      i.kslestim > 0 and
      b.kslesenm (+) = i.indx and
      nvl(b.time, 0) < i.kslestim and
      d.indx = i.indx and
      d.kslednam not in (
        'Null event',
        'KXFQ: Dequeue Range Keys - Slave',
        'KXFQ: Dequeuing samples',
        'KXFQ: kxfqdeq - dequeue from specific qref',
        'KXFQ: kxfqdeq - normal deqeue',
        'KXFX: Execution Message Dequeue - Slave',
        'KXFX: Parse Reply Dequeue - Query Coord',
        'KXFX: Reply Message Dequeue - Query Coord',
        'PAR RECOV : Dequeue msg - Slave',
        'PAR RECOV : Wait for reply - Query Coord',
        'Parallel Query Idle Wait - Slaves',
        'PL/SQL lock timer',
        'PX Deq: Execute Reply',
        'PX Deq: Execution Msg',
        'PX Deq: Index Merge Execute',
        'PX Deq: Index Merge Reply',
        'PX Deq: Par Recov Change Vector',
        'PX Deq: Par Recov Execute',
        'PX Deq: Par Recov Reply',
        'PX Deq: Parse Reply',
        'PX Deq: Table Q Get Keys',
        'PX Deq: Table Q Normal',
        'PX Deq: Table Q Sample',
        'PX Deq: Table Q qref',
        'PX Deq: Txn Recovery Reply',
        'PX Deq: Txn Recovery Start',
        'PX Deque wait',
        'PX Idle Wait',
        'Replication Dequeue',
        'Replication Dequeue ',
        'SQL*Net message from client',
        'SQL*Net message from dblink',
        'debugger command',
        'dispatcher timer',
        'parallel query dequeue wait',
        'pipe get',
        'queue messages',
        'rdbms ipc message',
        'secondary event',
        'single-task message',
        'slave wait',
        'virtual circuit status'
      ) and
      d.kslednam not like 'resmgr:%'
  )
)
order by
  n_major,
  n_minor,
  time desc
/
@restore_sqlplus_settings

Do we have something similar for Postgres? Thanks.

-Jack

On Fri, 2005-03-25 at 12:40, Josh Berkus wrote:
> Jack,
>
> > I am thinking about how to continuously monitor the performance of a
> > PostgreSQL 8 database. I am interested in two things: (1) the growth of
> > tables with TOAST and indexes;
>
> This is queryable from the system tables, if you don't mind an
> approximation.
>
> > and (2) the response time breakdown for a
> > query.
>
> The what? You mean EXPLAIN ANALYZE?
>
> > In Chapters 23 and 24 of the big manual, I found enough materials to
> > teach me how to do the 1st job. But I have difficulty with the 2nd one.
> > I found a script for Oracle
> > (http://www.ixora.com.au/scripts/waits.htm).
>
> Life's too short for reading Oracle docs. Can you just explain, in
> step-by-step detail, what you want?
Jack,

> This script can be used to focus tuning attention on the most important
> issues. It reports a breakdown of total foreground response time into
> four major categories: CPU usage, disk I/O, resource waits, and routine
> latencies. These categories are broken down further into sub-categories,
> and the component wait events are shown.

This would be very nice. And very, very hard to build. No, we don't have anything similar. You can, of course, use profiling tools.

--
Josh Berkus
Aglio Database Solutions
San Francisco
I just wanted to follow up and let everyone know that the biggest improvement in performance came from moving the pg_xlog directory to another filesystem (different set of disks) separate from the data directory.

Thanks for the suggestions.

-- Brandon
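For anyone repeating this, the usual recipe is a symlink swap with the server stopped; the paths here are examples only:

  # stop the postmaster, move the WAL directory to the new disks,
  # and leave a symlink behind so PostgreSQL finds it
  pg_ctl stop -D /db/pgdata
  mv /db/pgdata/pg_xlog /wal/pg_xlog
  ln -s /wal/pg_xlog /db/pgdata/pg_xlog
  pg_ctl start -D /db/pgdata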