Thread: Sanity check requested
Hi folks-

For some time, we've been running Postgres with the default configuration & getting adequate performance, but the time has come to tune a bit, so I've been lurking on this list & gathering notes. Now I'm about ready to make a change & would appreciate it if a few more experienced folks could comment on whether I appear to be heading in the right direction-

Here's what I'm planning:

Increase SHMMAX and SHMALL in my kernel to 134217728 (128MB)
Increase shared_buffers to 8192 (64MB)
Increase sort_mem to 16384 (16MB)
Increase effective_cache_size to 65536 (1/2 GB)

Here's the environment:

The hardware is a dual-processor Athlon 1.2 GHz box with 1 GB of RAM and the DB on SCSI RAID drives. The server runs only PostgreSQL.

The database size is about 8GB, with the largest table at 2.5 GB and the two most commonly queried tables at 1 GB each.

The two most commonly queried tables are usually queried based on a non-unique indexed varchar field, typically 20 chars long. The query is a "like" on people's names with a trailing %, so this often gets pushed to a seq scan or returns several thousand records (as when someone searches on 'Jones%').

Records from the largest table are always accessed via unique index in groups of 20 or less.

The OS is Debian Linux, kernel 2.4.x (recompiled custom kernel for dual-processor support).

The PostgreSQL version is 7.3.2.

We typically have about 30 interactive users on the DB, but they're using a shared connection pool of 16. Our main problem appears to be when one of the users fires up a large query and creates a log-jam with resources.

My reasoning is that I'll increase shared_buffers to 64MB based on anecdotal recommendations I've seen on this list. I'll boost the OS SHMMAX to twice that value to allow adequate room for other shared memory needs, thus reserving 128MB. Of the remaining memory, 256MB goes to 16 connections * 16MB sort space. If I leave about 128 MB for headroom, then 1/2 GB should be left available for the effective cache size.

Any thoughts? Is this a sane plan? Are there other parameters I should consider changing first?

Thanks!
-Nick

---------------------------------------------------------------------
Nick Fankhauser  nickf@doxpop.com  Phone 1.765.965.7363  Fax 1.765.962.9788
doxpop - Court records at your fingertips - http://www.doxpop.com/
---------------------------------------------------------------------
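For concreteness, the proposed changes amount to roughly the following sketch. The sysctl names are the usual Linux ones and the postgresql.conf parameters are the 7.3-era names; note that on Linux SHMALL is normally counted in pages rather than bytes, so check your kernel docs before copying that value verbatim.

  # /etc/sysctl.conf -- raise the shared memory ceiling to 128MB
  kernel.shmmax = 134217728
  kernel.shmall = 134217728    # caution: SHMALL is usually measured in pages, not bytes

  # postgresql.conf (PostgreSQL 7.3 parameter names)
  shared_buffers = 8192            # 8192 x 8KB pages = 64MB
  sort_mem = 16384                 # in KB, so 16MB per sort
  effective_cache_size = 65536     # 65536 x 8KB pages = 512MB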
On 14 Jul 2003 at 12:51, Nick Fankhauser wrote:

> Any thoughts? Is this a sane plan? Are there other parameters I should
> consider changing first?

Well, everything seems to be in order and there isn't much to suggest, I guess. But still:

1. 30 users does not seem to be much of an overhead. If possible, try doing away with connection pooling. But a benchmark test run is highly recommended.

2. While increasing sort memory, try 4/8/16 in that order. That way you will get a better picture of load behaviour. Whatever you put appears reasonable, but having more data always helps.

3. I don't know how this plays out on SCSI drives, but what file system are you using? Can you try different ones, like reiserfs/ext3 and XFS, and see what fits your bill?

4. OK, this is probably too much, but Linux kernel 2.6 is in testing and has a vastly improved IO scheduler. Maybe you should look at it if you are up for experimentation.

HTH

Bye
 Shridhar

--
You're dead, Jim.
		-- McCoy, "The Tholian Web", stardate unknown
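One convenience worth noting for point 2: sort_mem can be changed per session, so the 4/8/16MB steps can be tried against a problem query without restarting the server. A minimal sketch (the query itself is just a placeholder):

  -- sort_mem is specified in KB in 7.3
  SET sort_mem = 4096;            -- 4MB, for this session only
  EXPLAIN ANALYZE SELECT ...;     -- rerun the slow query and note the timing
  SET sort_mem = 8192;            -- then 8MB, then 16384 for 16MB, and compare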
Shridhar-

I appreciate your thoughts- I'll be running some before & after tests on this using one of our development/hot-swap boxes, so I'll report the results back to the list. A few more thoughts/questions:

> 1. 30 users does not seem to be much of an overhead. If possible,
> try doing away with connection pooling.

The application needs to scale up gracefully. We actually have about 200 users that could decide to log on at the same time- 30 is just a typical load. We'd also prefer to have 20,000 subscribers so we can start making a living with this business <g>.

> 2. While increasing sort memory, try 4/8/16 in that order. That
> way you will get a better picture of load behaviour. Whatever you put
> appears reasonable, but having more data always helps.

I'll try that approach while testing. Is it the case that the sort memory is allocated for each connection and becomes unavailable to other processes while the connection exists? If so, since I'm using a connection pool, I should be able to control total usage precisely. Without a connection pool, I could start starving the rest of the system for resources if the number of users spiked unexpectedly. Correct?

> 3. I don't know how this plays out on SCSI drives, but what file
> system are you using? Can you try different ones?

> 4. OK, this is probably too much, but Linux kernel 2.6 is in testing and
> has a vastly improved IO...

I'm using ext2. For now, I'll leave this and the OS version alone. If I change too many variables, I won't be able to discern which one is causing a change. Although I understand that there's an element of art to tuning, I'm enough of a neophyte that I don't have a "feeling" for the tuning parameters yet, and hence I have to take a scientific approach of just tweaking a few variables in an otherwise controlled and unchanged environment. If I can't reach my goals with the simple approach, I'll consider some of the more radical ideas.

Again, thanks for the ideas- I'll feed the results back after I've done some tests.

-Nick
Nick,

> I'll try that approach while testing. Is it the case that the sort memory
> is allocated for each connection and becomes unavailable to other processes
> while the connection exists? If so, since I'm using a connection pool, I
> should be able to control total usage precisely. Without a connection pool,
> I could start starving the rest of the system for resources if the number
> of users spiked unexpectedly. Correct?

Wrong, actually. Sort memory is allocated *per sort*, not per connection or per query. So a single complex query could easily use 4x sort_mem if it has several merge joins ... and a pooled connection could use many times sort_mem depending on activity. Thus connection pooling does not help you with sort_mem usage at all, unless your pooling mechanism can control the rate at which queries are fed to the planner.

--
Josh Berkus
Aglio Database Solutions
San Francisco
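To put rough numbers on that warning (a back-of-the-envelope worst case, not a prediction of actual usage): with the settings proposed earlier in the thread,

  16 pooled connections x 4 concurrent sorts per query x 16MB sort_mem = 1024MB

which is the entire 1GB of RAM on the box described above, before shared_buffers or the OS cache get anything.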
> Wrong, actually. Sort memory is allocated *per sort*, not per connection
> or per query. So a single complex query could easily use 4x sort_mem if it
> has several merge joins ...

Thanks for the correction- it sounds like this is one where usage can't be precisely controlled in a dynamic user environment & I just need to get a feel for what works under a load that approximates my production system.

-Nick
Nick Fankhauser wrote:
> Thanks for the correction- it sounds like this is one where usage
> can't be precisely controlled in a dynamic user environment & I just
> need to get a feel for what works under a load that approximates my
> production system.

I think the most important point here is that if you set sort_mem too high, and you have a lot of simultaneous sorts, you can drive the server into swapping, which obviously is a very bad thing. You want it set as high as possible, but not so high given your usage patterns that you wind up swapping.

Joe
On 17 Jul 2003 at 10:41, Nick Fankhauser wrote:

> I'm using ext2. For now, I'll leave this and the OS version alone. If I

I appreciate your approach, but it's almost proven that ext2 is not the best or fastest out there. IMO, you can safely change that to reiserfs or XFS. Of course, testing is always recommended.

HTH

Bye
 Shridhar

--
Newton's Little-Known Seventh Law: A bird in the hand is safer than one overhead.
I have two tables in the database which are used almost every time someone makes use of the webpages themselves.

The first is a sort of database-side parameter list which stores parameters from session to session. The other is a table that handles the posting activity of all the rooms and chatters.

The first is required in all authentication with the system, and when entries are missing you are challenged by the system to prove your identity. This table is based on a randomized order, as in the unique number changes pseudo-randomly, and it sees a reduction in entries every hour on the hour to keep its information fresh and manageable.

The other table follows a sequential order and carries more columns of information. However, this table clears its entries nightly, and with current settings will delete roughly a day's traffic, sitting at 50K rows of information.

The difference is as follows: without making use of vacuum every hour, the parameter table performs very well, showing no loss in service or degradation. Since people authenticate more than they post, it is safe to assume that it removes more rows daily than the posting table does.

The posting table often drags the system down in performance when a day has been skipped, which includes the use of VACUUM ANALYZE EXPLAIN. This seems to be an indication that the process of a daily delete is actually a very wise step to take, even if the information itself is not needed for very long.

A VACUUM FULL will correct the issue, but put the site out of commission for roughly 20 minutes as the drive crunches the information.

My question is: should the purging of rows be done more often than once a day for both tables? Is this why performance seems to take a hit specifically? As there were too many rows purged for vacuum to accurately keep track of?

Martin Foster
Creator/Designer Ethereal Realms
martin@ethereal-realms.org
Shridhar Daithankar wrote:
> On 17 Jul 2003 at 10:41, Nick Fankhauser wrote:
>
>> I'm using ext2. For now, I'll leave this and the OS version alone. If I
>
> I appreciate your approach, but it's almost proven that ext2 is not the
> best or fastest out there.

Agreed.

> IMO, you can safely change that to reiserfs or XFS. Of course, testing is
> always recommended.

We've been using ext3fs for our production systems. (Red Hat Advanced Server 2.1)

And since your (Nick) system is based on Debian, I have done some rough testing on Debian sarge (testing) (with custom 2.4.20) with ext3fs, reiserfs and jfs. Can't get XFS going easily on Debian, though.

I used a single partition mkfs'd with ext3fs, reiserfs and jfs one after the other on an IDE disk. Ran pgbench and osdb-x0.15-0 on it.

jfs has been underperforming for me. Somehow the CPU usage is higher than the other two. As for ext3fs and reiserfs, I can't detect any significant difference. So if you're in a hurry, it'll be easier to convert your ext2 to ext3 (using tune2fs) and use that. Otherwise, it'd be nice if you could do your own testing, and post it to the list.

--
Linux homer 2.4.18-14 #1 Wed Sep 4 13:35:50 EDT 2002 i686 i686 i386 GNU/Linux
  2:30pm  up 204 days,  5:35,  5 users,  load average: 5.50, 5.18, 5.13
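The ext2-to-ext3 conversion mentioned above is non-destructive; roughly, it looks like this (the device name and mount point are placeholders, not anything from the thread):

  # add a journal to the existing ext2 filesystem
  tune2fs -j /dev/sdXN

  # then change the fstab entry's filesystem type from ext2 to ext3 and remount
  /dev/sdXN  /var/lib/postgres  ext3  defaults  0  2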
On 18/07/2003 07:55 Martin Foster wrote:
> [snip]
> A VACUUM FULL will correct the issue, but put the site out of commission
> for roughly 20 minutes as the drive crunches the information.
>
> My question is: should the purging of rows be done more often than once
> a day for both tables? Is this why performance seems to take a hit
> specifically? As there were too many rows purged for vacuum to
> accurately keep track of?

ISTR that there are settings in postgresql.conf which affect how many tables/rows vacuum can reclaim. The docs say that the default setting of max_fsm_pages is 10000. Maybe this should be increased for your situation?

HTH

--
Paul Thomas
+------------------------------+---------------------------------------------+
| Thomas Micro Systems Limited | Software Solutions for the Smaller Business |
| Computer Consultants         | http://www.thomas-micro-systems-ltd.co.uk   |
+------------------------------+---------------------------------------------+
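The free space map settings referred to live in postgresql.conf; a sketch of what raising them might look like (the values are purely illustrative, and changing them requires a server restart since the FSM is allocated in shared memory):

  # postgresql.conf -- free space map sizing (illustrative values only)
  max_fsm_relations = 1000     # roughly, the number of tables tracked in the FSM
  max_fsm_pages = 50000        # pages with free space that VACUUM can remember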
On Fri, Jul 18, 2003 at 12:55:12AM -0600, Martin Foster wrote:

> The other table follows a sequential order and carries more columns of
> information. However, this table clears its entries nightly, and with
> current settings will delete roughly a day's traffic, sitting at 50K rows
> of information.
>
> has been skipped, which includes the use of VACUUM ANALYZE EXPLAIN.
> This seems to be an indication that the process of a daily delete is
> actually a very wise step to take, even if the information itself is not
> needed for very long.
>
> A VACUUM FULL will correct the issue, but put the site out of commission
> for roughly 20 minutes as the drive crunches the information.

During your "clearing period", why not do the deletes in batches, and VACUUM the table periodically? That will allow you to reclaim the space gradually, and ensure that you don't end up with a big "bald spot". But you probably want to increase your FSM settings. See the docs.

A

--
----
Andrew Sullivan                         204-4141 Yonge Street
Liberty RMS                           Toronto, Ontario Canada
<andrew@libertyrms.info>                              M2P 2A8
                                         +1 416 646 3304 x110
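As a rough illustration of the batched approach (the table and column names here are made up, since the actual schema isn't shown in the thread), the nightly mass delete could become a small hourly job followed by a plain, non-FULL vacuum:

  -- run every hour instead of one big purge at night
  DELETE FROM posting WHERE posted_at < now() - interval '24 hours';
  VACUUM ANALYZE posting;   -- plain vacuum; avoids the exclusive lock that VACUUM FULL takes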
Martin Foster <martin@ethereal-realms.org> writes:
> My question is: should the purging of rows be done more often than once
> a day for both tables? Is this why performance seems to take a hit
> specifically?

Given that the hourly purge seems to work well for you, I'd suggest trying it on both tables. Non-FULL vacuum is intended to be run *frequently*, say as often as you've updated or deleted 10% to 50% of the rows in a table. Delaying it until you've had multiple complete turnovers of the table contents will cost you.

> As there were too many rows purged for vacuum to
> accurately keep track of?

Only possible if you don't have the FSM parameters set high enough. Infrequent vacuuming means you need more FSM space, btw.

			regards, tom lane
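For scheduling that kind of frequent, non-FULL vacuum from outside the application, a cron entry along these lines is one option (table and database names are placeholders; vacuumdb's -z flag adds an ANALYZE):

  # crontab for the postgres user: plain vacuum of the busy tables once an hour
  15 * * * *  vacuumdb -z -t posting mydb
  20 * * * *  vacuumdb -z -t parameter mydb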
Thanks for the suggestions on the FS types- especially the Debian-oriented info.

I'll start by playing with the memory allocation parameters that I originally listed (seems like they should provide results in a way that is unaffected by the disk IO). Then once I have them at optimal values, I'll move on to trying different file systems.

I assume that as I make changes that affect the disk IO performance, I'll then need to do some testing to find new values for the IO cost for the planner- Do you folks have some ballpark numbers to start with for this based on your experience? I'm departing in three ways from the simple IDE model that (I presume) the default random page cost of 4 is based on- the disks are SCSI & RAID, and the FS would be different.

At this point, I can't think of any better way to test this than simply running my local test suite with various values and recording the wall-clock results. Is there a different approach that might make more sense? (This means that my results will be skewed to my environment, but I'll post them anyway.)

I'll post results back to the list as I get to it- it might be a slow process. Since I spend about 18 hours of each day keeping the business running, I'll have to cut back on sleep & do this in the other 10 hours. <g>

-NF

> Shridhar Daithankar wrote:
> I appreciate your approach, but it's almost proven that ext2 is
> not the best or fastest out there.

> Agreed.

> Ang Chin Han wrote:
> We've been using ext3fs for our production systems. (Red Hat Advanced
> Server 2.1)
>
> And since your (Nick) system is based on Debian, I have done some rough
> testing on Debian sarge (testing) (with custom 2.4.20) with ext3fs,
> reiserfs and jfs. Can't get XFS going easily on Debian, though.
>
> I used a single partition mkfs'd with ext3fs, reiserfs and jfs one after
> the other on an IDE disk. Ran pgbench and osdb-x0.15-0 on it.
>
> jfs has been underperforming for me. Somehow the CPU usage is higher
> than the other two. As for ext3fs and reiserfs, I can't detect any
> significant difference. So if you're in a hurry, it'll be easier to
> convert your ext2 to ext3 (using tune2fs) and use that. Otherwise, it'd
> be nice if you could do your own testing, and post it to the list.
>
> --
> Linux homer 2.4.18-14 #1 Wed Sep 4 13:35:50 EDT 2002 i686 i686 i386
> GNU/Linux
>   2:30pm  up 204 days,  5:35,  5 users,  load average: 5.50, 5.18, 5.13
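For what it's worth, the planner cost parameter in question can also be overridden per session while testing, so a test suite can be run at a few different values without touching postgresql.conf. A minimal sketch (the values are just sample points, not recommendations):

  SET random_page_cost = 4;    -- the default, as a baseline
  -- run the test suite and record the wall-clock times
  SET random_page_cost = 2;    -- then try a lower value and compare
  -- run the test suite again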
On Fri, 18 Jul 2003, Ang Chin Han wrote:

> Shridhar Daithankar wrote:
> > On 17 Jul 2003 at 10:41, Nick Fankhauser wrote:
> >
> >> I'm using ext2. For now, I'll leave this and the OS version alone. If I
> >
> > I appreciate your approach, but it's almost proven that ext2 is not the
> > best or fastest out there.
>
> Agreed.

Huh? How can journaled file systems hope to outrun a simple unjournaled file system? There's just less overhead for ext2, so it's quicker; it's just not as reliable.

I point you to this link from IBM:

http://www-124.ibm.com/developerworks/opensource/linuxperf/iozone/iozone.php

While ext3 is a clear loser to jfs and reiserfs, ext2 wins most of the contests against both reiserfs and jfs. Note that xfs wasn't tested here. But in general, ext2 is quite fast nowadays.

> > IMO, you can safely change that to reiserfs or XFS. Of course, testing is
> > always recommended.
>
> We've been using ext3fs for our production systems. (Red Hat Advanced
> Server 2.1)
>
> And since your (Nick) system is based on Debian, I have done some rough
> testing on Debian sarge (testing) (with custom 2.4.20) with ext3fs,
> reiserfs and jfs. Can't get XFS going easily on Debian, though.
>
> I used a single partition mkfs'd with ext3fs, reiserfs and jfs one after
> the other on an IDE disk. Ran pgbench and osdb-x0.15-0 on it.
>
> jfs has been underperforming for me. Somehow the CPU usage is higher
> than the other two. As for ext3fs and reiserfs, I can't detect any
> significant difference. So if you're in a hurry, it'll be easier to
> convert your ext2 to ext3 (using tune2fs) and use that. Otherwise, it'd
> be nice if you could do your own testing, and post it to the list.

I would like to see some tests on how they behave on top of large fast RAID arrays, like a 10 disk RAID5 or something. It's likely that on a single IDE drive the most limiting factor is the bandwidth of the drive, whereas on a large array, the limiting factor would likely be the file system code.
On 2003-07-17 10:41:35 -0500, Nick Fankhauser wrote:
> I'm using ext2. For now, I'll leave this and the OS version alone. If I

I'd upgrade to a journaling filesystem as soon as possible for reliability. Testing in our own environment has shown that PostgreSQL performs best on ext3 (yes, better than XFS, JFS or ReiserFS) with a Linux 2.4.21 kernel. Be sure to mount noatime, and to create the ext3 partition with the correct stripe size for your RAID array using the '-R stride=foo' option (see man mke2fs).

Vincent van Leeuwen
Media Design - http://www.mediadesign.nl/
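A sketch of what that looks like in practice (device, mount point and numbers are placeholders; stride is typically the RAID chunk size divided by the filesystem block size, so a 64KB chunk with 4KB blocks gives stride=16):

  # create the ext3 filesystem with a stripe-aware layout
  mke2fs -j -b 4096 -R stride=16 /dev/sdXN

  # /etc/fstab entry mounting it with noatime
  /dev/sdXN  /var/lib/postgres  ext3  noatime  0  2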
> Be sure to mount noatime

I did "chattr -R +A /var/lib/pgsql/data" - that should do the trick as well, or am I wrong?

regards,
Oli
On 2003-07-18 18:20:55 +0200, Oliver Scheit wrote:
> > Be sure to mount noatime
>
> I did "chattr -R +A /var/lib/pgsql/data"
> that should do the trick as well, or am I wrong?

According to the man page it gives the same effect. There are a few things you should consider though:
- new files won't be created with the same options (I think), so you'll have to run this command as a daily cronjob or something to that effect
- chattr is probably more filesystem-specific than a noatime mount, although this isn't a problem on ext[23] of course

Vincent van Leeuwen
Media Design - http://www.mediadesign.nl/
>> > Be sure to mount noatime
>>
>> I did "chattr -R +A /var/lib/pgsql/data"
>> that should do the trick as well, or am I wrong?
>
> According to the man page it gives the same effect.
> There are a few things you should consider though:
> - new files won't be created with the same options (I think),
>   so you'll have to run this command as a daily cronjob or
>   something to that effect

This would be a really interesting point to know. I will look into this.

I think the advantage of "chattr" is that the last access time is still available for the rest of the filesystem. (Of course you could have your own filesystem just for the database stuff, in which case the advantage would be obsolete.)

regards,
Oli
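For reference, the two alternatives being compared look roughly like this (device name is a placeholder; the paths are the ones from the thread, and whether new files inherit +A is exactly the open question above):

  # option 1: disable atime updates for the whole filesystem, via /etc/fstab
  /dev/sdXN  /var/lib/pgsql  ext2  noatime  0  2

  # option 2: disable atime updates only for the existing database files
  chattr -R +A /var/lib/pgsql/data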
I'm confused:

Ang Chin Han wrote:
> We've been using ext3fs for our production systems. (Red Hat Advanced
> Server 2.1)

Vincent van Leeuwen wrote:
> I'd upgrade to a journaling filesystem as soon as possible for
> reliability.

...About one year ago I considered moving to a journaling file system, but opted not to because it seems like that's what WAL does for us already. How does putting a journaling file system under it add more reliability?

I also guessed that a journaling file system would add overhead because now a write to the WAL file could itself be deferred and logged elsewhere.

...So now I'm really puzzled because folks are weighing in with solid anecdotal evidence saying that I'll get both better reliability and performance. Can someone explain what I'm missing about the concept?

-A puzzled Nick
Nick,

> ...About one year ago I considered moving to a journaling file system, but
> opted not to because it seems like that's what WAL does for us already. How
> does putting a journaling file system under it add more reliability?

It lets you restart your server quickly after an unexpected power-out. Ext2 is notoriously bad about this. Also, WAL cannot necessarily recover properly if the underlying filesystem is corrupted.

> I also guessed that a journaling file system would add overhead because now
> a write to the WAL file could itself be deferred and logged elsewhere.

You are correct.

--
-Josh Berkus
 Aglio Database Solutions
 San Francisco
> ...About one year ago I considered moving to a journaling file system, but
> opted not to because it seems like that's what WAL does for us already. How
> does putting a journaling file system under it add more reliability?

WAL only works if the WAL files are actually written to disk and can be read off it again. Ext2 has a number of deficiencies which can cause problems with this basic operation (inode corruptions, etc). Journaling does not directly help.
"Nick Fankhauser" <nickf@ontko.com> writes: > I'm departing in three ways from the simple IDE > model that (I presume) the default random page cost of 4 is based on- The > disks are SCSI & RAID and the FS would be different. Actually, the default 4 is based on experiments I did quite awhile back on HPUX (with a SCSI disk) and Linux (with an IDE disk, and a different filesystem). I didn't see too much difference between 'em. RAID might alter the equation, or not. regards, tom lane