Thread: Configuring PostgreSQL to minimize impact of checkpoints
My company is developing a PostgreSQL application. We're using 7.3.4 but will soon upgrade to 7.4.x. Our OS is Red Hat 9. Our production machines have 512 MB RAM and IDE disks. So far we've been using default configuration settings, but I have started to examine performance and to modify these settings.

Our typical transaction involves 5-10 SELECTs, INSERTs, or UPDATEs (usually about half SELECTs and the remainder a mixture of INSERTs and UPDATEs). There are a few aggregation queries which need to scan an entire table. We observed highly uneven performance for the small transactions. A transaction usually runs in under 100 msec, but we would see spikes as high as 40,000 msec. These spikes occurred regularly, every 4-5 minutes, and I speculated that checkpointing might be the issue.

I created a test case, based on a single table:

    create table test(
        id int not null,
        count int not null,
        filler varchar(200),
        primary key(id))

I loaded a database with 1,000,000 rows, with the filler column always filled with 200 characters. I then ran a test in which a random row was selected and the count column incremented. Each transaction contained ten such updates. In this test, I set

    shared_buffers = 2000
    checkpoint_segments = 40
    checkpoint_timeout = 600
    wal_debug = 1

I set checkpoint_segments high because I wanted to see whether the spikes correlated with checkpoints. Most transactions completed in under 60 msec. Approximately every 10th transaction, the time went up to 500-600 msec (which is puzzling, but not my major concern). I did see a spike every 10 minutes, in which transaction time goes up to 5000-8000 msec. The spikes were correlated with checkpoint activity, occurring slightly before a log entry that looks like this:

    2004-05-09 16:34:19 LOG: INSERT @ 2/C2A0F628: prev 2/C2A0F5EC; xprev 0/0; xid 0: XLOG - checkpoint: redo 2/C2984D4C; undo 0/0; sui 36; xid 1369741; oid 6321782; online

Questions:
1. Can someone provide an overview of checkpoint processing, to help me understand the performance issues?

2. Is the spike due to the checkpoint process keeping the disk busy? Or is there some locking involved that blocks my application until the checkpoint completes?

3. The spikes are quite problematic for us. What can I do to minimize the impact of checkpointing on my application? I understand how checkpoint_segments and checkpoint_timeout determine when a checkpoint occurs; what can I do to lessen the impact of a checkpoint?

4. I understand that a "background writer" is being contemplated for 7.5. Will that replace or augment the checkpoint process? Any comments on how that work will apply to my problem would be appreciated. I wouldn't mind seeing the average performance (without the spikes) go up -- let's say -- 10%, in exchange for more uniform performance. These spikes are a real problem.

Jack Orenstein

----------------------------------------------------------------
This message was sent using IMP, the Internet Messaging Program.
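[A crude way to confirm the spike cadence Jack describes is to record per-transaction latency and flag the outliers. A minimal sketch in Python; the helper name and 1000 ms threshold are illustrative, not from the thread:]

```python
def find_spikes(latencies_ms, threshold_ms=1000):
    """Return (index, latency) pairs for transactions whose latency
    exceeds threshold_ms -- a crude way to locate checkpoint stalls
    in a recorded run."""
    return [(i, ms) for i, ms in enumerate(latencies_ms) if ms >= threshold_ms]

# Example: mostly ~60 ms transactions with one checkpoint-sized stall.
run = [55, 60, 58, 5200, 61, 57]
print(find_spikes(run))  # -> [(3, 5200)]
```

Plotting the flagged indices against wall-clock time is what reveals the regular 10-minute pattern that pointed at checkpoints in the first place.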
jao@geophile.com wrote:
> 4. I understand that a "background writer" is being contemplated for
> 7.5. Will that replace or augment the checkpoint process? Any
> comments on how that work will apply to my problem would be
> appreciated. I wouldn't mind seeing the average performance
> (without the spikes) go up -- let's say -- 10%, in exchange for
> more uniform performance. These spikes are a real problem.

The background writer is designed to address your specific problem. We will still checkpoint, but the spike should be greatly minimized.

--
Bruce Momjian                   | http://candle.pha.pa.us
pgman@candle.pha.pa.us          | (610) 359-1001
+ If your life is a hard drive, | 13 Roberts Road
+ Christ can be your backup.    | Newtown Square, Pennsylvania 19073
Bruce Momjian wrote:
> jao@geophile.com wrote:
>> 4. I understand that a "background writer" is being contemplated for
>> 7.5. Will that replace or augment the checkpoint process? ...
>
> The background writer is designed to address your specific problem. We
> will still checkpoint, but the spike should be greatly minimized.

Thanks. Do you know when 7.5 is expected to be released? Until then, is a workaround known?

Also, are the delays I'm seeing out of the ordinary? I'm looking at one case in which two successive transactions, each updating a handful of records, take 26 and 18 *seconds* (not msec) to complete. These transactions normally complete in under 30 msec.

Jack Orenstein
Jack Orenstein wrote:
> Bruce Momjian wrote:
>> The background writer is designed to address your specific problem. We
>> will still checkpoint, but the spike should be greatly minimized.
>
> Thanks. Do you know when 7.5 is expected to be released?

3-6 months.

> Until then, is a workaround known? Also, are the delays I'm seeing out
> of the ordinary? I'm looking at one case in which two successive
> transactions, each updating a handful of records, take 26 and 18
> *seconds* (not msec) to complete. These transactions normally complete
> in under 30 msec.

Wow. Others might know the answer to that.

--
Bruce Momjian | http://candle.pha.pa.us | pgman@candle.pha.pa.us | (610) 359-1001
Jack Orenstein <jao@geophile.com> writes:
> I'm looking at one case in which two successive transactions, each
> updating a handful of records, take 26 and 18 *seconds* (not msec) to
> complete. These transactions normally complete in under 30 msec.

I've seen installations in which it seemed that the "normal" query load was close to saturating the available disk bandwidth, and the extra load imposed by a background VACUUM just pushed the entire system's response time over a cliff. In an installation that has I/O capacity to spare, a VACUUM doesn't really hurt foreground query response at all. I suspect that the same observations hold true for checkpoints, though I haven't specifically seen an installation suffering from that effect.

Already-committed changes for 7.5 include a background writer, which basically will "trickle" out dirty pages between checkpoints, thereby hopefully reducing the volume of I/O forced at a checkpoint. We have also got code in place that throttles the rate of I/O requests during VACUUM. It seems like it might be useful to similarly throttle the I/O request rate during a CHECKPOINT, though I'm not sure if there'd be any bad side effects from lengthening the elapsed time for a checkpoint. (Jan, any thoughts here?)

None of this is necessarily going to fix matters for an installation that has no spare I/O capacity, though. And from the numbers you're quoting I fear you may be in that category. "Buy faster disks" may be the only answer ...

regards, tom lane
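[Tom's "trickle" point can be shown with a toy model: whatever the background writer flushes between checkpoints is removed from the burst the checkpoint must write all at once. A hedged sketch; the page counts and rates are made up for illustration:]

```python
def checkpoint_burst(dirty_pages, trickle_per_tick, ticks):
    """Toy model of the 7.5 background writer: trickling
    `trickle_per_tick` dirty pages on each of `ticks` intervals between
    checkpoints leaves that much less to write in one burst at
    checkpoint time."""
    trickled = min(dirty_pages, trickle_per_tick * ticks)
    return dirty_pages - trickled  # pages left for the checkpoint burst

# 10,000 dirty pages; trickling 50 pages on each of 100 intervals
# halves the checkpoint burst:
print(checkpoint_burst(10_000, 50, 100))  # -> 5000
print(checkpoint_burst(10_000, 0, 100))   # -> 10000 (no background writer)
```

The model also reflects Tom's caveat: the trickled pages are extra total I/O if they get dirtied again before the checkpoint, so a system with no spare bandwidth can come out behind.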
Tom Lane <tgl@sss.pgh.pa.us> writes:
> Jack Orenstein <jao@geophile.com> writes:
>> I'm looking at one case in which two successive transactions, each
>> updating a handful of records, take 26 and 18 *seconds* (not msec) to
>> complete. These transactions normally complete in under 30 msec.
> ...
> None of this is necessarily going to fix matters for an installation
> that has no spare I/O capacity, though. And from the numbers you're
> quoting I fear you may be in that category. "Buy faster disks" may
> be the only answer ...

I had a computer once that had an out-of-the-box hard drive configuration that provided horrible disk performance. I found a tutorial at O'Reilly that explained how to use hdparm to dramatically speed up disk performance on Linux. I've noticed on other computers I've set up recently that hdparm seems to be used by default out of the box to give good performance.

Maybe your computer is using all of its I/O capacity because it's using PIO mode or some other non-optimal method of accessing the disk.

Just a suggestion, I hope it helps,

Matthew Nuzum | ISPs: Make $200 - $5,000 per referral by
www.followers.net | recommending Elite CMS to your customers!
matt@followers.net | http://www.followers.net/isp
Jan Wieck <JanWieck@Yahoo.com> writes:
> If we would combine the background writer and the checkpointer,

... which in fact is on my agenda of things to do ...

> then a "checkpoint flush" could actually be implemented as a temporary
> change in that activity that basically is done by not reevaluating the
> list of to-be-flushed blocks any more and switching to a constant
> amount of blocks flushed per cycle. When that list gets empty, the
> checkpoint flush is done, the checkpoint can complete and the
> background writer resumes normal business.

Sounds like a plan. I'll do it that way. However, we might want to have different configuration settings controlling the write rate during checkpoint and the rate during normal background writing --- what do you think?

Also, presumably a shutdown checkpoint should just whomp out the data without any delays. We can't afford to wait around and risk having init decide we took too long.

>> None of this is necessarily going to fix matters for an installation
>> that has no spare I/O capacity, though.

> As a matter of fact, the background writer increases the overall I/O.
> It writes buffers that possibly get modified again before a checkpoint,
> or their replacement requires them to be finally written. So if there
> is no spare I/O bandwidth, it makes things worse.

Right, the trickle writes could be wasted effort.

regards, tom lane
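[Jan's scheme — freeze the to-be-flushed list and drain it at a constant number of blocks per cycle — can be sketched in a few lines. This is a paraphrase of the idea, not the actual backend code; the cycle size is illustrative:]

```python
from collections import deque

def checkpoint_flush(dirty_blocks, blocks_per_cycle):
    """Sketch of Jan's proposal: during a checkpoint flush, stop
    re-evaluating the dirty list and drain a frozen snapshot of it at a
    constant rate. Returns how many write cycles the checkpoint took;
    spreading the flush over cycles is what smooths the I/O burst."""
    queue = deque(dirty_blocks)  # frozen snapshot of blocks to flush
    cycles = 0
    while queue:
        for _ in range(min(blocks_per_cycle, len(queue))):
            queue.popleft()  # stand-in for writing one block to disk
        cycles += 1
    return cycles  # when the list is empty, the checkpoint can complete

# 1000 dirty blocks at 64 blocks per cycle -> 16 cycles:
print(checkpoint_flush(range(1000), 64))  # -> 16
```

Tom's follow-up maps directly onto the `blocks_per_cycle` knob: one rate for checkpoint flushes, another for normal background writing, and effectively "infinite" for a shutdown checkpoint.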
Matthew Nuzum wrote:
>> None of this is necessarily going to fix matters for an installation
>> that has no spare I/O capacity, though. And from the numbers you're
>> quoting I fear you may be in that category. "Buy faster disks" may
>> be the only answer ...
>
> Maybe your computer is using all of its I/O capacity because it's using
> PIO mode or some other non-optimal method of accessing the disk.

There's certainly some scope there. I have an SGI Octane whose SCSI-2 disks were set up by default with no write buffer and a CTQ depth of zero :/

The IDE driver in Linux may not be detecting your IDE chipset correctly and stepping down. However unlikely, there may be something odd going on, so you could check hdparm out. Ensure the cables are correct too, and that they aren't crushed or twisted too badly... but I digress.

Assuming you're running with optimal schema and index design (i.e. you're not doing extra work unnecessarily), and your backend has better-than-default config options set up (plenty of tips around here), then disk arrangement is critical to smoothing the ride.

Taking things to a relative extreme, we implemented a set-up with issues similar-sounding to yours. It was resolved by first optimising everything but hardware, then finally optimising hardware.
This served us well because it meant we squeezed as much as possible out of the available hardware before finally throwing more at it, getting us the best possible returns (plus further post-optimisation on the new hardware).

First tip would be to take your pg_xlog and put it on another disk (and channel). Next, if you're running a journalled fs, get that journal off onto another disk (and channel). Finally, get as many disks as you can for the data store and spread the load across spindles. You're aiming here to distribute the contention and disk I/O more evenly to remove the congestion. sar and iostat help out as part of the analysis.

You say you're using IDE, for which I'd highly recommend switching to SCSI and multiple controllers, because IDE isn't great for lots of other reasons. Obviously budgets count, and playing with SCSI certainly strains them. We took a total of 8 disks across 2 SCSI-160 channels and split the drives into a number of software RAID arrays: RAID1 mirrors for the OS, pg_xlog, data-disk journal, and swap, and the rest became a RAID5 array for the data. You could instead implement your data disk as RAID1+0 if you wanted more performance at the cost of free space.

Anyway, it's certainly not the fastest config out there, but it made all the difference to this particular application. In fact, we had so much free I/O that we recently installed another app on there (based on MySQL, sorry) which runs concurrently, and it runs four times faster than it originally did...

YMMV, just my 2p.

--
Rob Fielding
rob@dsvr.net
www.dsvr.co.uk
Development, Designer Servers Ltd
Quoting Rob Fielding <rob@dsvr.net>:
> Assuming you're running with optimal schema and index design (ie you're
> not doing extra work unnecessarily), and your backend has
> better-than-default config options set up (plenty of tips around here),
> then disk arrangement is critical to smoothing the ride.

The schema and queries are extremely simple. I've been experimenting with config options. One possibility I'm looking into is whether shared_buffers is too high, at 12000. We have some preliminary evidence that setting it lower (1000) reduces the demand for I/O bandwidth to a point where the spikes become almost tolerable.

> First tip would be to take your pg_xlog and put it on another disk (and
> channel).

That's on my list of things to try.

> Next, if you're running a journalled fs, get that journal off onto
> another disk (and channel). Finally, get as many disks for the data
> store and spread the load across spindles.

Dumb question: how do I spread the data across spindles? Do you have a pointer to something I could read?

Jack Orenstein
On Tue, 11 May 2004, jao@geophile.com wrote:
> The schema and queries are extremely simple. I've been experimenting
> with config options. One possibility I'm looking into is whether
> shared_buffers is too high, at 12000. We have some preliminary evidence
> that setting it lower (1000) reduces the demand for IO bandwidth to
> a point where the spikes become almost tolerable.

If shared_buffers is large, PostgreSQL seems to have a performance issue with handling the buffers. Plus they can cause the kernel to dump cache on things that would otherwise be right there, and therefore force the database to hit the drives. You might wanna try settings between 1000 and 10000 and see where your sweet spot is.

> Dumb question: how do I spread the data across spindles? Do you have
> a pointer to something I could read?

Look into a high-quality hardware RAID controller with battery-backed cache on board. We use the AMI/LSI MegaRAID and I'm quite pleased with its writing performance. How you configure your drives is up to you. For smaller numbers of drives (6 or fewer) RAID 1+0 is usually a clear winner. For medium numbers of drives, say 8 to 20, RAID 5 works well. For more drives than that, many folks report RAID 5+0 or 0+5 to work well.
I've only played around with 12 or fewer drives, so the RAID 5+0 suggestion isn't from my own experience; I'm just reflecting back what I've heard here on the performance mailing list.

If you're not doing much writing, then software RAID may be a good intermediate solution; in particular, RAID1 with >2 disks under Linux seems a good setup for a mostly-read database.
The king of statistics in these cases is probably vmstat. One can drill down on specific things from there, but first you should send some vmstat output.

Reducing cache -> reducing I/O suggests to me the OS might be paging out shared buffers. This is indicated by activity in the "si" and "so" columns of vmstat. Intentional disk activity by the application (postgres) shows up in the "bi" and "bo" columns.

If you are having a "write storm" or bursty writes that's burying performance, a SCSI RAID controller with writeback cache will greatly improve the situation, but I do believe they run around $1-2k. If it's a write-specific problem, the cache matters more than the striping, except to say that write-specific perf problems should avoid RAID5.

Please send the output of "vmstat 10" for about 10 minutes, spanning good performance and bad performance.

On May 11, 2004, at 9:52 AM, jao@geophile.com wrote:
> The schema and queries are extremely simple. I've been experimenting
> with config options. One possibility I'm looking into is whether
> shared_buffers is too high, at 12000. We have some preliminary evidence
> that setting it lower (1000) reduces the demand for IO bandwidth to
> a point where the spikes become almost tolerable.
>
> Dumb question: how do I spread the data across spindles? Do you have
> a pointer to something I could read?
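[The "vmstat 10" output Paul asks for can be boiled down mechanically. A hedged sketch: it averages the swap (si/so) and block-I/O (bi/bo) columns, assuming the classic layout with those fields at positions 6-9 — check your vmstat version's header before trusting the indices. The sample rows are invented:]

```python
def summarize_vmstat(lines):
    """Average the si/so (swap) and bi/bo (block I/O) columns of
    `vmstat 10` output. High si/so means the OS is paging (possibly
    shared buffers going out); high bo bursts point at write storms."""
    rows = []
    for line in lines:
        fields = line.split()
        if fields and fields[0].isdigit():  # skip the header lines
            rows.append([int(f) for f in fields])
    n = len(rows)
    return {name: sum(r[i] for r in rows) / n
            for name, i in [("si", 6), ("so", 7), ("bi", 8), ("bo", 9)]}

sample = [
    "procs                      memory    swap          io",
    " r  b   swpd   free   buff  cache  si  so    bi    bo",
    " 1  0      0  12000   3000  90000   0   0    40  5200",
    " 0  1      0  11800   3000  90100   0   0    35  4800",
]
print(summarize_vmstat(sample))
# -> {'si': 0.0, 'so': 0.0, 'bi': 37.5, 'bo': 5000.0}
```

In this invented sample, si/so are zero (no paging) while bo is large, which would point at a genuine write burst rather than memory overcommitment.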
On Tue, 11 May 2004, Paul Tuckfield wrote:
> If you are having a "write storm" or bursty writes that's burying
> performance, a SCSI RAID controller with writeback cache will greatly
> improve the situation, but I do believe they run around $1-2k.

Actually, a single-channel MegaRAID 320-1 (single channel Ultra 320) is only $421 at http://www.siliconmechanics.com/c248/u320-scsi.php It works pretty well for me, having had 6 months of a production server on one with zero hiccups and very good performance. They have a dual-channel Intel card for only $503, but I'm not at all familiar with that card.

The top of the line MegaRAID is the 320-4, which is only $1240, which ain't bad for a four-channel RAID controller.

Battery-backed cache is an add-on, but I think it's only about $80 or so.
-----Original Message-----
From: scott.marlowe
Subject: Re: [PERFORM] Configuring PostgreSQL to minimize impact of checkpoints

> Actually, a single channel MegaRAID 320-1 (single channel ultra 320) is
> only $421 at http://www.siliconmechanics.com/c248/u320-scsi.php ...
> Battery backed cache is an addon, but I think it's only about $80 or so.
-----------------------------

If you don't mind slumming on eBay :-) keep an eye out for PERC III cards; they are Dell-branded LSI cards. PERC = PowerEdge RAID Controller. There are dual-channel U320 models on there, and Dell usually sells them with battery-backed cache. That's how I have acquired all my high-end RAID cards.

Rob
Love that froogle.

It looks like a nice card. One thing I didn't get straight is whether the cache is write-through or write-back. If the original poster's problem is truly a burst-write problem (and not Linux caching or virtual memory overcommitment) then write-back is key.

> On Tue, 11 May 2004, Paul Tuckfield wrote:
>> If you are having a "write storm" or bursty writes that's burying
>> performance, a SCSI RAID controller with writeback cache will greatly
>> improve the situation ...
>
> Actually, a single channel MegaRAID 320-1 (single channel ultra 320) is
> only $421 at http://www.siliconmechanics.com/c248/u320-scsi.php ...
> Battery backed cache is an addon, but I think it's only about $80 or so.
On Tue, 11 May 2004, Paul Tuckfield wrote:
> Love that froogle.
>
> It looks like a nice card. One thing I didn't get straight is whether
> the cache is write-through or write-back.
>
> If the original poster's problem is truly a burst-write problem (and
> not Linux caching or virtual memory overcommitment) then write-back is
> key.

The MegaRAID can be configured either way. It defaults to write-back if the battery-backed cache is present, I believe.
On Tue, 11 May 2004, Rob Sell wrote:
> If you don't mind slumming on ebay :-) keep an eye out for PERC III
> cards, they are dell branded LSI cards. Perc = Power Edge Raid
> Controller. There are models on there dual channel u320 and dell
> usually sells them with battery backed cache. That's how I have
> acquired all my high end raid cards.

Not all PERC 3s are LSI; many are Adaptec. The PERC 3/Di is Adaptec; the PERC 3/DC is LSI/MegaRAID.
On Tue, 2004-05-11 at 14:52, Paul Tuckfield wrote:
> Love that froogle.
>
> It looks like a nice card. One thing I didn't get straight is whether
> the cache is write-through or write-back.

The LSI MegaRAID reading/writing/caching behavior is user-configurable. It will support both write-back and write-through and, IIRC, three different algorithms for reading (none, read-ahead, adaptive). Plenty of configuration options. It is a pretty mature and feature-complete hardware RAID implementation.

j. andrew rogers
>>>>> "TL" == Tom Lane <tgl@sss.pgh.pa.us> writes:

TL> I've seen installations in which it seemed that the "normal" query load
TL> was close to saturating the available disk bandwidth, and the extra load
TL> imposed by a background VACUUM just pushed the entire system's response
TL> time over a cliff. In an installation that has I/O capacity to spare,

me stand up waving hand... ;-)

This is my only killer problem left. I always peg my disk usage at 100% when vacuum runs, and other queries are slow too. When not running vacuum, my queries are incredibly zippy fast, including joins and counts and group by's on upwards of 100k rows at a time.

TL> I suspect that the same observations hold true for checkpoints, though
TL> I haven't specifically seen an installation suffering from that effect.

I don't see that. But I also set checkpoint segments to about 50 on my big server.

--
Vivek Khera, Ph.D.          Khera Communications, Inc.
Internet: khera@kciLink.com Rockville, MD +1-301-869-4449 x806
AIM: vivekkhera Y!: vivek_khera http://www.khera.org/~vivek/
>>>>> "JAR" == J Andrew Rogers <jrogers@neopolitan.com> writes:

JAR> The LSI MegaRAID reading/writing/caching behavior is user configurable.
JAR> It will support both write-back and write-through, and IIRC, three
JAR> different algorithms for reading (none, read-ahead, adaptive). Plenty
JAR> of configuration options.

For PG max performance, you want to set it for write-back and read-ahead (adaptive has been claimed to be bad, but I got similar performance from read-ahead and adaptive, so YMMV).
Quoting Vivek Khera <khera@kcilink.com>:
> TL> I suspect that the same observations hold true for checkpoints,
> TL> though I haven't specifically seen an installation suffering from
> TL> that effect.
>
> I don't see that. But I also set checkpoint segments to about 50 on
> my big server.

But wouldn't that affect checkpoint frequency, not checkpoint cost?

Jack Orenstein
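[Jack's intuition is right: checkpoint_segments and checkpoint_timeout set the spacing between checkpoints, not the cost of each one. With 16 MB WAL segments (the 7.x default), the spacing is simple arithmetic — a hedged sketch, with the WAL rates purely illustrative:]

```python
SEGMENT_MB = 16  # default WAL segment size in PostgreSQL 7.x

def next_checkpoint_s(wal_mb_per_s, checkpoint_segments, checkpoint_timeout_s):
    """Seconds until the next checkpoint: either checkpoint_segments
    WAL segments fill up, or checkpoint_timeout fires, whichever comes
    first. Raising either setting spreads checkpoints out; it doesn't
    shrink the burst of dirty-page I/O each one performs."""
    fill_s = checkpoint_segments * SEGMENT_MB / wal_mb_per_s
    return min(fill_s, checkpoint_timeout_s)

# Jack's test settings (40 segments, 600 s timeout): at ~1 MB/s of WAL,
# the timeout fires first, matching his ~10-minute spike interval.
print(next_checkpoint_s(1.0, 40, 600))  # -> 600
# At a higher WAL rate the segment limit wins instead:
print(next_checkpoint_s(4.0, 40, 600))  # -> 160.0
```

Less frequent checkpoints can even mean a *bigger* burst each time, since more dirty pages accumulate between them, which is the trade-off the background writer is meant to break.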
On May 12, 2004, at 3:22 PM, jao@geophile.com wrote:
>> I don't see that. But I also set checkpoint segments to about 50 on
>> my big server.
>
> But wouldn't that affect checkpoint frequency, not checkpoint cost?

Seems reasonable. I suppose checkpointing doesn't cost as much disk I/O as vacuum does. My checkpoints are also on a separate RAID volume on a separate RAID channel, so perhaps that gives me extra bandwidth to perform the checkpoints.