Thread: Quad Opteron stuck in the mud
Gurus,

A table in one of my databases has just crossed the 30 million row mark and has begun to feel very sluggish for just about anything I do with it. I keep the entire database vacuumed regularly. And, as long as I'm not doing a sequential scan, things seem reasonably quick most of the time.

I'm now thinking that my problem is I/O, because anything that involves heavy I/O ( like a seq scan ) seems to slow to a crawl. Even if I am using indexed fields to grab a few thousand rows, once it goes to sequential scans it gets very, very slow. I have also had queries that will not finish for days ( I eventually have to kill them ). I was hoping to provide an explain analyze for them, but if they never finish... even the explain never finishes when I try that.

For example, as I'm writing this, I am running an UPDATE statement that will affect a small part of the table, and is querying on an indexed boolean field. I have been waiting for over an hour and a half as I write this and it still hasn't finished. I'm thinking "I bet Tom, Simon or Josh wouldn't put up with this kind of wait time..", so I thought I would see if anyone here had some pointers. Maybe I have a really stupid setting in my conf file that is causing this. I really can't believe I am at the limits of this hardware, however.

The query:

    update eventactivity set ftindex = false where ftindex = true;

( added the where clause because I don't want to alter where ftindex is null )

The table:

       Column    |            Type             | Modifiers
    -------------+-----------------------------+-----------
     entrydate   | timestamp without time zone |
     incidentid  | character varying(40)       |
     statustype  | character varying(20)       |
     unitid      | character varying(20)       |
     recordtext  | character varying(255)      |
     recordtext2 | character varying(255)      |
     insertdate  | timestamp without time zone |
     ftindex     | boolean                     |
    Indexes:
        eventactivity1 btree (incidentid),
        eventactivity_entrydate_idx btree (entrydate),
        eventactivity_ftindex_idx btree (ftindex),
        eventactivity_oid_idx btree (oid)

The hardware:

    4 x 2.2GHz Opterons
    12 GB of RAM
    4x10k 73GB Ultra320 SCSI drives in RAID 0+1
    1GB hardware cache memory on the RAID controller

The OS:

    Fedora, kernel 2.6.6-1.435.2.3smp ( redhat stock kernel )
    filesystem is mounted as ext2

#####

vmstat output ( as I am waiting for this to finish ):

    procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
     r  b   swpd    free   buff    cache   si   so    bi    bo   in    cs us sy id wa
     0  1   5436 2823908  26140  9183704    0    1  2211   540  694   336  9  2 76 13

#####

iostat output ( as I am waiting for this to finish ):

    avg-cpu:  %user   %nice    %sys %iowait   %idle
               9.19    0.00    2.19   13.08   75.53

    Device:            tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
    cciss/c0d0      329.26     17686.03      4317.57  161788630   39496378

#####

This is a dedicated postgresql server, so maybe some of these settings are more liberal than they should be? Relevant ( I hope ) postgresql.conf options are:

    shared_buffers = 50000
    effective_cache_size = 1348000
    random_page_cost = 3
    work_mem = 512000
    max_fsm_pages = 80000
    log_min_duration_statement = 60000
    fsync = true ( not sure if I'm daring enough to run without this )
    wal_buffers = 1000
    checkpoint_segments = 64
    checkpoint_timeout = 3000

    #---- FOR PG_AUTOVACUUM --#
    stats_command_string = true
    stats_row_level = true

Thanks in advance,
Dan
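One quick way to gauge whether a table like this has bloated after big batch updates is to compare its on-disk size against its live row count ( a diagnostic sketch only; it assumes the default 8 KB block size ):

    -- relpages is 8 KB blocks; if the size far exceeds what reltuples
    -- would justify, much of the "heavy I/O" is spent reading dead space
    select relname, relpages, reltuples,
           relpages * 8 / 1024 as approx_mb
    from pg_class
    where relname = 'eventactivity';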
So sorry, I forgot to mention I'm running version 8.0.1.

Thanks
Dan Harris wrote:
> Gurus,
>
> even the explain never finishes when I try that.

Just a short bit. If "EXPLAIN SELECT" doesn't return, there is a very serious problem, because EXPLAIN doesn't actually run the query; it only runs the query planner. And the query planner shouldn't ever get heavily stuck.

I might be wrong, but there may be something much more substantially wrong than slow i/o.

John
=:->
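To make John's distinction concrete, a minimal sketch against Dan's table from the first post:

    -- Plans only; returns almost instantly regardless of table size
    explain
    update eventactivity set ftindex = false where ftindex = true;

    -- Plans AND executes, so it runs as long as the real statement would
    explain analyze
    update eventactivity set ftindex = false where ftindex = true;

Note that EXPLAIN ANALYZE of an UPDATE really performs the update, so it would normally be wrapped in BEGIN ... ROLLBACK when you only want the timing.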
On Jul 13, 2005, at 1:11 PM, John A Meinel wrote:
>
> I might be wrong, but there may be something much more substantially
> wrong than slow i/o.
> John

Yes, I'm afraid of that too. I just don't know what tools I should use to figure that out. I have some 20 other databases on this system, same schema but varying sizes, and the small ones perform very well. It feels like there is an O(n) increase in wait time that has recently become very noticeable on the largest of them.

-Dan
* Dan Harris (fbsd@drivefaster.net) wrote:
> On Jul 13, 2005, at 1:11 PM, John A Meinel wrote:
> > I might be wrong, but there may be something much more substantially
> > wrong than slow i/o.
>
> Yes, I'm afraid of that too. I just don't know what tools I should
> use to figure that out. I have some 20 other databases on this
> system, same schema but varying sizes, and the small ones perform
> very well. It feels like there is an O(n) increase in wait time that
> has recently become very noticeable on the largest of them.

Could you come up w/ a test case that others could reproduce where explain isn't returning? I think that would be very useful towards solving at least that issue...

Thanks,

Stephen
On Wed, Jul 13, 2005 at 01:16:25PM -0600, Dan Harris wrote:
> On Jul 13, 2005, at 1:11 PM, John A Meinel wrote:
>
> > I might be wrong, but there may be something much more substantially
> > wrong than slow i/o.
>
> Yes, I'm afraid of that too. I just don't know what tools I should
> use to figure that out. I have some 20 other databases on this
> system, same schema but varying sizes, and the small ones perform
> very well. It feels like there is an O(n) increase in wait time that
> has recently become very noticeable on the largest of them.

I'd guess it's stuck on some lock. Try that EXPLAIN, and when it blocks, watch the pg_locks view for locks not granted to the process executing the EXPLAIN. Then check what else is holding the locks.

--
Alvaro Herrera (<alvherre[a]alvh.no-ip.org>)
"Rebellion is the original virtue of man" (Arthur Schopenhauer)
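A sketch of the kind of pg_locks query Alvaro is describing ( joining to pg_class for readable relation names is just one way to do it ):

    -- Show ungranted lock requests and the relations they are waiting on
    select l.pid, l.mode, l.granted, c.relname
    from pg_locks l
    left join pg_class c on c.oid = l.relation
    where not l.granted;

    -- Then see which backends hold locks on the same relation
    select l.pid, l.mode, l.granted
    from pg_locks l
    join pg_class c on c.oid = l.relation
    where c.relname = 'eventactivity';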
On Jul 13, 2005, at 2:17 PM, Stephen Frost wrote:
>
> Could you come up w/ a test case that others could reproduce where
> explain isn't returning?

This was simply due to my n00bness :) I had always been doing explain analyze, instead of just explain. Next time one of these queries comes up, I will be sure to do the explain without analyze.

FYI, that update query I mentioned in the initial thread just finished after updating 8.3 million rows.

-Dan
On Jul 13, 2005, at 2:54 PM, Dan Harris wrote:
> 4 x 2.2GHz Opterons
> 12 GB of RAM
> 4x10k 73GB Ultra320 SCSI drives in RAID 0+1
> 1GB hardware cache memory on the RAID controller

If it is taking that long to update about 25% of your table, then you must be I/O bound. Check I/O while you're running a big query.

Also, what RAID controller are you running? Be sure you have the latest BIOS and drivers for it.

On a pair of dual Opterons, I can do large operations on tables with 100 million rows much faster than you seem to be able to. I have MegaRAID 320-2x controllers with 15kRPM drives.

Vivek Khera, Ph.D.
+1-301-869-4449 x806
On Wed, 2005-07-13 at 12:54 -0600, Dan Harris wrote:
> For example, as I'm writing this, I am running an UPDATE statement
> that will affect a small part of the table, and is querying on an
> indexed boolean field.

An indexed boolean field? Hopefully, ftindex is false for very few rows of the table?

Try changing the ftindex to be a partial index, so only index the false values. Or don't index it at all.

Split the table up into smaller pieces.

Don't use an UPDATE statement. Keep a second table, and insert records into it when you would have updated previously. If a row is not found, you know that it has ftindex=true. That way, you'll never have row versions building up in the main table, which you'll still get even if you VACUUM.

Best Regards, Simon Riggs
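A sketch of the partial-index idea Simon describes ( the index name here is made up, and it assumes, per his suggestion, that the interesting rows are the ftindex = false ones ):

    -- Only rows with ftindex = false appear in the index, so it stays
    -- small if most of the table has already been processed
    create index eventactivity_needs_ft_idx
        on eventactivity (ftindex)
        where ftindex = false;

A query or UPDATE with "where ftindex = false" can then use this small index instead of a full-table index on a two-valued column.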
Dan Harris <fbsd@drivefaster.net> writes:

> I keep the entire database vacuumed regularly.

How often is "regularly"? We get frequent posts from people who think daily or every 4 hours is often enough. If the table is very busy you can need vacuums as often as every 15 minutes.

Also, if you've done occasional massive batch updates like you describe here you may need a VACUUM FULL or alternatively a CLUSTER command to compact the table -- vacuum identifies the free space but if you've doubled the size of your table with a large update that's a lot more free space than you want hanging around waiting to be used.

> For example, as I'm writing this, I am running an UPDATE statement that will
> affect a small part of the table, and is querying on an indexed boolean field.
...
> update eventactivity set ftindex = false where ftindex = true; ( added the
> where clause because I don't want to alter where ftindex is null )

It's definitely worthwhile doing an "EXPLAIN UPDATE..." to see if this even used the index. It sounds like it did a sequential scan.

Sequential scans during updates are especially painful. If there isn't free space lying around in the page where the updated record lies then another page has to be used or a new page added. If you're doing a massive update you can exhaust the free space available, making the update have to go back and forth between the page being read and the end of the table where pages are being written.

> #####
>
> vmstat output ( as I am waiting for this to finish ):
> procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
>  r  b   swpd    free   buff    cache   si   so    bi    bo   in    cs us sy id wa
>  0  1   5436 2823908  26140  9183704    0    1  2211   540  694   336  9  2 76 13

[I assume you ran "vmstat 10" or some other interval and then waited for at least the second line? The first line outputted from vmstat is mostly meaningless.]

Um. That's a pretty meager i/o rate. Just over 2MB/s. The cpu is 76% idle, which sounds fine, but that could be one processor pegged at 100% while the others are idle. If this query is the only one running on the system then it would behave just like that.

Is it possible you have some foreign keys referencing these records that you're updating? In which case every record being updated might be causing a full table scan on another table (or multiple other tables). If those tables are entirely in cache then it could cause these high cpu low i/o symptoms.

Or are there any triggers on this table?

--
greg
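For reference, a sketch of the two compaction options Greg mentions, using the table and index names from Dan's first post ( both take an exclusive lock on the table while they run ):

    -- Rewrites the table in the order of the named index, dropping dead space
    -- (8.0 uses the "CLUSTER indexname ON tablename" form)
    cluster eventactivity1 on eventactivity;

    -- Or: compact in place by moving rows; usually slower when there is a
    -- lot of free space to reclaim
    vacuum full eventactivity;

Running ANALYZE afterwards is commonly recommended, since either command substantially reorganizes the table.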
On Jul 14, 2005, at 12:12 AM, Greg Stark wrote:

> Dan Harris <fbsd@drivefaster.net> writes:
>
>> I keep the entire database vacuumed regularly.
>
> How often is "regularly"?

Well, once every day, but there aren't a ton of inserts or updates going on a daily basis. Maybe 1,000 total inserts?

> Also, if you've done occasional massive batch updates like you describe here
> you may need a VACUUM FULL or alternatively a CLUSTER command to compact the
> table -- vacuum identifies the free space but if you've doubled the size of
> your table with a large update that's a lot more free space than you want
> hanging around waiting to be used.

I have a feeling I'm going to need to do a cluster soon. I have done several mass deletes and reloads on it.

>> For example, as I'm writing this, I am running an UPDATE statement that will
>> affect a small part of the table, and is querying on an indexed boolean field.
> ...
>> update eventactivity set ftindex = false where ftindex = true; ( added the
>> where clause because I don't want to alter where ftindex is null )
>
> It's definitely worthwhile doing an "EXPLAIN UPDATE..." to see if this even
> used the index. It sounds like it did a sequential scan.

I tried that, and indeed it was using an index, although after reading Simon's post, I realize that was kind of dumb to have an index on a bool. I have since removed it.

> Sequential scans during updates are especially painful. If there isn't free
> space lying around in the page where the updated record lies then another page
> has to be used or a new page added. If you're doing a massive update you can
> exhaust the free space available, making the update have to go back and forth
> between the page being read and the end of the table where pages are being
> written.

This is great info, thanks.

>> #####
>>
>> vmstat output ( as I am waiting for this to finish ):
>> procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
>>  r  b   swpd    free   buff    cache   si   so    bi    bo   in    cs us sy id wa
>>  0  1   5436 2823908  26140  9183704    0    1  2211   540  694   336  9  2 76 13
>
> [I assume you ran "vmstat 10" or some other interval and then waited for at
> least the second line? The first line outputted from vmstat is mostly
> meaningless.]

Yeah, this was at least 10 or so down the list ( the last one before ctrl-c ).

> Um. That's a pretty meager i/o rate. Just over 2MB/s. The cpu is 76% idle,
> which sounds fine, but that could be one processor pegged at 100% while the
> others are idle. If this query is the only one running on the system then it
> would behave just like that.

Well, none of my processors had ever reached 100% until I changed to ext2 today ( read below for more info ).

> Is it possible you have some foreign keys referencing these records that
> you're updating? In which case every record being updated might be causing a
> full table scan on another table (or multiple other tables). If those tables
> are entirely in cache then it could cause these high cpu low i/o symptoms.

No foreign keys or triggers.

Ok, so I remounted this drive as ext2 shortly before sending my first email today. It didn't take long at all to notice the ABSOLUTELY HUGE difference in performance. Ext3 must really be crappy for postgres, or at least is on this box. Now that it's ext2, this thing is flying like never before. My CPU utilization has skyrocketed, telling me that the disk IO was constraining it immensely.

I always knew that it might be a little faster, but the box feels like it can "breathe" again, and things that used to be IO intensive and run for an hour or more are now running in < 5 minutes. I'm a little worried about not having a journaled file system, but that performance difference will keep me from switching back ( at least to ext3! ). Maybe someday I will try XFS.

I would be surprised if everyone who ran ext3 had this kind of problem; maybe it's specific to my kernel or raid controller, I don't know. But, this is amazing. It's like I have a new server.

Thanks to everyone for their valuable input and a big thanks to all the dedicated pg developers on here who make this possible!

-Dan
On Thu, Jul 14, 2005 at 12:28:05AM -0600, Dan Harris wrote:
> Ok, so I remounted this drive as ext2 shortly before sending my first
> email today. It didn't take long at all to notice the ABSOLUTELY
> HUGE difference in performance. Ext3 must really be crappy
> for postgres, or at least is on this box. Now that it's ext2, this
> thing is flying like never before. My CPU utilization has
> skyrocketed, telling me that the disk IO was constraining it immensely.

Were you using the default journal settings for ext3? An interesting experiment would be to use the other journal options (particularly data=writeback). From the mount manpage:

    data=journal / data=ordered / data=writeback
           Specifies the journalling mode for file data. Metadata is
           always journaled. To use modes other than ordered on the
           root file system, pass the mode to the kernel as boot
           parameter, e.g. rootflags=data=journal.

           journal
                  All data is committed into the journal prior to
                  being written into the main file system.

           ordered
                  This is the default mode. All data is forced
                  directly out to the main file system prior to its
                  metadata being committed to the journal.

           writeback
                  Data ordering is not preserved - data may be
                  written into the main file system after its
                  metadata has been committed to the journal. This
                  is rumoured to be the highest-throughput option.
                  It guarantees internal file system integrity,
                  however it can allow old data to appear in files
                  after a crash and journal recovery.

--
Alvaro Herrera (<alvherre[a]alvh.no-ip.org>)
Officer Krupke, what are we to do?
Gee, officer Krupke, Krup you!
(West Side Story, "Gee, Officer Krupke")
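A hypothetical example of the experiment Alvaro is suggesting -- the device, mount point, and partition are made up here; check /etc/fstab for the real ones ( the iostat output earlier only shows the device is cciss/c0d0 ):

    # after unmounting, remount the postgres volume with metadata-only journaling
    umount /var/lib/pgsql
    mount -t ext3 -o data=writeback /dev/cciss/c0d0p1 /var/lib/pgsql

or, persistently, as an fstab line:

    /dev/cciss/c0d0p1  /var/lib/pgsql  ext3  defaults,data=writeback  1 2

As the manpage excerpt notes, changing the data mode on the root filesystem requires the rootflags boot parameter instead.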
On Jul 14, 2005, at 9:47 AM, Alvaro Herrera wrote:
> On Thu, Jul 14, 2005 at 12:28:05AM -0600, Dan Harris wrote:
>
>> Ext3 must really be crappy for postgres, or at least is on this box.
>
> Were you using the default journal settings for ext3?

Yes, I was. Next time I get a chance to reboot this box, I will try writeback and compare the benchmarks to my previous config. Thanks for the tip.
Dan Harris <fbsd@drivefaster.net> writes:

> Well, once every day, but there aren't a ton of inserts or updates going on a
> daily basis. Maybe 1,000 total inserts?

It's actually deletes and updates that matter, not inserts.

> I have a feeling I'm going to need to do a cluster soon. I have done several
> mass deletes and reloads on it.

CLUSTER effectively does a VACUUM FULL but takes a different approach and writes out a whole new table, which if there's lots of free space is faster than moving records around to compact the table.

> I tried that, and indeed it was using an index, although after reading Simon's
> post, I realize that was kind of dumb to have an index on a bool. I have since
> removed it.

If there are very few records (like well under 10%) with that column equal to false (or very few equal to true) then it's not necessarily useless. But probably more useful is a partial index on some other column. Something like

    create index eventactivity_incident_notft_idx
        on eventactivity (incidentid)
        where ftindex = false;

> No foreign keys or triggers.

Note that I'm talking about foreign keys in *other* tables that refer to columns in this table. Every update on this table would have to scan those other tables looking for records referencing the updated rows.

> Ok, so I remounted this drive as ext2 shortly before sending my first email
> today. It didn't take long at all to notice the ABSOLUTELY HUGE difference
> in performance. Ext3 must really be crappy for postgres, or at least
> is on this box. Now that it's ext2, this thing is flying like never before.
> My CPU utilization has skyrocketed, telling me that the disk IO was
> constraining it immensely.
>
> I always knew that it might be a little faster, but the box feels like it can
> "breathe" again, and things that used to be IO intensive and run for an hour or
> more are now running in < 5 minutes. I'm a little worried about not having a
> journaled file system, but that performance difference will keep me from
> switching back ( at least to ext3! ). Maybe someday I will try XFS.

@spock(Fascinating). I wonder if ext3 might be issuing IDE cache flushes on every fsync (to sync the journal) whereas ext2 might not be issuing any cache flushes at all. If the IDE cache is never being flushed then you'll see much better performance but run the risk of data loss in a power failure or hardware failure. (But not in the case of an OS crash, or at least no more than otherwise.)

You could also try using the "-O journal_dev" option to put the ext3 journal on a separate device.

--
greg
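A sketch of how one could double-check Greg's foreign-key point from the system catalogs ( pg_constraint has carried these columns since 7.3; this only lists constraints, it doesn't prove the referencing columns are unindexed ):

    -- List any foreign keys in other tables that reference eventactivity
    select conname, conrelid::regclass as referencing_table
    from pg_constraint
    where contype = 'f'
      and confrelid = 'eventactivity'::regclass;

If this returns rows, updates that change a referenced key value (and all deletes) must look the old value up in each referencing table, which means a sequential scan there unless the referencing columns are indexed.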