Thread: Quad Opteron stuck in the mud

From: Dan Harris
Gurus,

A table in one of my databases has just crossed the 30 million row
mark and has begun to feel very sluggish for just about anything I do
with it.  I keep the entire database vacuumed regularly.  And, as
long as I'm not doing a sequential scan, things seem reasonably quick
most of the time.  I'm now thinking that my problem is I/O, because
anything that involves heavy I/O (like a seq scan) seems to slow to
a crawl.  Even when I use indexed fields to grab a few thousand
rows, anything that then falls back to a sequential scan gets very,
very slow.

I have also had the occurrence where queries will not finish for days
( I eventually have to kill them ).  I was hoping to provide an
explain analyze for them, but if they never finish... even the
explain never finishes when I try that.

For example, as I'm writing this, I am running an UPDATE statement
that will affect a small part of the table, and is querying on an
indexed boolean field.

I have been waiting for over an hour and a half as I write this and
it still hasn't finished.  I'm thinking "I bet Tom, Simon or Josh
wouldn't put up with this kind of wait time..", so I thought I would
see if anyone here had some pointers.  Maybe I have a really stupid
setting in my conf file that is causing this.  I really can't believe
I am at the limits of this hardware, however.


The query:
update eventactivity set ftindex = false where ftindex = true;
( added the where clause because I don't want to alter where ftindex
is null )



The table:
   Column    |            Type             | Modifiers
-------------+-----------------------------+-----------
entrydate   | timestamp without time zone |
incidentid  | character varying(40)       |
statustype  | character varying(20)       |
unitid      | character varying(20)       |
recordtext  | character varying(255)      |
recordtext2 | character varying(255)      |
insertdate  | timestamp without time zone |
ftindex     | boolean                     |
Indexes: eventactivity1 btree (incidentid),
          eventactivity_entrydate_idx btree (entrydate),
          eventactivity_ftindex_idx btree (ftindex),
          eventactivity_oid_idx btree (oid)




The hardware:

4 x 2.2GHz Opterons
12 GB of RAM
4x10k 73GB Ultra320 SCSI drives in RAID 0+1
1GB hardware cache memory on the RAID controller

The OS:
Fedora, kernel 2.6.6-1.435.2.3smp ( redhat stock kernel )
filesystem is mounted as ext2

#####

vmstat output ( as I am waiting for this to finish ):
procs -----------memory---------- ---swap-- -----io---- --system-- ----cpu----
 r  b   swpd   free   buff  cache   si   so    bi    bo   in    cs us sy id wa
 0  1   5436 2823908  26140 9183704    0    1  2211   540  694   336  9  2 76 13

#####

iostat output ( as I am waiting for this to finish ):
avg-cpu:  %user   %nice    %sys %iowait   %idle
            9.19    0.00    2.19   13.08   75.53

Device:            tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
cciss/c0d0      329.26     17686.03      4317.57  161788630   39496378


#####
This is a dedicated postgresql server, so maybe some of these
settings are more liberal than they should be?

relevant ( I hope ) postgresql.conf options are:

shared_buffers = 50000
effective_cache_size = 1348000
random_page_cost = 3
work_mem = 512000
max_fsm_pages = 80000
log_min_duration_statement = 60000
fsync = true ( not sure if I'm daring enough to run without this )
wal_buffers = 1000
checkpoint_segments = 64
checkpoint_timeout = 3000


#---- FOR PG_AUTOVACUUM --#
stats_command_string = true
stats_row_level = true

Thanks in advance,
Dan
Re: Quad Opteron stuck in the mud

From: Dan Harris
So sorry, I forgot to mention I'm running version 8.0.1

Thanks


Re: Quad Opteron stuck in the mud

From: John A Meinel
Dan Harris wrote:
> Gurus,
>

 > even the  explain never
> finishes when I try that.

Just a short note: if "EXPLAIN SELECT" doesn't return, there seems to be
a very serious problem, because EXPLAIN doesn't actually run the
query; it only runs the query planner, and the planner shouldn't
ever get heavily stuck.

I might be wrong, but there may be something much more substantially
wrong than slow i/o.
John
=:->


Re: Quad Opteron stuck in the mud

From: Dan Harris
On Jul 13, 2005, at 1:11 PM, John A Meinel wrote:

>
> I might be wrong, but there may be something much more substantially
> wrong than slow i/o.
> John
>

Yes, I'm afraid of that too.  I just don't know what tools I should
use to figure that out.  I have some 20 other databases on this
system, same schema but varying sizes, and the small ones perform
very well.  It feels like there is an O(n) increase in wait time that
has recently become very noticeable on the largest of them.

-Dan

Re: Quad Opteron stuck in the mud

From: Stephen Frost
* Dan Harris (fbsd@drivefaster.net) wrote:
> On Jul 13, 2005, at 1:11 PM, John A Meinel wrote:
> >I might be wrong, but there may be something much more substantially
> >wrong than slow i/o.
>
> Yes, I'm afraid of that too.  I just don't know what tools I should
> use to figure that out.  I have some 20 other databases on this
> system, same schema but varying sizes, and the small ones perform
> very well.  It feels like there is an O(n) increase in wait time that
> has recently become very noticeable on the largest of them.

Could you come up w/ a test case that others could reproduce where
explain isn't returning?  I think that would be very useful towards
solving at least that issue...

    Thanks,

        Stephen


Re: Quad Opteron stuck in the mud

From: Alvaro Herrera
On Wed, Jul 13, 2005 at 01:16:25PM -0600, Dan Harris wrote:

> On Jul 13, 2005, at 1:11 PM, John A Meinel wrote:
>
> >I might be wrong, but there may be something much more substantially
> >wrong than slow i/o.
>
> Yes, I'm afraid of that too.  I just don't know what tools I should
> use to figure that out.  I have some 20 other databases on this
> system, same schema but varying sizes, and the small ones perform
> very well.  It feels like there is an O(n) increase in wait time that
> has recently become very noticeable on the largest of them.

I'd guess it's stuck on some lock.  Try that EXPLAIN, and when it
blocks, watch the pg_locks view for locks not granted to the process
executing the EXPLAIN.  Then check what else is holding the locks.
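
A sketch of that check, using the 8.0-era catalog and view columns (note
that pg_stat_activity calls the backend pid "procpid" in this version):

```sql
-- Show ungranted locks alongside whatever each backend is running,
-- so you can see who is blocked and who is holding the lock.
SELECT l.pid, l.relation::regclass AS relation, l.mode, l.granted,
       a.current_query
FROM pg_locks l
JOIN pg_stat_activity a ON a.procpid = l.pid
ORDER BY l.granted, l.pid;
```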

--
Alvaro Herrera (<alvherre[a]alvh.no-ip.org>)
"La rebeldía es la virtud original del hombre" (Arthur Schopenhauer)

Re: Quad Opteron stuck in the mud

From: Dan Harris
On Jul 13, 2005, at 2:17 PM, Stephen Frost wrote:
>
> Could you come up w/ a test case that others could reproduce where
> explain isn't returning?

This was simply due to my n00bness :)  I had always been doing
explain analyze, instead of just explain.  Next time one of these
queries comes up, I will be sure to do the explain without analyze.

FYI that update query I mentioned in the initial thread just finished
after updating 8.3 million rows.

-Dan


Re: Quad Opteron stuck in the mud

From: Vivek Khera
On Jul 13, 2005, at 2:54 PM, Dan Harris wrote:

> 4 x 2.2GHz Opterons
> 12 GB of RAM
> 4x10k 73GB Ultra320 SCSI drives in RAID 0+1
> 1GB hardware cache memory on the RAID controller
>

If it is taking that long to update about 25% of your table, then you
must be I/O bound.  Check I/O while you're running a big query.

also, what RAID controller are you running?  be sure you have the
latest BIOS and drivers for it.

On a pair of dual Opterons, I can do large operations on tables with
100 million rows much faster than you seem to be able to.  I have
MegaRAID 320-2x controllers with 15kRPM drives.

Vivek Khera, Ph.D.
+1-301-869-4449 x806




Re: Quad Opteron stuck in the mud

From: Simon Riggs
On Wed, 2005-07-13 at 12:54 -0600, Dan Harris wrote:
> For example, as I'm writing this, I am running an UPDATE statement
> that will affect a small part of the table, and is querying on an
> indexed boolean field.

An indexed boolean field?

Hopefully, ftindex is false for very few rows of the table?

Try changing the ftindex index to a partial index, so that only the
false values are indexed. Or don't index it at all.
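
That partial index might look like the following (the index name is
hypothetical, and this assumes false is the rare value your queries
actually look for):

```sql
-- Replace the full boolean index with one covering only the rare value.
DROP INDEX eventactivity_ftindex_idx;
CREATE INDEX eventactivity_ftpending_idx ON eventactivity (ftindex)
    WHERE ftindex = false;
```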

Split the table up into smaller pieces.

Don't use an UPDATE statement. Keep a second table, and insert records
into it when you would have updated previously. If a row is not found,
you know that it has ftindex=true. That way, you'll never have row
versions building up in the main table, which you'll still get even if
you VACUUM.
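
A sketch of that second-table design (table and column choices are
hypothetical; the point is that presence of a row stands in for the flag):

```sql
-- A row here means "this record still has ftindex = false";
-- absence of a row means ftindex = true.
CREATE TABLE eventactivity_unindexed (
    incidentid character varying(40) NOT NULL
);
-- Instead of UPDATE eventactivity SET ftindex = false WHERE ...,
-- insert the key into eventactivity_unindexed, and delete it once
-- the full-text indexer has caught up.  The big table is never updated.
```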

Best Regards, Simon Riggs




Re: Quad Opteron stuck in the mud

From: Greg Stark
Dan Harris <fbsd@drivefaster.net> writes:

> I keep the entire database vacuumed regularly.

How often is "regularly"? We get frequent posts from people who think daily or
every 4 hours is often enough. If the table is very busy you can need vacuums
as often as every 15 minutes.

Also, if you've done occasional massive batch updates like you describe here
you may need a VACUUM FULL or alternatively a CLUSTER command to compact the
table -- vacuum identifies the free space but if you've doubled the size of
your table with a large update that's a lot more free space than you want
hanging around waiting to be used.
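
Either of those would compact the table; note that both take an exclusive
lock for the duration, and the pre-8.3 CLUSTER syntax names the index
first:

```sql
-- Compact in place (slow when bloat is severe):
VACUUM FULL VERBOSE eventactivity;
-- Or rewrite the whole table in entrydate order, rebuilding indexes too:
CLUSTER eventactivity_entrydate_idx ON eventactivity;
-- Either way, refresh the planner's statistics afterwards:
ANALYZE eventactivity;
```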

> For example, as I'm writing this, I am running an UPDATE statement  that will
> affect a small part of the table, and is querying on an  indexed boolean field.
...
> update eventactivity set ftindex = false where ftindex = true;  ( added the
> where clause because I don't want to alter where ftindex  is null )

It's definitely worthwhile doing an "EXPLAIN UPDATE..." to see if this even
used the index. It sounds like it did a sequential scan.

Sequential scans during updates are especially painful. If there isn't free
space lying around in the page where the updated record lies then another page
has to be used or a new page added. If you're doing a massive update you can
exhaust the free space available making the update have to go back and forth
between the page being read and the end of the table where pages are being
written.

> #####
>
> vmstat output ( as I am waiting for this to finish ):
> procs -----------memory---------- ---swap-- -----io---- --system--
> ----cpu----
> r  b   swpd   free    buff   cache   si   so    bi    bo   in    cs  us sy id wa
> 0  1   5436 2823908  26140 9183704    0    1  2211   540  694   336   9  2 76 13

[I assume you ran "vmstat 10" or some other interval and then waited for at
least the second line? The first line outputted from vmstat is mostly
meaningless]

Um. That's a pretty meager i/o rate. Just over 2MB/s. The cpu is 76% idle
which sounds fine but that could be one processor pegged at 100% while the
others are idle. If this query is the only one running on the system then it
would behave just like that.

Is it possible you have some foreign keys referencing these records that
you're updating? In which case every record being updated might be causing a
full table scan on another table (or multiple other tables). If those tables
are entirely in cache then it could cause these high cpu low i/o symptoms.

Or are there any triggers on this table?


--
greg

Re: Quad Opteron stuck in the mud

From: Dan Harris
On Jul 14, 2005, at 12:12 AM, Greg Stark wrote:

> Dan Harris <fbsd@drivefaster.net> writes:
>
>
>> I keep the entire database vacuumed regularly.
>>
>
> How often is "regularly"?
Well, once every day, but there aren't a ton of inserts or updates
going on a daily basis.  Maybe 1,000 total inserts?
>
> Also, if you've done occasional massive batch updates like you
> describe here
> you may need a VACUUM FULL or alternatively a CLUSTER command to
> compact the
> table -- vacuum identifies the free space but if you've doubled the
> size of
> your table with a large update that's a lot more free space than
> you want
> hanging around waiting to be used.
>
I have a feeling I'm going to need to do a cluster soon.  I have done
several mass deletes and reloads on it.

>
>> For example, as I'm writing this, I am running an UPDATE
>> statement  that will
>> affect a small part of the table, and is querying on an  indexed
>> boolean field.
>>
> ...
>
>> update eventactivity set ftindex = false where ftindex = true;
>> ( added the
>> where clause because I don't want to alter where ftindex  is null )
>>
>
> It's definitely worthwhile doing an "EXPLAIN UPDATE..." to see if
> this even
> used the index. It sounds like it did a sequential scan.
>

I tried that, and indeed it was using an index, although after
reading Simon's post, I realize that was kind of dumb to have an
index on a bool. I have since removed it.

> Sequential scans during updates are especially painful. If there
> isn't free
> space lying around in the page where the updated record lies then
> another page
> has to be used or a new page added. If you're doing a massive
> update you can
> exhaust the free space available making the update have to go back
> and forth
> between the page being read and the end of the table where pages
> are being
> written.

This is great info, thanks.

>
>
>> #####
>>
>> vmstat output ( as I am waiting for this to finish ):
>> procs -----------memory---------- ---swap-- -----io---- --system--
>> ----cpu----
>> r  b   swpd   free    buff   cache   si   so    bi    bo   in
>> cs  us sy id wa
>> 0  1   5436 2823908  26140 9183704    0    1  2211   540  694
>> 336   9  2 76 13
>>
>
> [I assume you ran "vmstat 10" or some other interval and then
> waited for at
> least the second line? The first line outputted from vmstat is mostly
> meaningless]

Yeah, this was at least 10 or so down the list ( the last one before
ctrl-c )

>
> Um. That's a pretty meager i/o rate. Just over 2MB/s. The cpu is
> 76% idle
> which sounds fine but that could be one processor pegged at 100%
> while the
> others are idle. If this query is the only one running on the
> system then it
> would behave just like that.
Well, none of my processors had ever reached 100% until I changed to
ext2 today ( read below for more info )
>
> Is it possible you have some foreign keys referencing these records
> that
> you're updating? In which case every record being updated might be
> causing a
> full table scan on another table (or multiple other tables). If
> those tables
> are entirely in cache then it could cause these high cpu low i/o
> symptoms.
>

No foreign keys or triggers.


Ok, so I remounted this drive as ext2 shortly before sending my first
email today.  It wasn't enough time for me to notice the ABSOLUTELY
HUGE difference in performance change.  Ext3 must really be crappy
for postgres, or at least is on this box.  Now that it's ext2, this
thing is flying like never before.   My CPU utilization has
skyrocketed, telling me that the disk IO was constraining it immensely.

I always knew that it might be a little faster, but the box feels
like it can "breathe" again and things that used to be IO intensive
and run for an hour or more are now running in < 5 minutes.  I'm a
little worried about not having a journalized file system, but that
performance difference will keep me from switching back ( at least to
ext3! ).  Maybe someday I will try XFS.

I would be surprised if everyone who ran ext3 had this kind of
problem, maybe it's specific to my kernel, raid controller, I don't
know.  But, this is amazing.  It's like I have a new server.

Thanks to everyone for their valuable input and a big thanks to all
the dedicated pg developers on here who make this possible!

-Dan


Re: Quad Opteron stuck in the mud

From: Alvaro Herrera
On Thu, Jul 14, 2005 at 12:28:05AM -0600, Dan Harris wrote:

> Ok, so I remounted this drive as ext2 shortly before sending my first
> email today.  It wasn't enough time for me to notice the ABSOLUTELY
> HUGE difference in performance change.  Ext3 must really be crappy
> for postgres, or at least is on this box.  Now that it's ext2, this
> thing is flying like never before.   My CPU utilization has
> skyrocketed, telling me that the disk IO was constraining it immensely.

Were you using the default journal settings for ext3?

An interesting experiment would be to use the other journal options
(particularly data=writeback).  From the mount manpage:

       data=journal / data=ordered / data=writeback
              Specifies  the  journalling  mode  for  file  data.  Metadata is
              always journaled.  To use modes other than ordered on  the  root
              file system, pass the mode to the kernel as boot parameter, e.g.
              rootflags=data=journal.

              journal
                     All data is committed into the  journal  prior  to  being
                     written into the main file system.

              ordered
                     This  is  the  default mode.  All data is forced directly
                     out to the main file system prior to its  metadata  being
                     committed to the journal.

              writeback
                     Data ordering is not preserved - data may be written into
                     the main file system after its metadata has been  commit-
                     ted  to the journal.  This is rumoured to be the highest-
                     throughput option.  It guarantees  internal  file  system
                     integrity,  however  it  can  allow old data to appear in
                     files after a crash and journal recovery.
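
For reference, a hypothetical /etc/fstab line for trying this (the device
and mount point are examples, and changing the data= mode generally
requires a full umount/mount rather than a remount):

```
/dev/cciss/c0d0p1  /var/lib/pgsql  ext3  defaults,data=writeback  1 2
```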


--
Alvaro Herrera (<alvherre[a]alvh.no-ip.org>)
Officer Krupke, what are we to do?
Gee, officer Krupke, Krup you! (West Side Story, "Gee, Officer Krupke")

Re: Quad Opteron stuck in the mud

From: Dan Harris
On Jul 14, 2005, at 9:47 AM, Alvaro Herrera wrote:

> On Thu, Jul 14, 2005 at 12:28:05AM -0600, Dan Harris wrote:
>
>> .  Ext3 must really be crappy
>> for postgres, or at least is on this box.
>
> Were you using the default journal settings for ext3?

Yes, I was.  Next time I get a chance to reboot this box, I will try
writeback and compare the benchmarks to my previous config.  Thanks
for the tip.


Re: Quad Opteron stuck in the mud

From: Greg Stark
Dan Harris <fbsd@drivefaster.net> writes:

> Well, once every day, but there aren't a ton of inserts or updates  going on a
> daily basis.  Maybe 1,000 total inserts?

It's actually deletes and updates that matter, not inserts.

> I have a feeling I'm going to need to do a cluster soon.  I have done  several
> mass deletes and reloads on it.

CLUSTER effectively does a VACUUM FULL but takes a different approach and
writes out a whole new table, which if there's lots of free space is faster
than moving records around to compact the table.

> I tried that, and indeed it was using an index, although after  reading Simon's
> post, I realize that was kind of dumb to have an  index on a bool. I have since
> removed it.

If there are very few records (like well under 10%) with that column equal to
false (or very few equal to true) then it's not necessarily useless. But
probably more useful is a partial index on some other column.

Something like (with placeholder table/column names):

CREATE INDEX mytable_flag_false_idx ON mytable (pk) WHERE flag = false;

> No foreign keys or triggers.

Note that I'm talking about foreign keys in *other* tables that refer to
columns in this table. Every update on this table would have to scan those
other tables looking for records referencing the updated rows.
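
One way to check for such references is to ask the catalogs directly
(this query should work on 8.0):

```sql
-- List foreign-key constraints in other tables that point at eventactivity
SELECT conname, conrelid::regclass AS referencing_table
FROM pg_constraint
WHERE contype = 'f'
  AND confrelid = 'eventactivity'::regclass;
```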


> Ok, so I remounted this drive as ext2 shortly before sending my first  email
> today.  It wasn't enough time for me to notice the ABSOLUTELY  HUGE difference
> in performance change.  Ext3 must really be crappy  for postgres, or at least
> is on this box.  Now that it's ext2, this  thing is flying like never before.
> My CPU utilization has  skyrocketed, telling me that the disk IO was
> constraining it immensely.
>
> I always knew that it might be a little faster, but the box feels  like it can
> "breathe" again and things that used to be IO intensive  and run for an hour or
> more are now running in < 5 minutes.  I'm a  little worried about not having a
> journalized file system, but that  performance difference will keep me from
> switching back ( at least to  ext3! ).  Maybe someday I will try XFS.

@spock(Fascinating).

I wonder if ext3 might be issuing IDE cache flushes on every fsync (to sync
the journal) whereas ext2 might not be issuing any cache flushes at all.

If the IDE cache is never being flushed then you'll see much better
performance but run the risk of data loss in a power failure or hardware
failure. (But not in the case of an OS crash, or at least no more than
otherwise.)

You could also try creating an external journal (mke2fs's "-O
journal_dev" option) to put the ext3 journal on a separate device.
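
A hypothetical sketch of that setup (device names are examples, and the
filesystem must be unmounted while the journal is attached):

```
mke2fs -O journal_dev /dev/sdb1                  # format a partition as an ext3 journal
tune2fs -J device=/dev/sdb1 /dev/cciss/c0d0p1    # attach it to the filesystem
```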

--
greg