Re: Waiting on ExclusiveLock on extension - Mailing list pgsql-general

From: Jim Nasby
Subject: Re: Waiting on ExclusiveLock on extension
Msg-id: 55340B46.1010506@BlueTreble.com
In response to: Re: Waiting on ExclusiveLock on extension (Andomar <andomar@aule.net>)
List: pgsql-general
On 4/19/15 4:24 AM, Andomar wrote:
> To put the top question first:
>
> How can table extension locks explain a massive spike in CPU usage?
>
> I can imagine 400 connections waiting on disk I/O, but then, wouldn't
> they all be sleeping?

Not necessarily. Spinlocks busy-wait instead of putting the process to
sleep, so a backend stuck on one burns CPU; they're also supposed to be
very short lived.
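
If you can catch it in the act, something like this should show what the
waiters are blocked on (an untested sketch; on 9.x pg_stat_activity
still has the old "waiting" boolean):

  SELECT a.pid, l.locktype, l.relation::regclass AS rel, l.mode
    FROM pg_stat_activity a
    JOIN pg_locks l ON l.pid = a.pid AND NOT l.granted
   WHERE a.waiting;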

>  > Ok, that's a MAJOR hint, because relation 1249 is a system catalog;
>  > namely pg_attribute. So I think what's happening here is that your
>  > catalog has become horrifically bloated. I'm 99% certain that VACUUM ALL
>  > will not vacuum the catalog tables.
>  >
>  > Do you by chance have autovacuum turned off?
>  >
>  > A manual VACUUM VERBOSE pg_attribute might provide some immediate
>  > relief.
>  >
> Autovacuum is turned on.  In addition, we do a manual VACUUM ALL at
> night.  VACUUM VERBOSE pg_attribute ran in 0 seconds and processed a few
> hundred rows.
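
Good to know; that pretty much rules out pg_attribute bloat. If you want
to eyeball the rest of the catalogs too, a rough sketch (plain SQL,
nothing version-specific):

  SELECT relname, pg_size_pretty(pg_total_relation_size(oid)) AS size,
         reltuples::bigint AS approx_rows
    FROM pg_class
   WHERE relnamespace = (SELECT oid FROM pg_namespace
                          WHERE nspname = 'pg_catalog')
     AND relkind = 'r'
   ORDER BY pg_total_relation_size(oid) DESC
   LIMIT 10;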
>
>  > Are you using a connection pool? Establishing 50 new database
>  > connections per second won't do anything to help performance...
>  >
> As I understand it, a pool reduces network and CPU load.  We have never
> seen any issues with those.  So the extra monitoring and maintenance
> cost of a pool seems hard to justify.

Well, it sounds like you are CPU bound here... :P I don't know if this
is related or not, but it wouldn't hurt. If you install PgBouncer on
the database server itself (which is how it's designed to be run), it
shouldn't add much maintenance cost.
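
For what it's worth, a minimal pgbouncer.ini sketch (the database name,
paths and pool sizes are just placeholders, not a recommendation):

  [databases]
  mydb = host=127.0.0.1 port=5432 dbname=mydb

  [pgbouncer]
  listen_addr = 127.0.0.1
  listen_port = 6432
  auth_type = md5
  auth_file = /etc/pgbouncer/userlist.txt
  pool_mode = transaction
  max_client_conn = 400
  default_pool_size = 20

Point the application at port 6432 instead of 5432 and the 400 client
connections get funneled into a much smaller set of server connections.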

>  > I think what that means is that there was suddenly a big spike in memory
>  > demand at the OS level, so now the OS is frantically dumping cached
>  > pages. That in itself won't explain this, but it may be a clue.
>  >
> We monitor memory usage with Cacti.  It's a dedicated server and nearly
> all memory is used as cache.  If a script runs and demands memory, that
> becomes visible as cache is cleared out.  There is no change in the
> amount of memory used as cache around the outage.
>
>  > In order to extend a relation we need to ask the filesystem to actually
>  > extend the file (which presumably means at least writing some metadata
>  > to disk), and then I think we create a WAL record. Creating the WAL
>  > record won't by itself write to disk... *unless* the wal_buffers are all
>  > already full.
>  >
> I have a question here, we have "synchronous_commit = off".  So when
> Postgres extends a page, would it do that just in memory, or does part
> of the extend operation require synchronous I/O?

Turning that off doesn't mean there will never be an fsync; it just
means we don't wait for one before returning from COMMIT. I don't
think relation extension itself requires an fsync, but see below.
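
If you want to rule out WAL buffer pressure while you're in there,
checking the current settings costs nothing:

  SHOW synchronous_commit;
  SHOW wal_buffers;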

>  > So if you also see an I/O spike when this happens you could well
>  > just be starved from the I/O system (though obviously it'd be
>  > better if we handled that situation more elegantly than this).
>
> The SAR data shows no increase in pgpgin/s and pgpgout/s, which if I
> understand it correctly, means that there is no I/O spike.  There is
> however an enormous increase in CPU usage.

I'm not familiar enough with SAR to know if that's correct or not;
iostat would be a good way to confirm it.
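
For example, something like:

  iostat -x 1

gives per-device extended stats once a second; if %util sits near 100
or await climbs during the event, it's an I/O bottleneck after all.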

>  > I do suspect your pgfree/s is very high though; putting 200k pages/s on
>  > the free list seems like something's broken.
>  >
> The system has a constant and considerable load of small writes.  The
> pg_activity tool shows 300 IOPS sustained (it claims a max of over
> 11,000 IOPS).  Postgres 9.3 had a comparable pgfree/s.

That leads me to a new theory... you may be running into problems
finding free buffers in the buffer pool. We need to have a buffer before
we can extend a relation, and if you have a lot of pressure on shared
buffers it can take quite a bit of CPU to find one. To make matters
worse, that search for a buffer takes place while holding the extension
lock.
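
pg_stat_bgwriter gives a rough read on that pressure; sample it twice
and see how fast buffers_alloc climbs between samples (just a sketch):

  SELECT buffers_alloc, buffers_backend, buffers_clean, maxwritten_clean
    FROM pg_stat_bgwriter;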

Would you be able to get a stack trace of a backend that's holding an
extension lock? Or maybe perf would provide some insight.
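
Something along these lines (the <pid> is whatever backend currently
holds the lock; locktype 'extend' is the relation extension lock):

  SELECT pid FROM pg_locks WHERE locktype = 'extend' AND granted;

and then, as root on the server:

  gdb --batch -ex bt -p <pid>
  perf top -p <pid>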

> Would you know a good resource to get more knowledgeable about pgfree,
> pgpgin, pgsteal?

Unfortunately I'm not strong on the system tools.
--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com

