Re: Waiting on ExclusiveLock on extension - Mailing list pgsql-general
From | Jim Nasby |
---|---|
Subject | Re: Waiting on ExclusiveLock on extension |
Date | |
Msg-id | 55340B46.1010506@BlueTreble.com |
In response to | Re: Waiting on ExclusiveLock on extension (Andomar <andomar@aule.net>) |
Responses | Re: Waiting on ExclusiveLock on extension, Re: Waiting on ExclusiveLock on extension |
List | pgsql-general |
On 4/19/15 4:24 AM, Andomar wrote:
> To put the top question first:
>
> How can table extension locks explain a massive spike in CPU usage?
>
> I can imagine 400 connections waiting on disk I/O, but then, wouldn't
> they all be sleeping?

Not necessarily. Spinlocks don't put the process to sleep, but they're
also supposed to be very short lived.

> > Ok, that's a MAJOR hint, because relation 1249 is a system catalog;
> > namely pg_attribute. So I think what's happening here is that your
> > catalog has become horrifically bloated. I'm 99% certain that VACUUM ALL
> > will not vacuum the catalog tables.
> >
> > Do you by chance have autovacuum turned off?
> >
> > A manual VACUUM VERBOSE pg_attribute might provide some immediate relief.
>
> Autovacuum is turned on. In addition, we do a manual VACUUM ALL at
> night. VACUUM VERBOSE pg_attribute ran in 0 seconds and processed a few
> hundred rows.

> > Are you using a connection pool? Establishing 50 new database
> > connections per second won't do anything to help performance...
>
> As I understand it, a pool reduces network and CPU load. We have never
> seen any issues with those. So the extra monitoring and maintenance
> cost of a pool seems hard to justify.

Well, it sounds like you are CPU bound here... :P

I don't know if this is related or not, but it wouldn't hurt. If you
install pg_bouncer on the database server itself (which it's designed
for) it shouldn't add much maintenance cost.

> > I think what that means is that there was suddenly a big spike in memory
> > demand at the OS level, so now the OS is frantically dumping cached
> > pages. That in itself won't explain this, but it may be a clue.
>
> We monitor memory usage with Cacti. It's a dedicated server and nearly
> all memory is used as cache. If a script runs and demands memory, that
> becomes visible as cache is cleared out. There is no change in the
> amount of memory used as cache around the outage.

> > In order to extend a relation we need to ask the filesystem to actually
> > extend the file (which presumably means at least writing some metadata
> > to disk), and then I think we create a WAL record. Creating the WAL
> > record won't by itself write to disk... *unless* the wal_buffers are all
> > already full.
>
> I have a question here, we have "synchronous_commit = off". So when
> Postgres extends a page, would it do that just in memory, or does part
> of the extend operation require synchronous I/O?

Turning that off doesn't mean there will never be an fsync, it just means
that we don't wait for one before returning from COMMIT. I don't think
relation extension itself requires an fsync, but see below.

> > So if you also see an I/O spike when this happens you could well
> > just be starved from the I/O system (though obviously it'd be
> > better if we handled that situation more elegantly than this).
>
> The SAR data shows no increase in pgpgin/s and pgpgout/s, which if I
> understand it correctly, means that there is no I/O spike. There is
> however an enormous increase in CPU usage.

I'm not familiar enough with SAR to know if that's correct or not; iostat
would be a good way to confirm it.
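On the Postgres side, something along these lines (just a sketch, untested;
pg_locks reports relation extension locks as locktype = 'extend') would show
which backends are waiting on an extension lock versus actually holding one:

    -- Rough, untested sketch: who is waiting on vs. holding a relation
    -- extension lock right now?  pg_locks uses locktype = 'extend' for these.
    SELECT l.pid,
           l.relation::regclass AS rel,
           l.granted,
           a.query
    FROM pg_locks l
    JOIN pg_stat_activity a ON a.pid = l.pid
    WHERE l.locktype = 'extend'
    ORDER BY l.granted DESC, l.pid;

The row with granted = true is the backend actually doing the extension, so
that's the pid worth poking at further.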
> > I do suspect your pgfree/s is very high though; putting 200k pages/s on
> > the free list seems like something's broken.
>
> The system has constant and considerable load of small writes. The
> pg_activity tool shows 300 IOPS sustained (it claims max IOPS above
> 11000.) Postgres 9.3 had a comparable pgfree/s.

That leads me to a new theory... you may be running into problems finding
free buffers in the buffer pool. We need to have a buffer before we can
extend a relation, and if you have a lot of pressure on shared buffers it
can take quite a bit of CPU to find one. To make matters worse, that search
for a buffer takes place while holding the extension lock.

Would you be able to get a stack trace of a backend that's holding an
extension lock? Or maybe perf would provide some insight.

> Would you know a good resource to get more knowledgeable about pgfree,
> pgpin, pgsteal?

Unfortunately I'm not strong on the system tools.
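If you want a rough idea of how much pressure shared_buffers is under before
breaking out gdb or perf, the contrib pg_buffercache extension can help. This
is only a sketch and assumes you can install that extension; note that reading
the view briefly takes buffer manager locks, so don't hammer it on a busy box:

    -- Sketch only: assumes the contrib pg_buffercache extension is available.
    -- Lots of "hot" (usagecount >= 3) and dirty buffers with few unused ones
    -- means the search for a free buffer has to work hard -- and that search
    -- happens while the extension lock is held.
    CREATE EXTENSION IF NOT EXISTS pg_buffercache;

    SELECT count(*)                                             AS buffers_total,
           sum(CASE WHEN relfilenode IS NULL THEN 1 ELSE 0 END) AS buffers_unused,
           sum(CASE WHEN isdirty THEN 1 ELSE 0 END)             AS buffers_dirty,
           sum(CASE WHEN usagecount >= 3 THEN 1 ELSE 0 END)     AS buffers_hot
    FROM pg_buffercache;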
--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com