Thread: Waiting on ExclusiveLock on extension
Hi,

After upgrading our database from 9.3.5 to 9.4.1 last night, the server suffers from high CPU spikes. During these spikes, there are a lot of these messages in the logs:

process X still waiting for ExclusiveLock on extension of relation Y of database Z after 1036.234 ms

And:

process X acquired ExclusiveLock on extension of relation Y of database Z after 2788.050 ms

What looks suspicious is that there are sometimes several "acquired" messages for the exact same relation number in the exact same millisecond.

a) I'm assuming "extension" means growing the physical diskspace used by a table -- is that correct?

b) How can you find the name of the relation being extended? based on the relation number.

c) Why would Postgres grow a table twice in the same millisecond? Could it be an index with a high fill factor?

Any suggestions on how to approach this issue are welcome.

Thanks for your time,
Andomar

P.S. The upgrade was done with pg_dump. So the database was converted to SQL and then imported into 9.4.
On Thu, Apr 16, 2015 at 1:24 PM, Andomar <andomar@aule.net> wrote:
> After upgrading our database from 9.3.5 to 9.4.1 last night, the server
> suffers from high CPU spikes. During these spikes, there are a lot of these
> messages in the logs:
>
> process X still waiting for ExclusiveLock on extension of relation Y of
> database Z after 1036.234 ms
>
This issue has been complained about several times, and here is the most recent one:
http://www.postgresql.org/message-id/0DDFB621-7282-4A2B-8879-A47F7CECBCE4@simply.name

PG 9.4.1 shall have much alleviated it by relaxing buffer pool related locks. PG 9.4.1 shall be actually better in relation extension handling - a possible explanation is that your 9.3.5 database has been used for a while thus there are holes in pages, so not many extensions are required.

> a) I'm assuming "extension" means growing the physical diskspace used by a
> table -- is that correct?

Yes.

> b) How can you find the name of the relation being extended? based on the
> relation number.

select <number>::regclass;

> c) Why would Postgres grow a table twice in the same millisecond? Could it
> be an index with a high fill factor?

PG extends by one page each time new space is needed.

> Any suggestions on how to approach this issue are welcome.
>
There is some diagnosis in the above link; see if it rings any bell. From the PG kernel side, I think fundamentally we may want to extend many pages each time instead of one.

Regards,
Qingqing
Thanks for your reply.

> This issue has been complained about several times, and here is the most recent one:
> http://www.postgresql.org/message-id/0DDFB621-7282-4A2B-8879-A47F7CECBCE4@simply.name

That post is about a server with huge shared_buffers, but ours is just 8GB. Total memory is 48GB on a dedicated server. Checkpoints write around 2% of the buffers.

> PG 9.4.1 shall have much alleviated it by relaxing buffer pool related
> locks. PG 9.4.1 shall be actually better in relation extension
> handling - a possible explanation is that your 9.3.5 database has been
> used for a while thus there are holes in pages, so not many extensions
> are required.

The 9.3.5 version went live as an upgrade from 9.1.x in the same way. So it started from an SQL dump. The load has not changed much since the 9.3.5 upgrade.

With holes in pages, I suppose you mean the fill factor? Is there a way to see the current fillfactor of a table and its indices?

Kind regards,
Andomar
On Thu, Apr 16, 2015 at 2:39 PM, Andomar <andomar@aule.net> wrote:
> That post is about a server with huge shared_buffers, but ours is just 8GB.
> Total memory is 48GB on a dedicated server. Checkpoints write around 2%
> of the buffers.

Are you able to take some 'perf top' during a high CPU spike and see what's burning CPU there? Though the issue is related to blocking, high CPU spikes may hint at some spinning-to-acquire behavior.

> With holes in pages, I suppose you mean the fill factor? Is there a way to
> see the current fillfactor of a table and its indices?
>
Yes, holes meaning free space within a page. It can come from a <100% fillfactor or from vacuum-collected dead tuples. You can see the fillfactor value in pg_class.reloptions. If you see nothing there, that's 100% for heap and 90% for btree by default.

If your previous relation size is smaller than after the upgrade, that's a signal that you do have holes in the relation, thus extension can sometimes be avoided for new tuples.

Regards,
Qingqing
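P.S. A quick way to check (just a sketch; 'mytable' is a placeholder for one of your table names):

    -- Non-default storage options (including fillfactor) for a table and its indexes.
    -- reloptions is NULL when everything is at the default.
    SELECT c.relname, c.relkind, c.reloptions
    FROM pg_class c
    WHERE c.oid = 'mytable'::regclass
       OR c.oid IN (SELECT indexrelid FROM pg_index
                    WHERE indrelid = 'mytable'::regclass);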
> Are you able to take some 'perf top' during a high CPU spike and see
> what's burning CPU there? Though the issue is related to blocking, high
> CPU spikes may hint at some spinning-to-acquire behavior.

Will do, although hopefully the spikes were only growing pains after the upgrade.

> If your previous relation size is smaller than after the upgrade, that's a
> signal that you do have holes in the relation, thus extension can
> sometimes be avoided for new tuples.

The relation between high CPU and page splits is not immediately obvious to me. We run with synchronous_commit off, but there does seem to be a peak in I/O requests around the CPU spikes. Is a page split by nature a synchronous I/O activity? And do the other connections wait in some kind of CPU-intensive form (like a spinlock)?

Kind regards,
Andomar
On 2015-04-16 14:23:25 -0700, Qingqing Zhou wrote:
> On Thu, Apr 16, 2015 at 1:24 PM, Andomar <andomar@aule.net> wrote:
> > After upgrading our database from 9.3.5 to 9.4.1 last night, the server
> > suffers from high CPU spikes. During these spikes, there are a lot of these
> > messages in the logs:
> >
> > process X still waiting for ExclusiveLock on extension of relation Y of
> > database Z after 1036.234 ms
> >
> This issue has been complained about several times, and here is the most recent one:
> http://www.postgresql.org/message-id/0DDFB621-7282-4A2B-8879-A47F7CECBCE4@simply.name
>
> PG 9.4.1 shall have much alleviated it by relaxing buffer pool related
> locks.

Hm. I'm not aware of related changes in 9.4? 9.5 should be a bit better, but I don't think 9.4 will make much of a difference.

> > b) How can you find the name of the relation being extended? based on the
> > relation number.
> select <number>::regclass;

That's not correct. The relfilenode is only equivalent to the relation's oid when you initially create it. But once rewritten it'll change. Rather use

SELECT oid::regclass FROM pg_class WHERE pg_relation_filenode(oid) = 2937136;

> > Any suggestions on how to approach this issue are welcome.
> >
> There is some diagnosis in the above link; see if it rings any bell. From
> the PG kernel side, I think fundamentally we may want to extend many pages
> each time instead of one.

I don't really agree that that's the most important bit. See
http://archives.postgresql.org/message-id/20150329185619.GA29062%40alap3.anarazel.de

Greetings,
Andres Freund
On Fri, Apr 17, 2015 at 1:14 AM, Andres Freund <andres@anarazel.de> wrote:
> On 2015-04-16 14:23:25 -0700, Qingqing Zhou wrote:
> > On Thu, Apr 16, 2015 at 1:24 PM, Andomar <andomar@aule.net> wrote:
> > > b) How can you find the name of the relation being extended? based on the
> > > relation number.
> > select <number>::regclass;
>
> That's not correct. The relfilenode is only equivalent to the relation's
> oid when you initially create it. But once rewritten it'll change.
> Rather use
> SELECT oid::regclass FROM pg_class WHERE pg_relation_filenode(oid) = 2937136;

But the thing being reported in the lock wait log message is the relation OID itself, not the relfilenode. You would use your query if you get the number by watching the file names in the file system as they grow, but not if you get it from the log message.
Cheers,
Jeff
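P.S. To spell it out: for a number taken from the log message a plain cast is enough, while the pg_relation_filenode() form is what you would use for a number read off a data file name on disk:

    -- Relation OID as reported in the lock wait message:
    SELECT 1249::regclass;

    -- Number taken from a file name in the data directory (a relfilenode):
    SELECT oid::regclass FROM pg_class WHERE pg_relation_filenode(oid) = 2937136;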
On Fri, Apr 17, 2015 at 1:14 AM, Andres Freund <andres@anarazel.de> wrote:
> Hm. I'm not aware of related changes in 9.4? 9.5 should be a bit better,
> but I don't think 9.4 will make much of a difference.
>
You are right. I misread the check-in log.

> I don't really agree that that's the most important bit. See
> http://archives.postgresql.org/message-id/20150329185619.GA29062%40alap3.anarazel.de
>
Thanks for the link. I am actually thinking about changing this in a more fundamental way - will follow up on that thread.

Regards,
Qingqing
On 4/16/15 4:39 PM, Andomar wrote:
> Thanks for your reply.
>> This issue has been complained about several times, and here is the most
>> recent one:
>> http://www.postgresql.org/message-id/0DDFB621-7282-4A2B-8879-A47F7CECBCE4@simply.name
>>
> That post is about a server with huge shared_buffers, but ours is just
> 8GB. Total memory is 48GB on a dedicated server. Checkpoints write
> around 2% of the buffers.

Yeah, I suspect the OP in that thread was seeing something different than you are, but that's just a guess.

>> PG 9.4.1 shall have much alleviated it by relaxing buffer pool related
>> locks. PG 9.4.1 shall be actually better in relation extension
>> handling - a possible explanation is that your 9.3.5 database has been
>> used for a while thus there are holes in pages, so not many extensions
>> are required.
> The 9.3.5 version went live as an upgrade from 9.1.x in the same way. So
> it started from an SQL dump. The load has not changed much since the
> 9.3.5 upgrade.

Yes, but did you have the same workload when you upgraded to 9.3 as you do today?

> With holes in pages, I suppose you mean the fill factor?

The fill factor plays a role in whether a page has free space, yes, *especially* during an initial load (like restoring from pg_dump). Keep in mind that as the system runs you're going to be creating free space in pages as data is updated or deleted. So the amount of free space per page on the 9.3 database you moved from wouldn't be anything like what it was when you moved to 9.3.

> Is there a way
> to see the current fillfactor of a table and its indices?

Possibly the bloat query (https://wiki.postgresql.org/wiki/Show_database_bloat) would. The pageinspect utility certainly would tell you.

Possibly more useful would be to see what the pg_freespacemap contrib module shows in the free space map, since that's what the system will actually use to see where it can find a page to insert data on.

BTW, something else to be aware of: because you essentially re-loaded all your data in a single transaction, that means that a: you'll be doing a lot of hint bit updates until all data has been read a second time, and b: a lot of this data will come up for freezing at around the same time, creating a big chunk of work for autovacuum. That's caused problems for me in the past, though that was on a database that had a pretty high workload.
--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com
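P.S. If you want to look at the free space map directly, something along these lines should work (just a sketch; it needs the pg_freespacemap extension installed, and 'mytable' is a placeholder):

    CREATE EXTENSION IF NOT EXISTS pg_freespacemap;

    -- How much free space the FSM thinks each page of the table has
    SELECT avail, count(*) AS pages
    FROM pg_freespace('mytable')
    GROUP BY avail
    ORDER BY avail;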
> Yes, but did you have the same workload when you upgraded to 9.3 as
> you do today?

The workload is very similar. We upgraded from 9.1 to 9.3 only two months ago, and our usage statistics have not changed much. There were no "remaining connection slots are reserved for non-replication superuser connections" messages in the weeks after the 9.3 upgrade.

> BTW, something else to be aware of: because you essentially re-loaded
> all your data in a single transaction, that means that a: you'll be
> doing a lot of hint bit updates until all data has been read a second
> time, and b: a lot of this data will come up for freezing at around
> the same time, creating a big chunk of work for autovacuum. That's
> caused problems for me in the past, though that was on a database that
> had a pretty high workload.

We run "VACUUM ALL" every night, and although that gave a bit more CPU and I/O, it was nothing major.

When the problem occurs, the symptoms are:
- spike to a very high load average (= CPU usage)
- messages about long waits on ExclusiveLock on extension appear
- remaining connection slots are reserved (with corresponding end user impact)

An example of a message with 1 second wait duration and a lot of waiting queries:

process 667 still waiting for ExclusiveLock on extension of relation 1249 of database 16406 after 1005.226 ms","Process holding the lock: 36279. Wait queue: 36725, 36405, 36511, 36721, 36280, 36048, 36566, 36636, 36466, 36734, 36723, 36621, 36423, 36931, 36720, 36429, 35500, 36735, 37015, 36717, 36938, 36870, 36732, 36587, 36869, 36285, 36573, 37101, 36937, 36414, 36834, 37105, 36867, 36724, 36991, 37102, 36882, 36802, 37163, 39964, 39723, 40044, 39821, 40150, 40218, 40203, 40054, 40060, 40173, 40091, 40174, 40058, 40658, 40370, 40177, 40920, 41085, 41103, 41117, 41154, 41161, 41066, 41053, 41380, 40661, 40632, 40698, 41242, 40681, 41174, 41328, 41075, 41245, 41326, 41523, 41153, 41170, 40543, 41314, 41526, 41490, 41157, 41353, 41472, 41730, 41546, 45087, 41535, 41474, 41362, 41450, 41948, 41929, 41459, 41508, 42117, 42127, 41950, 41922, 42414, 41939, 42565, 42643, 42242, 41796, 42324, 42358, 42411, 42487, 41758, 42120, 42570, 41820, 41925, 43356, 43381, 43360, 43351, 43364, 42336, 42871, 43007, 42455, 43363, 42287, 43336, 42652, 42803, 43567, 43706, 43795, 43630, 43716, 43810, 43596, 43061, 43954, 44014, 43377, 43366, 43825, 43454, 43840, 43582, 43839, 44009, 43842, 43693, 44320, 43824, 43456, 43852, 43863, 44708, 43848, 44255, 44587, 44936, 44915, 44759, 44700, 44948, 45051, 44808, 45189, 45137, 45037, 45303, 45294, 45710, 45711, 45755, 45660, 45120, 45576, 46221, 46125, 46703, 46512, 46399, 46684, 46762, 46373, 46929, 46443, 46817, 46858, 47017, 46886, 46805, 46890, 47593, 47548, 47272, 47454, 47906, 47462, 47801, 47939, 47994, 48098, 48363, 47815, 48393, 48250, 48213, 48470, 48408, 48509, 48499, 48115, 48448, 48877, 461, 560, 637, 48902, 641, 49019, 667, 877, 1306, 1070, 1265.",,,,"SQL statement ""CREATE TEMPORARY TABLE

An example of a message with 140 second wait duration:

process 27173 acquired ExclusiveLock on extension of relation 16787 of database 16406 after 138522.699 ms",,,,,"SQL statement ""INSERT INTO

There are about 50 queries a second, so a 10 second lock will exhaust the number of available connections.
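Next time it happens I will also try to capture the extension lock waiters directly from pg_locks, with something like this (untested sketch):

    -- Relation extension locks currently held or waited on
    SELECT relation::regclass, pid, granted
    FROM pg_locks
    WHERE locktype = 'extend';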
We keep sar logs, and they show an unusual amount of pgscand/s and pgsteal/s around the time of the problem:

07:25:01 PM  pgpgin/s  pgpgout/s    fault/s  majflt/s   pgfree/s  pgscank/s  pgscand/s  pgsteal/s  %vmeff
08:25:01 PM     29.93    1069.13  393727.33      0.00  170655.03       0.00      90.20      90.20  100.00
08:26:01 PM     20.14    1325.55  388751.51      0.00  157188.40       0.00     315.81     315.81  100.00
08:27:02 PM     59.33    1215.97  612453.98      0.00  271483.71       0.00     180.40     180.40  100.00
08:28:01 PM     43.28    1881.71  473067.50      0.00  236778.10       0.00     114.87     114.87  100.00
08:29:01 PM     23.49     665.73  406865.61      0.00  213178.17       0.00     293.11     293.11  100.00
08:30:01 PM     31.39    1317.51  448565.03      0.02  193320.76       0.00     225.88     225.83   99.98
-- Problem starts here
08:31:02 PM     21.16    1499.32  468467.44      0.10  211767.21       0.00    4311.87    4311.84  100.00
08:32:01 PM     10.19     648.87  261067.16      0.05  167231.84       0.00    1071.01    1070.98  100.00
08:33:01 PM     63.59     950.94  422101.19      0.22  242284.12       0.00    1243.37    1243.34  100.00
-- Problem ends here
08:34:01 PM     24.97    1321.61  412294.87      0.00  273734.03       0.00       0.00       0.00    0.00
08:35:01 PM     11.60    1094.22  353741.41      0.00  238541.73       0.00       0.00       0.00    0.00
08:36:01 PM     39.22     976.60  368450.80      0.10  240632.57       0.00       0.00       0.00    0.00
08:37:01 PM     19.83     967.31  320415.39      0.00  217557.42       0.00       0.00       0.00    0.00
08:38:01 PM     15.68    1884.09  301785.58      0.00  200274.51       0.00       0.00       0.00    0.00
08:39:01 PM     62.61     858.31  487099.01      0.00  330130.65       0.00       0.00       0.00    0.00

pgscand/s = Number of pages scanned directly per second.
pgsteal/s = Number of pages the system has reclaimed from cache (pagecache and swapcache) per second to satisfy its memory demands.

Could the pgscand and pgsteal numbers provide a hint? They're sometimes zero for more than half an hour, so they don't seem related to checkpoints.

Kind regards,
Andomar
On 4/17/15 4:22 PM, Andomar wrote:
>> Yes, but did you have the same workload when you upgraded to 9.3 as
>> you do today?
>
> The workload is very similar. We upgraded from 9.1 to 9.3 only two
> months ago, and our usage statistics have not changed much. There were
> no "remaining connection slots are reserved for non-replication
> superuser connections" messages in the weeks after the 9.3 upgrade.
>
>> BTW, something else to be aware of: because you essentially re-loaded
>> all your data in a single transaction, that means that a: you'll be
>> doing a lot of hint bit updates until all data has been read a second
>> time, and b: a lot of this data will come up for freezing at around
>> the same time, creating a big chunk of work for autovacuum. That's
>> caused problems for me in the past, though that was on a database that
>> had a pretty high workload.
>
> We run "VACUUM ALL" every night, and although that gave a bit more CPU
> and I/O, it was nothing major.
>
> When the problem occurs, the symptoms are:
> - spike to a very high load average (= CPU usage)
> - messages about long waits on ExclusiveLock on extension appear
> - remaining connection slots are reserved (with corresponding end user
> impact)
>
> An example of a message with 1 second wait duration and a lot of waiting
> queries:
>
> process 667 still waiting for ExclusiveLock on extension of relation
> 1249 of database 16406 after 1005.226 ms","Process holding the lock:
> 36279. Wait queue: 36725, 36405, 36511, 36721, 36280, 36048, 36566, ...
> 49019, 667, 877, 1306, 1070, 1265.",,,,"SQL statement ""CREATE TEMPORARY
> TABLE

Ok, that's a MAJOR hint, because relation 1249 is a system catalog; namely pg_attribute. So I think what's happening here is that your catalog has become horrifically bloated. I'm 99% certain that VACUUM ALL will not vacuum the catalog tables.

Do you by chance have autovacuum turned off?

A manual VACUUM VERBOSE pg_attribute might provide some immediate relief.

> An example of a message with 140 second wait duration:
>
> process 27173 acquired ExclusiveLock on extension of relation 16787 of
> database 16406 after 138522.699 ms",,,,,"SQL statement ""INSERT INTO

This is a different case though, because 16787 is not a catalog table. (SELECT 16787::regclass; will tell you what table that is).

> There are about 50 queries a second, so a 10 second lock will exhaust
> the number of available connections.

Are you using a connection pool? Establishing 50 new database connections per second won't do anything to help performance... Running out of connections in this scenario isn't surprising.

> We keep sar logs, and they show an unusual amount of pgscand/s and
> pgsteal/s around the time of the problem:
>
> 07:25:01 PM  pgpgin/s  pgpgout/s    fault/s  majflt/s   pgfree/s  pgscank/s  pgscand/s  pgsteal/s  %vmeff
> 08:25:01 PM     29.93    1069.13  393727.33      0.00  170655.03       0.00      90.20      90.20  100.00
> 08:26:01 PM     20.14    1325.55  388751.51      0.00  157188.40       0.00     315.81     315.81  100.00
...
> pgscand/s = Number of pages scanned directly per second.
> pgsteal/s = Number of pages the system has reclaimed from cache
> (pagecache and swapcache) per second to satisfy its memory demands.
>
> Could the pgscand and pgsteal numbers provide a hint? They're sometimes
> zero for more than half an hour, so they don't seem related to checkpoints.

I think what that means is that there was suddenly a big spike in memory demand at the OS level, so now the OS is frantically dumping cached pages. That in itself won't explain this, but it may be a clue.
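Going back to the pg_attribute point: a rough way to check whether it really is bloated is to compare its on-disk size with the row estimate (nothing fancy, just catalog numbers):

    -- A very large size with only a few thousand rows suggests bloat.
    SELECT pg_size_pretty(pg_relation_size('pg_attribute')) AS on_disk,
           reltuples::bigint AS estimated_rows
    FROM pg_class
    WHERE oid = 'pg_attribute'::regclass;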
In order to extend a relation we need to ask the filesystem to actually extend the file (which presumably means at least writing some metadata to disk), and then I think we create a WAL record. Creating the WAL record won't by itself write to disk... *unless* the wal_buffers are all already full. So if you also see an I/O spike when this happens you could well just be starved from the I/O system (though obviously it'd be better if we handled that situation more elegantly than this).

Another possibility given the big uptick in page scanning in the OS is that you're saturating the memory bus... but you'd need WAY more than 5k pages/s to do that. I do suspect your pgfree/s is very high though; putting 200k pages/s on the free list seems like something's broken.

BTW, http://www.postgresql.org/message-id/466D72D2-68EC-4FF4-93B8-43B687E7B705@simply.name contains a list of URLs about relation extension problems.
--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com
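P.S. Another thing that might be worth watching during a spike (just a suggestion) is pg_stat_bgwriter: if buffers_backend and especially buffers_backend_fsync climb quickly, ordinary backends are having to write out (and sometimes fsync) dirty buffers themselves, which points at buffer/IO pressure:

    SELECT buffers_checkpoint, buffers_clean, maxwritten_clean,
           buffers_backend, buffers_backend_fsync, buffers_alloc, stats_reset
    FROM pg_stat_bgwriter;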
To put the top question first:

How can table extension locks explain a massive spike in CPU usage?

I can imagine 400 connections waiting on disk I/O, but then, wouldn't they all be sleeping?

> Ok, that's a MAJOR hint, because relation 1249 is a system catalog;
> namely pg_attribute. So I think what's happening here is that your
> catalog has become horrifically bloated. I'm 99% certain that VACUUM ALL
> will not vacuum the catalog tables.
>
> Do you by chance have autovacuum turned off?
>
> A manual VACUUM VERBOSE pg_attribute might provide some immediate relief.
>
Autovacuum is turned on. In addition, we do a manual VACUUM ALL at night. VACUUM VERBOSE pg_attribute ran in 0 seconds and processed a few hundred rows.

> Are you using a connection pool? Establishing 50 new database
> connections per second won't do anything to help performance...
>
As I understand it, a pool reduces network and CPU load. We have never seen any issues with those. So the extra monitoring and maintenance cost of a pool seems hard to justify.

> I think what that means is that there was suddenly a big spike in memory
> demand at the OS level, so now the OS is frantically dumping cached
> pages. That in itself won't explain this, but it may be a clue.
>
We monitor memory usage with Cacti. It's a dedicated server and nearly all memory is used as cache. If a script runs and demands memory, that becomes visible as cache is cleared out. There is no change in the amount of memory used as cache around the outage.

> In order to extend a relation we need to ask the filesystem to actually
> extend the file (which presumably means at least writing some metadata
> to disk), and then I think we create a WAL record. Creating the WAL
> record won't by itself write to disk... *unless* the wal_buffers are all
> already full.
>
I have a question here: we have "synchronous_commit = off". So when Postgres extends a page, would it do that just in memory, or does part of the extend operation require synchronous I/O?

> So if you also see an I/O spike when this happens you could well
> just be starved from the I/O system (though obviously it'd be
> better if we handled that situation more elegantly than this).

The SAR data shows no increase in pgpgin/s and pgpgout/s, which if I understand it correctly, means that there is no I/O spike. There is however an enormous increase in CPU usage.

> I do suspect your pgfree/s is very high though; putting 200k pages/s on
> the free list seems like something's broken.
>
The system has constant and considerable load of small writes. The pg_activity tool shows 300 IOPs sustained (it claims max IOPS above 11000.) Postgres 9.3 had a comparable pgfree/s.

Would you know a good resource to get more knowledgeable about pgfree, pgpgin, pgsteal?

Kind regards,
Andomar
On 4/19/15 4:24 AM, Andomar wrote:
> To put the top question first:
>
> How can table extension locks explain a massive spike in CPU usage?
>
> I can imagine 400 connections waiting on disk I/O, but then, wouldn't
> they all be sleeping?

Not necessarily. Spinlocks don't put the process to sleep, but they're also supposed to be very short lived.

> > Ok, that's a MAJOR hint, because relation 1249 is a system catalog;
> > namely pg_attribute. So I think what's happening here is that your
> > catalog has become horrifically bloated. I'm 99% certain that VACUUM ALL
> > will not vacuum the catalog tables.
> >
> > Do you by chance have autovacuum turned off?
> >
> > A manual VACUUM VERBOSE pg_attribute might provide some immediate relief.
> >
> Autovacuum is turned on. In addition, we do a manual VACUUM ALL at
> night. VACUUM VERBOSE pg_attribute ran in 0 seconds and processed a few
> hundred rows.
>
> > Are you using a connection pool? Establishing 50 new database
> > connections per second won't do anything to help performance...
> >
> As I understand it, a pool reduces network and CPU load. We have never
> seen any issues with those. So the extra monitoring and maintenance
> cost of a pool seems hard to justify.

Well, it sounds like you are CPU bound here... :P I don't know if this is related or not, but it wouldn't hurt. If you install pgbouncer on the database server itself (which it's designed for) it shouldn't add much maintenance cost.

> > I think what that means is that there was suddenly a big spike in memory
> > demand at the OS level, so now the OS is frantically dumping cached
> > pages. That in itself won't explain this, but it may be a clue.
> >
> We monitor memory usage with Cacti. It's a dedicated server and nearly
> all memory is used as cache. If a script runs and demands memory, that
> becomes visible as cache is cleared out. There is no change in the
> amount of memory used as cache around the outage.
>
> > In order to extend a relation we need to ask the filesystem to actually
> > extend the file (which presumably means at least writing some metadata
> > to disk), and then I think we create a WAL record. Creating the WAL
> > record won't by itself write to disk... *unless* the wal_buffers are all
> > already full.
> >
> I have a question here: we have "synchronous_commit = off". So when
> Postgres extends a page, would it do that just in memory, or does part
> of the extend operation require synchronous I/O?

Turning that off doesn't mean there will never be an fsync, it just means that we don't wait for one before returning from COMMIT. I don't think relation extension itself requires an fsync, but see below.

> > So if you also see an I/O spike when this happens you could well
> > just be starved from the I/O system (though obviously it'd be
> > better if we handled that situation more elegantly than this).
>
> The SAR data shows no increase in pgpgin/s and pgpgout/s, which if I
> understand it correctly, means that there is no I/O spike. There is
> however an enormous increase in CPU usage.

I'm not familiar enough with SAR to know if that's correct or not; iostat would be a good way to confirm it.

> > I do suspect your pgfree/s is very high though; putting 200k pages/s on
> > the free list seems like something's broken.
> >
> The system has constant and considerable load of small writes. The
> pg_activity tool shows 300 IOPs sustained (it claims max IOPS above
> 11000.) Postgres 9.3 had a comparable pgfree/s.

That leads me to a new theory... you may be running into problems finding free buffers in the buffer pool. We need to have a buffer before we can extend a relation, and if you have a lot of pressure on shared buffers it can take quite a bit of CPU to find one. To make matters worse, that search for a buffer takes place while holding the extension lock.

Would you be able to get a stack trace of a backend that's holding an extension lock? Or maybe perf would provide some insight.

> Would you know a good resource to get more knowledgeable about pgfree,
> pgpgin, pgsteal?

Unfortunately I'm not strong on the system tools.
--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com
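P.S. If the free-buffer theory is right, the buffer usage counts should show it. A rough way to look (just a sketch; it needs the pg_buffercache extension):

    CREATE EXTENSION IF NOT EXISTS pg_buffercache;

    -- If almost every buffer sits at the maximum usage count (5), the clock
    -- sweep has to do a lot of work to find a victim buffer.
    SELECT usagecount, count(*) AS buffers
    FROM pg_buffercache
    GROUP BY usagecount
    ORDER BY usagecount;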
On 4/19/15 3:08 PM, Jim Nasby wrote:
>> > I do suspect your pgfree/s is very high though; putting 200k pages/s on
>> > the free list seems like something's broken.
>> >
>> The system has constant and considerable load of small writes. The
>> pg_activity tool shows 300 IOPs sustained (it claims max IOPS above
>> 11000.) Postgres 9.3 had a comparable pgfree/s.
>
> That leads me to a new theory... you may be running into problems
> finding free buffers in the buffer pool. We need to have a buffer before
> we can extend a relation, and if you have a lot of pressure on shared
> buffers it can take quite a bit of CPU to find one. To make matters
> worse, that search for a buffer takes place while holding the extension
> lock.
>
> Would you be able to get a stack trace of a backend that's holding an
> extension lock? Or maybe perf would provide some insight.

BTW, http://postgresql.org/message-id/flat/20150329185619.GA29062@alap3.anarazel.de has some useful info about this.
--
Jim Nasby, Data Architect, Blue Treble Consulting
Data in Trouble? Get it in Treble! http://BlueTreble.com
> Would you be able to get a stack trace of a backend that's holding an
> extension lock? Or maybe perf would provide some insight.
>
The outage occurred again but briefer. There were no ExclusiveLock messages, presumably because the timeout for logging locks was not exceeded. But all available connection slots were used and many incoming requests were denied.

Below you'll find the "perf report" and the "zoomed in" (I suppose that's the callstack) of the top entry. The top entry's call stack has these 4 postmaster functions near the top:

PinBuffer
LockRelease
hash_any
HeapTupleSatisfiesMVCC

We'll be rolling back from 9.4.1 to 9.3.6 tonight, hoping that will resolve the issue.

===== BELOW A 10 SECOND perf top CAPTURE DURING THE OUTAGE

# ========
# captured on: Mon Apr 20 20:34:43 2015
# hostname : db1a
# os release : 2.6.32-504.1.3.el6.x86_64
# perf version : 2.6.32-504.1.3.el6.x86_64.debug
# arch : x86_64
# nrcpus online : 24
# nrcpus avail : 24
# cpudesc : Intel(R) Xeon(R) CPU E5-2667 0 @ 2.90GHz
# cpuid : GenuineIntel,6,45,7
# total memory : 49373964 kB
# cmdline : /usr/bin/perf record -a -o 2015-04-20_20:34:28 sleep 10
# event : name = cycles, type = 0, config = 0x0, config1 = 0x0, config2 = 0x0, excl_usr = 0, excl_kern = 0, excl_host = 0, excl_guest = 1, precise_ip = 0, attr_mmap2 = 0, attr_mmap = 1, attr_mmap_data = 0
# HEADER_CPU_TOPOLOGY info available, use -I to display
# HEADER_NUMA_TOPOLOGY info available, use -I to display
# pmu mappings: uncore_r3qpi_1 = 23, uncore_r3qpi_0 = 22, uncore_r2pcie = 21, uncore_qpi_1 = 20, uncore_qpi_0 = 19, uncore_imc_3 = 18, uncore_imc_2 = 17, uncore_imc_1 = 16, uncore_imc_0 = 15, uncore_ha = 14, uncore_pcu = 13, uncore_cbox_5 = 12, uncore_c
# ========
#
# Samples: 960K of event 'cycles'
# Event count (approx.): 757057831613
#
# Overhead  Command     Shared Object      Symbol
# ........  ..........  .................  .................................
    59.73%  postmaster  [kernel.kallsyms]  [k] compaction_alloc
     1.31%  postmaster  [kernel.kallsyms]  [k] _spin_lock
     0.94%  postmaster  [kernel.kallsyms]  [k] __reset_isolation_suitable
     0.78%  postmaster  [kernel.kallsyms]  [k] compact_zone
     0.67%  postmaster  [kernel.kallsyms]  [k] get_pageblock_flags_group
     0.64%  postmaster  [kernel.kallsyms]  [k] copy_page_c
     0.48%  :13410      [kernel.kallsyms]  [k] compaction_alloc
     0.45%  :13465      [kernel.kallsyms]  [k] compaction_alloc
     0.45%  postmaster  [kernel.kallsyms]  [k] clear_page_c
     0.44%  postmaster  postgres           [.] hash_search_with_hash_value
     0.41%  :13324      [kernel.kallsyms]  [k] compaction_alloc
     0.40%  :13561      [kernel.kallsyms]  [k] compaction_alloc
     0.38%  :13374      [kernel.kallsyms]  [k] compaction_alloc
     0.37%  :13272      [kernel.kallsyms]  [k] compaction_alloc
     0.37%  postmaster  [kernel.kallsyms]  [k] unmap_vmas
     0.36%  postmaster  [kernel.kallsyms]  [k] page_fault
     0.36%  :13380      [kernel.kallsyms]  [k] compaction_alloc
     0.35%  :13482      [kernel.kallsyms]  [k] compaction_alloc
     0.34%  :13555      [kernel.kallsyms]  [k] compaction_alloc
     0.34%  postmaster  [kernel.kallsyms]  [k] set_pageblock_flags_group
     0.34%  postmaster  [kernel.kallsyms]  [k] page_check_address
     0.33%  :13528      [kernel.kallsyms]  [k] compaction_alloc
     0.33%  :13464      [kernel.kallsyms]  [k] compaction_alloc
     0.31%  :13547      [kernel.kallsyms]  [k] compaction_alloc
     0.30%  postmaster  [kernel.kallsyms]  [k] _spin_lock_irqsave
     0.29%  :13395      [kernel.kallsyms]  [k] compaction_alloc
     0.29%  :13546      [kernel.kallsyms]  [k] compaction_alloc
     0.28%  postmaster  [kernel.kallsyms]  [k] remove_migration_pte
     0.28%  :13355      [kernel.kallsyms]  [k] compaction_alloc
     0.28%  postmaster  [kernel.kallsyms]  [k] list_del
     0.28%  :13432      [kernel.kallsyms]  [k] compaction_alloc
     0.27%  :13258      [kernel.kallsyms]  [k] compaction_alloc
     0.27%  :13328      [kernel.kallsyms]  [k] compaction_alloc
     0.26%  postmaster  [kernel.kallsyms]  [k] __wake_up_bit
     0.26%  :13361      [kernel.kallsyms]  [k] compaction_alloc
     0.25%  :13334      [kernel.kallsyms]  [k] compaction_alloc
     0.25%  :13366      [kernel.kallsyms]  [k] compaction_alloc
     0.25%  :13549      [kernel.kallsyms]  [k] compaction_alloc
     0.25%  :13530      [kernel.kallsyms]  [k] compaction_alloc
     0.24%  :13391      [kernel.kallsyms]  [k] compaction_alloc
     0.24%  :13387      [kernel.kallsyms]  [k] compaction_alloc
     0.24%  :13364      [kernel.kallsyms]  [k] compaction_alloc
     0.24%  :13529      [kernel.kallsyms]  [k] compaction_alloc
     0.24%  postmaster  [kernel.kallsyms]  [k] find_get_page
     0.23%  postmaster  postgres           [.] _bt_compare
     0.22%  :13522      [kernel.kallsyms]  [k] compaction_alloc
     0.22%  :13579      [kernel.kallsyms]  [k] compaction_alloc
     0.22%  :13412      [kernel.kallsyms]  [k] compaction_alloc
     0.21%  postmaster  postgres           [.] AllocSetAlloc
     0.21%  :13407      [kernel.kallsyms]  [k] compaction_alloc
     0.20%  :13569      [kernel.kallsyms]  [k] compaction_alloc
     0.20%  :13322      [kernel.kallsyms]  [k] compaction_alloc
     0.20%  :13554      [kernel.kallsyms]  [k] compaction_alloc
     0.20%  :13327      [kernel.kallsyms]  [k] compaction_alloc

===== BELOW IS THE ZOOM OF THE TOP ENTRY

Samples: 604K of event 'cycles', Event count (approx.): 476360854137, Thread: postmaster(11565)
    94.92%  postmaster  [kernel.kallsyms]  [k] compaction_alloc
     1.49%  postmaster  [kernel.kallsyms]  [k] __reset_isolation_suitable
     1.01%  postmaster  [kernel.kallsyms]  [k] copy_page_c
     0.54%  postmaster  [kernel.kallsyms]  [k] set_pageblock_flags_group
     0.41%  postmaster  [kernel.kallsyms]  [k] __wake_up_bit
     0.37%  postmaster  [kernel.kallsyms]  [k] find_get_page
     0.16%  postmaster  [kernel.kallsyms]  [k] find_vma
     0.16%  postmaster  libc-2.12.so       [.] _int_malloc
     0.16%  postmaster  postgres           [.] PinBuffer
     0.14%  postmaster  [kernel.kallsyms]  [k] __alloc_pages_nodemask
     0.11%  postmaster  [kernel.kallsyms]  [k] filemap_fault
     0.08%  postmaster  postgres           [.] LockRelease
     0.08%  postmaster  postgres           [.] hash_any
     0.06%  postmaster  libc-2.12.so       [.] _int_free
     0.06%  postmaster  [kernel.kallsyms]  [k] page_remove_rmap
     0.03%  postmaster  postgres           [.] HeapTupleSatisfiesMVCC
     0.03%  postmaster  [kernel.kallsyms]  [k] free_pages_and_swap_cache
     0.03%  postmaster  postgres           [.] s_lock
     0.02%  postmaster  [kernel.kallsyms]  [k] unlink_anon_vmas
     0.02%  postmaster  [kernel.kallsyms]  [k] mem_cgroup_lru_add_list
     0.02%  postmaster  postgres           [.] ReindexIsProcessingIndex
     0.01%  postmaster  [kernel.kallsyms]  [k] set_page_dirty
     0.01%  postmaster  postgres           [.] eqjoinsel
     0.01%  postmaster  [kernel.kallsyms]  [k] perf_event_aux_ctx
     0.01%  postmaster  postgres           [.] hash_seq_search
     0.01%  postmaster  postgres           [.] get_mergejoin_opfamilies
     0.01%  postmaster  [kernel.kallsyms]  [k] acl_permission_check
     0.01%  postmaster  libc-2.12.so       [.] __sigsetjmp
     0.01%  postmaster  postgres           [.] GetUserId
     0.00%  postmaster  postgres           [.] pull_up_subqueries_recurse
     0.00%  postmaster  [kernel.kallsyms]  [k] do_brk
     0.00%  postmaster  postgres           [.] SearchSysCacheList
     0.00%  postmaster  postgres           [.] join_search_one_level
Another update (let me know if I'm mailing too much).

> 59.73%  postmaster  [kernel.kallsyms]  [k] compaction_alloc

The compaction_alloc function seemed to point to issues with transparent huge pages (THP). I tried to turn this off with:

echo never > /sys/kernel/mm/transparent_hugepage/enabled

This change yielded immediate results. The "migration/xx" processes disappeared from "top", the function "compaction_alloc" disappeared from "perf top", and system load dropped from 5 to 0.3. We've called off the 9.4 -> 9.3 downgrade for now.

For anyone else finding this thread, we are running CentOS 6.5 with kernel 2.6.32 (which was released in Dec 2009).

Cheers,
Andomar