Thread: Blocking excessively in FOR UPDATE
Hi list,

I've been experiencing a weird performance issue lately.

I have a very simple (and usually very fast) query:

SELECT track_logs.id
FROM track_logs
WHERE track_logs.track_id = <some id>
  AND track_logs.track_status_id = 1
  AND track_logs.date >= now() - interval '1 hours'
FOR UPDATE

Whose plan is:

"LockRows  (cost=0.00..26.73 rows=1 width=14)"
"  ->  Index Scan using idx_track_logs_track_id on track_logs  (cost=0.00..26.72 rows=1 width=14)"
"        Index Cond: (track_id = <some id>)"
"        Filter: ((track_status_id = 1) AND (date >= (now() - '01:00:00'::interval)))"

The same query, without FOR UPDATE, takes just 68 milliseconds.
With the FOR UPDATE, it takes like half a minute or more to finish.

Now, I understand the for update part may be blocking on some other
transaction, and it's probably the case. But I cannot figure out which
transaction it would be. There *are*, in fact, connections in <idle in
transaction> state, which makes me think those would be the culprit.

But for the life of me, I cannot make sense of the pg_locks view, which
shows all locks as granted:

PID    Relation  XID          TX           Mode              Granted  Start
14751  5551986                154/4038460  AccessShareLock   Yes      2011-11-03 12:45:03.551516-05
14751  5526310                154/4038460  RowShareLock      Yes      2011-11-03 12:45:03.551516-05
14751  5552008                154/4038460  RowExclusiveLock  Yes      2011-11-03 12:45:03.551516-05
14751  5552020                154/4038460  RowExclusiveLock  Yes      2011-11-03 12:45:03.551516-05
14751  5552008                154/4038460  AccessShareLock   Yes      2011-11-03 12:45:03.551516-05
14751  5525296                154/4038460  RowShareLock      Yes      2011-11-03 12:45:03.551516-05
14751  5525292                154/4038460  RowShareLock      Yes      2011-11-03 12:45:03.551516-05
14751  5552019                154/4038460  AccessShareLock   Yes      2011-11-03 12:45:03.551516-05
14751  5552019                154/4038460  RowExclusiveLock  Yes      2011-11-03 12:45:03.551516-05
14751  5552020                154/4038460  AccessShareLock   Yes      2011-11-03 12:45:03.551516-05
14751  5525292                154/4038460  RowExclusiveLock  Yes      2011-11-03 12:45:03.551516-05
14751            154/4038460  154/4038460  ExclusiveLock     Yes      2011-11-03 12:45:03.551516-05
14751                         154/4038460  ExclusiveLock     Yes      2011-11-03 12:45:03.551516-05
14751  5526308                154/4038460  AccessShareLock   Yes      2011-11-03 12:45:03.551516-05

Where should I look? What other information should I provide?
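For anyone following along, a quick way to list the suspect idle-in-transaction
sessions is to query pg_stat_activity directly. This is a sketch written against
the 9.0-era catalog, where the query text column is still called current_query
and the backend pid is procpid:

SELECT procpid, usename, xact_start, current_query
FROM pg_stat_activity
WHERE current_query = '<IDLE> in transaction'
ORDER BY xact_start;

The xact_start column shows how long each of those transactions has been
sitting open.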
On Thu, Nov 3, 2011 at 2:51 PM, Claudio Freire <klaussfreire@gmail.com> wrote:
> What other information should I provide?

Forgot all the usual details:

Server is PostgreSQL 9.0.3, running on Linux, quite loaded (load average ~7),
WAL on RAID 1 (2 spindles), data on RAID 10 (4 spindles), 16 GB RAM.

Could it be high contention between the worker processes? (because of the high load)
Claudio Freire <klaussfreire@gmail.com> writes:
> The same query, without FOR UPDATE, takes just 68 milliseconds.
> With the FOR UPDATE, it takes like half a minute or more to finish.
> Now, I understand the for update part may be blocking on some other
> transaction, and it's probably the case.

Yeah, that's what I'd guess.

> But I cannot figure out which transaction it would be. There *are*, in
> fact, connections in <idle in transaction> state, which makes me think
> those would be the culprit. But for the life of me, I cannot make
> sense of the pg_locks view, which shows all locks as granted:

A block on a row would typically show up as one transaction waiting on
another's XID. Did you capture this *while* the query was blocked?

Also, I'm suspicious that you may be using a view that filters out
the relevant lock types --- that's obviously not a raw display of
pg_locks.

			regards, tom lane
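For reference, one way to see that kind of wait is a self-join on pg_locks.
This is a rough sketch (using the standard pg_locks columns; adjust to taste)
that pairs each ungranted transactionid lock with the session holding a
granted lock on the same XID:

SELECT waiting.pid  AS waiting_pid,
       waiting.mode AS waiting_mode,
       holder.pid   AS holding_pid,
       holder.mode  AS holding_mode,
       waiting.transactionid
FROM pg_locks waiting
JOIN pg_locks holder
  ON holder.transactionid = waiting.transactionid
 AND holder.granted
 AND holder.pid <> waiting.pid
WHERE NOT waiting.granted
  AND waiting.locktype = 'transactionid';

A FOR UPDATE blocked on another transaction's row would normally show up here
as an ungranted ShareLock on that transaction's XID.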
On Thu, Nov 3, 2011 at 3:45 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Claudio Freire <klaussfreire@gmail.com> writes:
>> But I cannot figure out which transaction it would be. There *are*, in
>> fact, connections in <idle in transaction> state, which makes me think
>> those would be the culprit. But for the life of me, I cannot make
>> sense of the pg_locks view, which shows all locks as granted:
>
> A block on a row would typically show up as one transaction waiting on
> another's XID. Did you capture this *while* the query was blocked?

Yes

> Also, I'm suspicious that you may be using a view that filters out
> the relevant lock types --- that's obviously not a raw display of
> pg_locks.

It's pgadmin, which I usually use to monitor pg_stat_activity and
pg_locks in a "pretty" view. pg_locks does not show the query, only the
pid, so it's harder to spot.

Next time I find it blocking, I will check pg_locks directly and post
the output. I did that once, and they were all granted. I didn't
correlate with other XIDs since I thought the "granted" column meant it
wasn't waiting. Is that wrong?
On Thu, Nov 3, 2011 at 4:29 PM, Claudio Freire <klaussfreire@gmail.com> wrote:
> Next time I find it blocking, I will check pg_locks directly and post
> the output.

Here it is, two instances of the query, while blocked:

select * from pg_locks where pid = 22636;

   locktype    | database | relation | page | tuple | virtualxid  | transactionid | classid | objid | objsubid | virtualtransaction |  pid  |       mode       | granted
---------------+----------+----------+------+-------+-------------+---------------+---------+-------+----------+--------------------+-------+------------------+---------
 transactionid |          |          |      |       |             |     360992199 |         |       |          | 89/22579344        | 22636 | ExclusiveLock    | t
 virtualxid    |          |          |      |       | 89/22579344 |               |         |       |          | 89/22579344        | 22636 | ExclusiveLock    | t
 relation      |    16398 |  5552020 |      |       |             |               |         |       |          | 89/22579344        | 22636 | AccessShareLock  | t
 relation      |    16398 |  5552020 |      |       |             |               |         |       |          | 89/22579344        | 22636 | RowExclusiveLock | t
 relation      |    16398 |  5552019 |      |       |             |               |         |       |          | 89/22579344        | 22636 | AccessShareLock  | t
 relation      |    16398 |  5552019 |      |       |             |               |         |       |          | 89/22579344        | 22636 | RowExclusiveLock | t
 relation      |    16398 |  5525292 |      |       |             |               |         |       |          | 89/22579344        | 22636 | RowShareLock     | t
 relation      |    16398 |  5525292 |      |       |             |               |         |       |          | 89/22579344        | 22636 | RowExclusiveLock | t
 relation      |    16398 |  5552008 |      |       |             |               |         |       |          | 89/22579344        | 22636 | AccessShareLock  | t
 relation      |    16398 |  5552008 |      |       |             |               |         |       |          | 89/22579344        | 22636 | RowExclusiveLock | t
(10 rows)

select * from pg_locks where pid = 22618;

   locktype    | database | relation | page | tuple | virtualxid  | transactionid | classid | objid | objsubid | virtualtransaction |  pid  |       mode       | granted
---------------+----------+----------+------+-------+-------------+---------------+---------+-------+----------+--------------------+-------+------------------+---------
 virtualxid    |          |          |      |       | 159/2706505 |               |         |       |          | 159/2706505        | 22618 | ExclusiveLock    | t
 relation      |    16398 |  5551986 |      |       |             |               |         |       |          | 159/2706505        | 22618 | AccessShareLock  | t
 transactionid |          |          |      |       |             |     360992478 |         |       |          | 159/2706505        | 22618 | ExclusiveLock    | t
 relation      |    16398 |  5552008 |      |       |             |               |         |       |          | 159/2706505        | 22618 | AccessShareLock  | t
 relation      |    16398 |  5552008 |      |       |             |               |         |       |          | 159/2706505        | 22618 | RowExclusiveLock | t
 relation      |    16398 |  5526310 |      |       |             |               |         |       |          | 159/2706505        | 22618 | RowShareLock     | t
 relation      |    16398 |  5552020 |      |       |             |               |         |       |          | 159/2706505        | 22618 | AccessShareLock  | t
 relation      |    16398 |  5552020 |      |       |             |               |         |       |          | 159/2706505        | 22618 | RowExclusiveLock | t
 relation      |    16398 |  5526308 |      |       |             |               |         |       |          | 159/2706505        | 22618 | AccessShareLock  | t
 relation      |    16398 |  5552019 |      |       |             |               |         |       |          | 159/2706505        | 22618 | AccessShareLock  | t
 relation      |    16398 |  5552019 |      |       |             |               |         |       |          | 159/2706505        | 22618 | RowExclusiveLock | t
 relation      |    16398 |  5525296 |      |       |             |               |         |       |          | 159/2706505        | 22618 | RowShareLock     | t
 relation      |    16398 |  5525292 |      |       |             |               |         |       |          | 159/2706505        | 22618 | RowShareLock     | t
 relation      |    16398 |  5525292 |      |       |             |               |         |       |          | 159/2706505        | 22618 | RowExclusiveLock | t
(14 rows)
Claudio Freire <klaussfreire@gmail.com> writes:
> On Thu, Nov 3, 2011 at 4:29 PM, Claudio Freire <klaussfreire@gmail.com> wrote:
>> Next time I find it blocking, I will check pg_locks directly and post
>> the output.

> Here it is, two instances of the query, while blocked:

Hmm ... definitely seems that you're not blocked on a FOR UPDATE tuple
lock. If you were, there'd be an ungranted ShareLock on a transactionid
in there.

It seems possible that you're blocked on an LWLock, which would not show
in pg_locks. But before pursuing that idea, probably first you should
back up and confirm whether the process is actually waiting, or running,
or just really slow due to CPU contention. It might be useful to see
what strace has to say about it.

			regards, tom lane
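One quick sanity check along those lines is the "waiting" flag in
pg_stat_activity. A sketch for 9.0-era catalogs (backend pid is procpid, and
note that waiting only reflects heavyweight-lock waits, not LWLocks or I/O):

SELECT procpid, waiting, xact_start, current_query
FROM pg_stat_activity
WHERE procpid IN (22636, 22618);

If waiting comes back false while the query is stuck, the time is going
somewhere other than the lock manager, which is exactly where strace -p <pid>
on the backend gets interesting.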
On Thu, Nov 3, 2011 at 8:45 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> But before pursuing that idea, probably first you should
> back up and confirm whether the process is actually waiting, or running,
> or just really slow due to CPU contention. It might be useful to see
> what strace has to say about it.

Thanks for the tip, it seems locks had nothing to do with it.

strace suggests those queries get blocked on write().

This is an explain analyze without FOR UPDATE:

"Index Scan using idx_track_logs_track_id on track_logs  (cost=0.00..26.75 rows=1 width=8) (actual time=0.056..38.119 rows=1 loops=1)"
"  Index Cond: (track_id = <some id>)"
"  Filter: ((track_status_id = 1) AND (date >= (now() - '01:00:00'::interval)))"
"  Buffers: shared hit=140 read=3127"
"Total runtime: 38.147 ms"

This is with FOR UPDATE when it goes fast:

"LockRows  (cost=0.00..26.76 rows=1 width=14) (actual time=0.075..37.420 rows=1 loops=1)"
"  Buffers: shared hit=63 read=3205"
"  ->  Index Scan using idx_track_logs_track_id on track_logs  (cost=0.00..26.75 rows=1 width=14) (actual time=0.058..37.402 rows=1 loops=1)"
"        Index Cond: (track_id = <some id>)"
"        Filter: ((track_status_id = 1) AND (date >= (now() - '01:00:00'::interval)))"
"        Buffers: shared hit=62 read=3205"
"Total runtime: 37.462 ms"

I haven't managed to catch a slow one yet, but when I did (and didn't
capture the output :( ) it was kinda like:

"LockRows  (cost=0.00..26.76 rows=1 width=14) (actual time=0.075..37.420 rows=1 loops=1)"
"  Buffers: shared hit=63 read=3205"
"  ->  Index Scan using idx_track_logs_track_id on track_logs  (cost=0.00..26.75 rows=1 width=14) (actual time=0.058..37.402 rows=1 loops=1)"
"        Index Cond: (track_id = <some id>)"
"        Filter: ((track_status_id = 1) AND (date >= (now() - '01:00:00'::interval)))"
"        Buffers: shared hit=62 read=3205 written=135"
"Total runtime: 37000.462 ms"

Now, I'm thinking those writes are catching the DB at a bad moment - we
do have regular, very write-intensive peaks.

Maybe I should look into increasing shared buffers? Checkpoints are well
spread and very fast.

What are those writes about? HOT vacuuming perhaps?
On Fri, Nov 4, 2011 at 12:07 PM, Claudio Freire <klaussfreire@gmail.com> wrote:
> What are those writes about? HOT vacuuming perhaps?

Every tuple lock requires dirtying the page. Those writes are all those
dirty pages getting flushed out to disk. It's possible that the OS is
allowing the writes to happen asynchronously for a while, but then when
you get too much dirty data in the cache, it starts blocking.

The only thing I'm fuzzy about is why it's locking so many rows, given
that the output says rows=1.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
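One way to see who is doing those flushes over time is pg_stat_bgwriter. This
is a sketch using columns that exist in 9.0: buffers_backend counts pages
written directly by backends (like the one running the FOR UPDATE), as opposed
to pages written at checkpoint time or by the background writer:

SELECT checkpoints_timed,
       checkpoints_req,
       buffers_checkpoint,
       buffers_clean,
       buffers_backend,
       buffers_alloc
FROM pg_stat_bgwriter;

Sampling it before and after one of the slow runs would show whether
buffers_backend climbs while the query stalls.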
Claudio Freire <klaussfreire@gmail.com> wrote:

> Now, I'm thinking those writes are catching the DB at a bad moment -
> we do have regular very write-intensive peaks.
>
> Maybe I should look into increasing shared buffers?

As already pointed out, SELECT FOR UPDATE will require a disk write of
the tuple(s) read. If these are glutting, increasing shared_buffers
would tend to make things worse. You might want to make the background
writer more aggressive or *reduce* shared_buffers to better spread the
output.

Hopefully you have a BBU RAID controller configured for write-back. If
not you should. If you do, another thing which might help is increasing
the cache on that controller. Or you could move WAL to a separate file
system on a separate controller with its own BBU write-back cache.

-Kevin
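"More aggressive" background writer usually means something along these lines
in postgresql.conf. These are illustrative values only, to be tuned against
the workload; all three are real settings, shown next to their stock defaults:

bgwriter_delay = 100ms          # default 200ms; lower = wake more often
bgwriter_lru_maxpages = 500     # default 100; more pages cleaned per round
bgwriter_lru_multiplier = 4.0   # default 2.0; clean further ahead of demand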
On Fri, Nov 4, 2011 at 1:26 PM, Kevin Grittner
<Kevin.Grittner@wicourts.gov> wrote:
> As already pointed out, SELECT FOR UPDATE will require a disk write
> of the tuple(s) read. If these are glutting, increasing
> shared_buffers would tend to make things worse.

I thought shared_buffers improved write caching.
We do tend to write onto the same rows over and over.

> Hopefully you have a BBU RAID controller configured for write-back.
> If not you should. If you do, another thing which might help is
> increasing the cache on that controller.

Write-back on the RAID is another issue. The controller is old and has
a very small cache; we're already in the process of securing a
replacement.

> Or you could move WAL to a
> separate file system on a separate controller with its own BBU
> write-back cache.

WAL is already on a separate controller and array.
Claudio Freire <klaussfreire@gmail.com> wrote:
> On Fri, Nov 4, 2011 at 1:26 PM, Kevin Grittner
> <Kevin.Grittner@wicourts.gov> wrote:
>> As already pointed out, SELECT FOR UPDATE will require a disk
>> write of the tuple(s) read. If these are glutting, increasing
>> shared_buffers would tend to make things worse.
>
> I thought shared_buffers improved write caching.
> We do tend to write onto the same rows over and over.

PostgreSQL is very aggressive about holding on to dirty buffers as long
as possible, in hopes of reducing duplicate page writes. This can work
against the goal of consistent latency. In our shop we needed to make
the background writer more aggressive and keep shared buffers in the
0.5GB to 2GB range (depending on hardware and workload) to prevent write
gluts leading to latency spikes.

In the mildly surprising department, the OS seemed to do a decent job of
spotting pages receiving repeated writes and hold back on an OS level
write, while pushing other pages out in a more timely fashion -- there
was no discernible increase in OS write activity from making these
changes. I know other people have had other experiences, based on their
workloads (and OS versions?).

Before anything else, you might want to make sure you've spread your
checkpoint activity as much as possible by setting
checkpoint_completion_target = 0.9.

-Kevin
On Fri, Nov 4, 2011 at 2:07 PM, Kevin Grittner
<Kevin.Grittner@wicourts.gov> wrote:
> Before anything else, you might want to make sure you've spread your
> checkpoint activity as much as possible by setting
> checkpoint_completion_target = 0.9.

We have

shared_buffers = 2G
bgwriter_delay = 1000ms
effective_io_concurrency = 8
synchronous_commit = off
wal_buffers = 16M
wal_writer_delay = 2000ms
commit_delay = 10000
checkpoint_segments = 72
checkpoint_timeout = 60min
checkpoint_completion_target = 0.8

I'm thinking bgwriter_delay and wal_writer_delay might not be working as
I expected, and that maybe checkpoint_segments=72 is a bit too high, but
we were having much worse I/O storms before I pushed it that high.

Looking at checkpoint logging for the last few days, it goes almost
always like:

checkpoint complete: wrote 589 buffers (3.6%); 0 transaction log file(s)
added, 0 removed, 8 recycled; write=590.325 s, sync=0.055 s, total=590.417 s

590s seems an awful lot for 589 buffers.
On 11/04/2011 12:22 PM, Claudio Freire wrote:

> bgwriter_delay = 1000ms
> wal_writer_delay = 2000ms
> commit_delay = 10000

!?

Maybe someone can back me up on this, but my interpretation of these
settings suggests they're *way* too high. That commit_delay especially
makes me want to cry. From the manual:

"Setting commit_delay can only help when there are many concurrently
committing transactions, and it is difficult to tune it to a value that
actually helps rather than hurt throughput."

Meaning it may halt all of your commits up to *ten seconds* if it
doesn't think there was enough activity to warrant a write. Ouch.

Your bgwriter_delay and wal_writer_delay settings are equally odd.
You've made the background writer so passive that when it does finally
run, it's going to have a ton of work to do, causing giant write spikes.
I'm not sure whether or not this also causes compounding problems with
locks and backend write delays.

> checkpoint complete: wrote 589 buffers (3.6%); 0 transaction log
> file(s) added, 0 removed, 8 recycled; write=590.325 s, sync=0.055 s,
> total=590.417 s
>
> 590s seems an awful lot for 589 buffers.

You're misinterpreting this. The checkpoint completion target is
multiplied against your checkpoint timeout. 590 seconds is roughly ten
minutes, and for 589 buffers, it wrote one per second. That's about as
slow as it can possibly write that many buffers. It had *up to* 24
minutes, and if it had more buffers available to write, it would have
written them.

The number you really care about is "sync=0.055 s", which is how much
time the controller spent syncing that data to disk.

If you're having real problems writing or lock delays due to IO stalls,
you'll see that sync parameter shoot way up. This can also be elevated
in certain NVRAM-based solutions. Once you start seeing whole seconds,
or minutes, it might actually matter.

--
Shaun Thomas
OptionsHouse | 141 W. Jackson Blvd. | Suite 800 | Chicago IL, 60604
312-676-8870
sthomas@peak6.com
On Fri, Nov 4, 2011 at 3:26 PM, Shaun Thomas <sthomas@peak6.com> wrote:
> On 11/04/2011 12:22 PM, Claudio Freire wrote:
>
>> bgwriter_delay = 1000ms
>> wal_writer_delay = 2000ms
>> commit_delay = 10000
>
> !?
>
> [snip]
>
> "Setting commit_delay can only help when there are many concurrently
> committing transactions, and it is difficult to tune it to a value that
> actually helps rather than hurt throughput."
>
> Meaning it may halt all of your commits up to *ten seconds* if it doesn't
> think there was enough activity to warrant a write. Ouch.

I think you're misinterpreting the value.
It's in microseconds, that's 10 *milli*seconds.

> Your bgwriter_delay and wal_writer_delay settings are equally odd. You've
> made the background writer so passive that when it does finally run, it's
> going to have a ton of work to do, causing giant write spikes. I'm not sure
> whether or not this also causes compounding problems with locks and backend
> write delays.

I don't think 1 second can be such a big difference for the bgwriter,
but I might be wrong.

The wal_writer makes me doubt, though. If logged activity was higher
than 8MB/s, then that setting would block it all.
I guess I really should lower it.

> The number you
> really care about is "sync=0.055 s", which is how much time the controller
> spent syncing that data to disk.
>
> If you're having real problems writing or lock delays due to IO stalls,
> you'll see that sync parameter shoot way up. This can also be elevated in
> certain NVRAM-based solutions. Once you start seeing whole seconds, or
> minutes, it might actually matter.

Nice to know, I thought so, now I know so. :-)

So... I'm leaning towards lowering wal_writer_delay and see how it goes.
On Fri, Nov 4, 2011 at 2:45 PM, Claudio Freire <klaussfreire@gmail.com> wrote:
> I don't think 1 second can be such a big difference for the bgwriter,
> but I might be wrong.

Well, the default value is 200 ms. And I've never before heard of
anyone tuning it up, except maybe to save on power consumption on a
system with very low utilization. Nearly always you want to reduce it.

> The wal_writer makes me doubt, though. If logged activity was higher
> than 8MB/s, then that setting would block it all.
> I guess I really should lower it.

Here again, you've set it to ten times the default value. That doesn't
seem like a good idea. I would start with the default and tune down.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company
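For reference, "start with the default" would look like this in postgresql.conf
(these are the stock defaults for the two settings being discussed; anything
lower than that is workload-specific tuning):

bgwriter_delay = 200ms       # stock default, vs. the 1000ms currently set
wal_writer_delay = 200ms     # stock default, vs. the 2000ms currently set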
On 11/04/2011 01:45 PM, Claudio Freire wrote:

> I think you're misinterpreting the value.
> It's in microseconds, that's 10 *milli*seconds.

Wow. My brain totally skimmed over that section. Everything else is in
milliseconds, so I never even considered it. Sorry about that!

I stand by everything else though. ;)

--
Shaun Thomas
OptionsHouse | 141 W. Jackson Blvd. | Suite 800 | Chicago IL, 60604
312-676-8870
sthomas@peak6.com
On Fri, Nov 4, 2011 at 3:54 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Fri, Nov 4, 2011 at 2:45 PM, Claudio Freire <klaussfreire@gmail.com> wrote:
>> I don't think 1 second can be such a big difference for the bgwriter,
>> but I might be wrong.
>
> Well, the default value is 200 ms. And I've never before heard of
> anyone tuning it up, except maybe to save on power consumption on a
> system with very low utilization. Nearly always you want to reduce it.

Will try

>> The wal_writer makes me doubt, though. If logged activity was higher
>> than 8MB/s, then that setting would block it all.
>> I guess I really should lower it.
>
> Here again, you've set it to ten times the default value. That
> doesn't seem like a good idea. I would start with the default and
> tune down.

Already did that. Waiting to see how it turns out.
On Fri, Nov 4, 2011 at 4:07 PM, Claudio Freire <klaussfreire@gmail.com> wrote:
>> Here again, you've set it to ten times the default value. That
>> doesn't seem like a good idea. I would start with the default and
>> tune down.
>
> Already did that. Waiting to see how it turns out.

Nope, still happening with those changes. Though it did make sense that
those settings were too high, it didn't fix the strange blocking.

Is it possible that the query is locking all the tuples it visits,
rather than only the ones it returns? The index used to reach the row
has to scan roughly 3k tuples before finding the one that needs locking.
They're supposed to be in memory already (they're quite hot), which is
why selecting is fast, but maybe it's trying to lock all 3k tuples?

I don't know, I'm just throwing punches blindly at this point.