Thread: vacuum issues under load?
We have recently discovered a problem with our slony-1 cluster of 8.1.19 installs. Specifically, we are unable to vacuum a table on the master node; vacuum always hangs on the same index of the same table. If we do a slony switchover and make the other node the master, then *it* will become unable to vacuum that index. Vacuum on the slave always works quickly and without issue. Vacuum does not hang anywhere else.

When we tried to strace the vacuuming backend, it appeared as if it was trying to acquire a lock, but pg_locks showed nothing unexpected for that index. We can also manually acquire an access exclusive lock on the offending table if we desire.

FWIW, we have about 1,000 sessions, and while most of them are idle, we still average a couple hundred queries/s. The index in question is a simple unique btree over a column of type character(40).

Has anybody else experienced anything like this? We are hoping this problem magically goes away when we upgrade to 8.4 next month, but it would be great if we could solve it before then. Thanks for any suggestions....
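For reference, the sort of thing we looked at in pg_locks was roughly the following (a sketch only; the index name is a placeholder), and nothing unexpected ever showed up:

    -- look for lock entries touching the index (name is a placeholder)
    SELECT l.pid, l.locktype, l.mode, l.granted
    FROM pg_locks l
    JOIN pg_class c ON c.oid = l.relation
    WHERE c.relname = 'the_unique_idx';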
Ben Chobot <bench@silentmedia.com> writes:
> We have recently discovered a problem with our slony-1 cluster of 8.1.19
> installs. Specifically, we are unable to vacuum a table on the master
> node; vacuum always hangs on the same index of the same table. If we do a
> slony switchover and make the other node the master, then *it* will become
> unable to vacuum that index. Vacuum on the slave always works quickly and
> without issue. Vacuum does not hang anywhere else.

> When we tried to strace the vacuuming backend, it appeared as if it was
> trying to acquire a lock, but pg_lock showed nothing unexpected for that
> index.

Try attaching to the process with gdb and getting a stack trace.

Also ask on the slony lists if anyone's seen anything like this. I don't know a reason for slony to have any effect like that, but there's got to be *something* weird going on.

			regards, tom lane
On Fri, 15 Jan 2010, Tom Lane wrote:

> Ben Chobot <bench@silentmedia.com> writes:
>> We have recently discovered a problem with our slony-1 cluster of 8.1.19
>> installs. Specifically, we are unable to vacuum a table on the master
>> node; vacuum always hangs on the same index of the same table. If we do a
>> slony switchover and make the other node the master, then *it* will become
>> unable to vacuum that index. Vacuum on the slave always works quickly and
>> without issue. Vacuum does not hang anywhere else.
>
>> When we tried to strace the vacuuming backend, it appeared as if it was
>> trying to acquire a lock, but pg_lock showed nothing unexpected for that
>> index.
>
> Try attaching to the process with gdb and getting a stack trace.

#0  0x00007f59a5d78ca7 in semop () from /lib/libc.so.6
#1  0x0000000000546656 in PGSemaphoreLock ()
#2  0x00000000005618ee in LockBufferForCleanup ()
#3  0x000000000045c8e2 in btbulkdelete ()
#4  0x00000000005eb179 in FunctionCall3 ()
#5  0x00000000004ed81a in ?? ()
#6  0x00000000004ee132 in lazy_vacuum_rel ()
#7  0x00000000004ec0e4 in ?? ()
#8  0x00000000004ecdeb in vacuum ()
#9  0x0000000000577837 in ?? ()
#10 0x0000000000578da1 in PortalRun ()
#11 0x0000000000574b2b in ?? ()
#12 0x000000000057611e in PostgresMain ()
#13 0x000000000054e66b in ?? ()
#14 0x000000000054f5d1 in PostmasterMain ()
#15 0x00000000005128de in main ()
Ben Chobot <bench@silentmedia.com> writes:
> On Fri, 15 Jan 2010, Tom Lane wrote:
>> Try attaching to the process with gdb and getting a stack trace.

> #0 0x00007f59a5d78ca7 in semop () from /lib/libc.so.6
> #1 0x0000000000546656 in PGSemaphoreLock ()
> #2 0x00000000005618ee in LockBufferForCleanup ()
> #3 0x000000000045c8e2 in btbulkdelete ()

OK, so it's trying to get an exclusive page-level lock on some page of the index. The only thing that could block that for any long period is if some other process is sitting with an open indexscan. Look around for processes that have been "idle in transaction" for a long time ...

			regards, tom lane
On Fri, 15 Jan 2010, Tom Lane wrote:

>> #0 0x00007f59a5d78ca7 in semop () from /lib/libc.so.6
>> #1 0x0000000000546656 in PGSemaphoreLock ()
>> #2 0x00000000005618ee in LockBufferForCleanup ()
>> #3 0x000000000045c8e2 in btbulkdelete ()
>
> OK, so it's trying to get an exclusive page-level lock on some page of
> the index. The only thing that could block that for any long period is
> if some other process is sitting with an open indexscan. Look around
> for processes that have been "idle in transaction" for a long time ...

There are some, but they've all been running less than a second. There's no better place to look than pg_stat_activity, is there?
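Something along these lines is about all I know to run -- a sketch only, and on 8.1 there's no xact_start column, so query_start is just a rough proxy for how long a transaction has been open:

    -- 8.1-era column names: procpid and current_query
    SELECT procpid, usename, query_start, current_query
    FROM pg_stat_activity
    WHERE current_query = '<IDLE> in transaction'
    ORDER BY query_start;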
Ben Chobot <bench@silentmedia.com> writes:
> On Fri, 15 Jan 2010, Tom Lane wrote:
>> OK, so it's trying to get an exclusive page-level lock on some page of
>> the index. The only thing that could block that for any long period is
>> if some other process is sitting with an open indexscan. Look around
>> for processes that have been "idle in transaction" for a long time ...

> There are some, but they've all been running less than a second. There's
> no better place to look than pg_stat_activity, is there?

Hmph. Given that, what we seem to be looking at is a leaked buffer pin count. There isn't any easy way to fix that except a database restart --- perhaps you can manage one over the weekend? I don't recall having heard of any such bugs lately, but 8.1.x is kind of an old branch. I'm still wondering if slony might've been involved somehow.

			regards, tom lane
On Fri, Jan 15, 2010 at 7:45 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Ben Chobot <bench@silentmedia.com> writes:
>> We have recently discovered a problem with our slony-1 cluster of 8.1.19
>> installs. Specifically, we are unable to vacuum a table on the master
>> node; vacuum always hangs on the same index of the same table. If we do a
>> slony switchover and make the other node the master, then *it* will become
>> unable to vacuum that index. Vacuum on the slave always works quickly and
>> without issue. Vacuum does not hang anywhere else.
>
>> When we tried to strace the vacuuming backend, it appeared as if it was
>> trying to acquire a lock, but pg_lock showed nothing unexpected for that
>> index.
>
> Try attaching to the process with gdb and getting a stack trace.
>
> Also ask on the slony lists if anyone's seen anything like this. I
> don't know a reason for slony to have any effect like that, but there's
> got to be *something* weird going on.

I've seen a situation like this during ddl ops (using the slony execute command to run them on the cluster) where slony halts waiting on vacuum and everything else halts waiting on slony. That was with slony 1.2.

With slony 2.0.3 or so, I had occasional complete lockups of my database that I didn't have time to troubleshoot, as it was a live cluster and I had to restart slony and the databases to get them back up and running.
On Fri, Jan 15, 2010 at 08:41:38AM -0700, Scott Marlowe wrote:
> With slony 2.0.3 or so, I had occasional complete lockups of my
> database that I didn't have time to troubleshoot as it was a live
> cluster and I had to restart slony and the databases to get them back
> up and running.

With slony 2.0.2 I had similar problems. A CREATE INDEX on the slave caused slony on the slave to block while inserting into the table which eventually blocked the server during the log switch (TRUNCATE) which eventually blocked everything else.

It occurs to me that the slony daemon should be able to get the TRUNCATE command to abort if it takes too long.

Have a nice day,
-- 
Martijn van Oosterhout <kleptog@svana.org>   http://svana.org/kleptog/
> Please line up in a tree and maintain the heap invariant while
> boarding. Thank you for flying nlogn airlines.
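PS: to be concrete about "abort if it takes too long", I was imagining the daemon wrapping its log-switch TRUNCATE in a statement_timeout and retrying at the next sync if it fires -- a sketch only, and the table name below is illustrative (the real sl_log tables live in the cluster's own schema):

    -- sketch: bound how long the log-switch TRUNCATE may wait or run
    SET statement_timeout = 30000;   -- 30s, expressed in milliseconds
    TRUNCATE sl_log_1;               -- illustrative; the actual table sits in the _cluster schema
    RESET statement_timeout;
    -- if the timeout fires the statement errors out, and slon could retry
    -- the log switch later instead of blocking everything behind it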
On Mon, Jan 18, 2010 at 12:28 AM, Martijn van Oosterhout <kleptog@svana.org> wrote:
> On Fri, Jan 15, 2010 at 08:41:38AM -0700, Scott Marlowe wrote:
>> With slony 2.0.3 or so, I had occasional complete lockups of my
>> database that I didn't have time to troubleshoot as it was a live
>> cluster and I had to restart slony and the databases to get them back
>> up and running.
>
> With slony 2.0.2 I had similar problems. A CREATE INDEX on the slave
> caused slony on the slave to block while inserting into the table which
> eventually blocked the server during the log switch (TRUNCATE) which
> eventually blocked everything else.

You could have used create index interactively.

> It occurs to me that the slony daemon should be able to get the
> TRUNCATE command to abort if it takes too long.

No, I don't want it stopping my truncates.

The problem I ran into was on a db with no index creation, no truncates, nothing like that going on. It runs fine for a week or two, then hangs hard. No updates, no selects, nothing works. There are no locks in the way that I can see, just a hung database.
On Mon, Jan 18, 2010 at 6:35 PM, Scott Marlowe <scott.marlowe@gmail.com> wrote: > You could have used create index interactively. s/interactively/concurrently/
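For anyone following along, that's roughly the following -- placeholder names, and note it only exists as of 8.2 and can't run inside a transaction block:

    CREATE INDEX CONCURRENTLY some_idx ON some_table (some_col);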
On Mon, Jan 18, 2010 at 6:35 PM, Scott Marlowe <scott.marlowe@gmail.com> wrote:
> On Mon, Jan 18, 2010 at 12:28 AM, Martijn van Oosterhout
> <kleptog@svana.org> wrote:
>> On Fri, Jan 15, 2010 at 08:41:38AM -0700, Scott Marlowe wrote:
>>> With slony 2.0.3 or so, I had occasional complete lockups of my
>>> database that I didn't have time to troubleshoot as it was a live
>>> cluster and I had to restart slony and the databases to get them back
>>> up and running.
>>
>> With slony 2.0.2 I had similar problems. A CREATE INDEX on the slave
>> caused slony on the slave to block while inserting into the table which
>> eventually blocked the server during the log switch (TRUNCATE) which
>> eventually blocked everything else.
>
> You could have used create index interactively.
>
>> It occurs to me that the slony daemon should be able to get the
>> TRUNCATE command to abort if it takes too long.
>
> No, I don't want it stopping my truncates.
>
> The problem I ran into was on a db with no index creation, no
> truncates, nothing like that going on. It runs fine for a week or two,
> then hangs hard. No updates, no selects, nothing works. There are no
> locks in the way that I can see, just a hung database.

Wait a minute, I'm describing the scenario we had with 2.0.3, where it would just hang. The vacuum issue is different: vacuum starts, slony goes to lock the table for ddl, waits on the lock held by vacuum, and then everything backs up behind slony. With large numbers of tables, the problem gets much worse much faster.
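For whatever it's worth, a rough way to see that kind of chain is a self-join on pg_locks -- a sketch only, and it only catches relation-level lock waits; the buffer-pin wait the original vacuum is stuck on never shows up in pg_locks at all, which is exactly why it looked like nothing was blocking:

    -- who is waiting on a relation lock, and who else holds a lock on that relation
    SELECT waiter.pid  AS waiting_pid,
           holder.pid  AS holding_pid,
           waiter.mode AS wanted_mode,
           holder.mode AS held_mode,
           c.relname
    FROM pg_locks waiter
    JOIN pg_locks holder
      ON holder.granted
     AND holder.relation = waiter.relation
     AND holder.pid <> waiter.pid
    JOIN pg_class c ON c.oid = waiter.relation
    WHERE NOT waiter.granted;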