Thread: vacuum issues under load?

vacuum issues under load?

From
Ben Chobot
Date:
We have recently discovered a problem with our slony-1 cluster of 8.1.19
installs. Specifically, we are unable to vacuum a table on the master
node; vacuum always hangs on the same index of the same table. If we do a
slony switchover and make the other node the master, then *it* will become
unable to vacuum that index. Vacuum on the slave always works quickly and
without issue. Vacuum does not hang anywhere else.

When we tried to strace the vacuuming backend, it appeared as if it was
trying to acquire a lock, but pg_lock showed nothing unexpected for that
index. We can also manually acquire an access exclusive lock on the
offending table if we desire. FWIW, we have about 1,000 sessions, and
while most of them are idle, we still average a couple hundred queries/s.
The index in question is a simple unique btree over a column of type
character(40).

Has anybody else experienced anything like this? We are hoping this
problem magically goes away when we upgrade to 8.4 next month, but it
would be great if we could solve it before then.

Thanks for any suggestions....

Re: vacuum issues under load?

From
Tom Lane
Date:
Ben Chobot <bench@silentmedia.com> writes:
> We have recently discovered a problem with our slony-1 cluster of 8.1.19
> installs. Specifically, we are unable to vacuum a table on the master
> node; vacuum always hangs on the same index of the same table. If we do a
> slony switchover and make the other node the master, then *it* will become
> unable to vacuum that index. Vacuum on the slave always works quickly and
> without issue. Vacuum does not hang anywhere else.

> When we tried to strace the vacuuming backend, it appeared as if it was
> trying to acquire a lock, but pg_lock showed nothing unexpected for that
> index.

Try attaching to the process with gdb and getting a stack trace.

Also ask on the slony lists if anyone's seen anything like this.  I
don't know a reason for slony to have any effect like that, but there's
got to be *something* weird going on.

            regards, tom lane

Re: vacuum issues under load?

From
Ben Chobot
Date:
On Fri, 15 Jan 2010, Tom Lane wrote:

> Ben Chobot <bench@silentmedia.com> writes:
>> We have recently discovered a problem with our slony-1 cluster of 8.1.19
>> installs. Specifically, we are unable to vacuum a table on the master
>> node; vacuum always hangs on the same index of the same table. If we do a
>> slony switchover and make the other node the master, then *it* will become
>> unable to vacuum that index. Vacuum on the slave always works quickly and
>> without issue. Vacuum does not hang anywhere else.
>
>> When we tried to strace the vacuuming backend, it appeared as if it was
>> trying to acquire a lock, but pg_lock showed nothing unexpected for that
>> index.
>
> Try attaching to the process with gdb and getting a stack trace.

#0  0x00007f59a5d78ca7 in semop () from /lib/libc.so.6
#1  0x0000000000546656 in PGSemaphoreLock ()
#2  0x00000000005618ee in LockBufferForCleanup ()
#3  0x000000000045c8e2 in btbulkdelete ()
#4  0x00000000005eb179 in FunctionCall3 ()
#5  0x00000000004ed81a in ?? ()
#6  0x00000000004ee132 in lazy_vacuum_rel ()
#7  0x00000000004ec0e4 in ?? ()
#8  0x00000000004ecdeb in vacuum ()
#9  0x0000000000577837 in ?? ()
#10 0x0000000000578da1 in PortalRun ()
#11 0x0000000000574b2b in ?? ()
#12 0x000000000057611e in PostgresMain ()
#13 0x000000000054e66b in ?? ()
#14 0x000000000054f5d1 in PostmasterMain ()
#15 0x00000000005128de in main ()



Re: vacuum issues under load?

From
Tom Lane
Date:
Ben Chobot <bench@silentmedia.com> writes:
> On Fri, 15 Jan 2010, Tom Lane wrote:
>> Try attaching to the process with gdb and getting a stack trace.

> #0  0x00007f59a5d78ca7 in semop () from /lib/libc.so.6
> #1  0x0000000000546656 in PGSemaphoreLock ()
> #2  0x00000000005618ee in LockBufferForCleanup ()
> #3  0x000000000045c8e2 in btbulkdelete ()

OK, so it's trying to get an exclusive page-level lock on some page of
the index.  The only thing that could block that for any long period is
if some other process is sitting with an open indexscan.  Look around
for processes that have been "idle in transaction" for a long time ...

            regards, tom lane

Re: vacuum issues under load?

From
Ben Chobot
Date:
On Fri, 15 Jan 2010, Tom Lane wrote:

>> #0  0x00007f59a5d78ca7 in semop () from /lib/libc.so.6
>> #1  0x0000000000546656 in PGSemaphoreLock ()
>> #2  0x00000000005618ee in LockBufferForCleanup ()
>> #3  0x000000000045c8e2 in btbulkdelete ()
>
> OK, so it's trying to get an exclusive page-level lock on some page of
> the index.  The only thing that could block that for any long period is
> if some other process is sitting with an open indexscan.  Look around
> for processes that have been "idle in transaction" for a long time ...

There are some, but they've all been running less than a second. There's
no better place to look than pg_stat_activity, is there?


Re: vacuum issues under load?

From
Tom Lane
Date:
Ben Chobot <bench@silentmedia.com> writes:
> On Fri, 15 Jan 2010, Tom Lane wrote:
>> OK, so it's trying to get an exclusive page-level lock on some page of
>> the index.  The only thing that could block that for any long period is
>> if some other process is sitting with an open indexscan.  Look around
>> for processes that have been "idle in transaction" for a long time ...

> There are some, but they've all been running less than a second. There's
> no better place to look than pg_stat_activity, is there?

Hmph.  Given that, what we seem to be looking at is a leaked buffer pin
count.  There isn't any easy way to fix that except a database restart
--- perhaps you can manage one over the weekend?

I don't recall having heard of any such bugs lately, but 8.1.x is kind
of an old branch.  I'm still wondering if slony might've been involved
somehow.

            regards, tom lane

Re: vacuum issues under load?

From
Scott Marlowe
Date:
On Fri, Jan 15, 2010 at 7:45 AM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Ben Chobot <bench@silentmedia.com> writes:
>> We have recently discovered a problem with our slony-1 cluster of 8.1.19
>> installs. Specifically, we are unable to vacuum a table on the master
>> node; vacuum always hangs on the same index of the same table. If we do a
>> slony switchover and make the other node the master, then *it* will become
>> unable to vacuum that index. Vacuum on the slave always works quickly and
>> without issue. Vacuum does not hang anywhere else.
>
>> When we tried to strace the vacuuming backend, it appeared as if it was
>> trying to acquire a lock, but pg_lock showed nothing unexpected for that
>> index.
>
> Try attaching to the process with gdb and getting a stack trace.
>
> Also ask on the slony lists if anyone's seen anything like this.  I
> don't know a reason for slony to have any effect like that, but there's
> got to be *something* weird going on.

I've seen a situation like there during ddl ops (using the slony
execute command to run them on the cluster) where slony halts waiting
on vacuum and everything else halts waiting on slony.  That was with
slony 1.2

With slony 2.0.3 or so, I had occasional complete lockups of my
database that I didn't have time to troubleshoot as it was a live
cluster and I had to restart slony and the databases to get them back
up and running.

Re: vacuum issues under load?

From
Martijn van Oosterhout
Date:
On Fri, Jan 15, 2010 at 08:41:38AM -0700, Scott Marlowe wrote:
> With slony 2.0.3 or so, I had occasional complete lockups of my
> database that I didn't have time to troubleshoot as it was a live
> cluster and I had to restart slony and the databases to get them back
> up and running.

With slony 2.0.2 I had similar problems. A CREATE INDEX on the slave
caused slony on the slave to block while inserting into the table which
eventually blocked the server during the log switch (TRUNCATE) which
eventually blocked everything else.

It occurs to me that the slony daemon should be able to get the
TRUNCATE command to abort if it takes too long.

Have a nice day,
--
Martijn van Oosterhout   <kleptog@svana.org>   http://svana.org/kleptog/
> Please line up in a tree and maintain the heap invariant while
> boarding. Thank you for flying nlogn airlines.

Attachment

Re: vacuum issues under load?

From
Scott Marlowe
Date:
On Mon, Jan 18, 2010 at 12:28 AM, Martijn van Oosterhout
<kleptog@svana.org> wrote:
> On Fri, Jan 15, 2010 at 08:41:38AM -0700, Scott Marlowe wrote:
>> With slony 2.0.3 or so, I had occasional complete lockups of my
>> database that I didn't have time to troubleshoot as it was a live
>> cluster and I had to restart slony and the databases to get them back
>> up and running.
>
> With slony 2.0.2 I had similar problems. A CREATE INDEX on the slave
> caused slony on the slave to block while inserting into the table which
> eventually blocked the server during the log switch (TRUNCATE) which
> eventually blocked everything else.

You could have used create index interactively.

> It occurs to me that the slony daemon should be able to get the
> TRUNCATE command to abort if it takes too long.

No, I don't want it stopping my truncates.

The problem I ran into was on a db with no creating indexes, no
truncates nothing like that going on.  It runs fine for a week or two,
then hangs hard.  No updates, no selects, nothing works.  There are no
locks in the way that I can see, just hung database.

Re: vacuum issues under load?

From
Scott Marlowe
Date:
On Mon, Jan 18, 2010 at 6:35 PM, Scott Marlowe <scott.marlowe@gmail.com> wrote:
> You could have used create index interactively.

s/interactively/concurrently/

Re: vacuum issues under load?

From
Scott Marlowe
Date:
On Mon, Jan 18, 2010 at 6:35 PM, Scott Marlowe <scott.marlowe@gmail.com> wrote:
> On Mon, Jan 18, 2010 at 12:28 AM, Martijn van Oosterhout
> <kleptog@svana.org> wrote:
>> On Fri, Jan 15, 2010 at 08:41:38AM -0700, Scott Marlowe wrote:
>>> With slony 2.0.3 or so, I had occasional complete lockups of my
>>> database that I didn't have time to troubleshoot as it was a live
>>> cluster and I had to restart slony and the databases to get them back
>>> up and running.
>>
>> With slony 2.0.2 I had similar problems. A CREATE INDEX on the slave
>> caused slony on the slave to block while inserting into the table which
>> eventually blocked the server during the log switch (TRUNCATE) which
>> eventually blocked everything else.
>
> You could have used create index interactively.
>
>> It occurs to me that the slony daemon should be able to get the
>> TRUNCATE command to abort if it takes too long.
>
> No, I don't want it stopping my truncates.
>
> The problem I ran into was on a db with no creating indexes, no
> truncates nothing like that going on.  It runs fine for a week or two,
> then hangs hard.  No updates, no selects, nothing works.  There are no
> locks in the way that I can see, just hung database.

Wait a minute, I'm describing the scenario we had with 2.0.3, where it
would just hang.

The vacuum issue is different.  Vacuum starts, slony goes to lock the
table for ddl, waits on lock from vacuum, then everything backs up
behind slony.  With large numbers of tables, the problem gets much
worse much faster.