Thread: BUG #11444: autovacuum stuck for 5 days and waits on a lock

BUG #11444: autovacuum stuck for 5 days and waits on a lock

From
alexk@hintbits.com
Date:
The following bug has been logged on the website:

Bug reference:      11444
Logged by:          Alexey Klyukin
Email address:      alexk@hintbits.com
PostgreSQL version: 9.3.5
Operating system:   Ubuntu 12.04 3.2.0-52-generic #78-Ubuntu SMP x64
Description:

Greetings,

We've got a problem with an autovacuum process that has been running for 5 days
already, without being blocked by any particular process.
The table in question is around 500 MB. Usually we don't see vacuums running
there for more than a couple of hours.
Other autovacuum processes (on different tables) start and finish without
issues.

The vacuum and autovacuum settings are left at their default values. We don't
have debug symbols installed on this server, but here is the gdb backtrace we
got after attaching to the autovacuum process:

#0  0x00007f423cbf34f7 in semop () from /lib/x86_64-linux-gnu/libc.so.6
#1  0x00007f423e8581f8 in PGSemaphoreLock ()
#2  0x00007f423e889a45 in LockBufferForCleanup ()
#3  0x00007f423e6d9cd6 in ?? ()
#4  0x00007f423e6da08f in ?? ()
#5  0x00007f423e6daaf6 in btbulkdelete ()
#6  0x00007f423e982c7c in FunctionCall4Coll ()
#7  0x00007f423e6d4a93 in index_bulk_delete ()
#8  0x00007f423e7c8322 in ?? ()
#9  0x00007f423e7c937d in lazy_vacuum_rel ()
#10 0x00007f423e7c6a26 in ?? ()
#11 0x00007f423e7c78cc in vacuum ()
#12 0x00007f423e85ac02 in ?? ()
#13 0x00007f423e85b107 in ?? ()
#14 0x00007f423e85b1c2 in StartAutoVacWorker ()
#15 0x00007f423e8683f8 in ?? ()
#16 <signal handler called>
#17 0x00007f423cbea743 in select () from /lib/x86_64-linux-gnu/libc.so.6
#18 0x00007f423e867626 in PostmasterMain ()
#19 0x00007f423e6a3205 in main ()
(gdb)

The strace -p $pid produced the following output at one time:

semop(2523174, {{12, -1, 0}}, 1
semop(2523174, {{12, -1, 0}}, 1)        = 0
lseek(21, 0, SEEK_END)                  = 32276480
select(0, NULL, NULL, NULL, {0, 21000}) = 0 (Timeout)
semop(2523174, {{12, -1, 0}}, 1

Most of the time it is stuck in semop.
The file in the lseek argument (fd 21) is an index on the table autovacuum is
trying to vacuum.
I've tried running pgbreecheck on this index, but it did not show any issues
with it:

NOTICE: fast root: block 412 at level 2
NOTICE: setting page parents
NOTICE: done setting parents
NOTICE: Checking forward scan of level 0, starting at block 1
NOTICE: done checking level 0
NOTICE: checking visited nodes on level 0, starting at block 1
NOTICE: done checking visited nodes on level 0
NOTICE: Checking forward scan of level 1, starting at block 3
NOTICE: done checking level 1
NOTICE: checking visited nodes on level 1, starting at block 3
NOTICE: done checking visited nodes on level 1
NOTICE: Checking forward scan of level 2, starting at block 412
NOTICE: done checking level 2
NOTICE: checking visited nodes on level 2, starting at block 412
NOTICE: done checking visited nodes on level 2
started checking at root 412
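
For reference, the mapping from the lseek file descriptor to this index can be
double-checked from SQL. This is just a sketch: the relfilenode value and the
index name below are placeholders, and the actual number comes from the file
name that readlink /proc/<pid>/fd/21 reports.

```sql
-- readlink /proc/<pid>/fd/21 gives a path whose base name is a numeric
-- relfilenode; map that number back to a relation (16384 is a placeholder):
SELECT relname, relkind
FROM pg_class
WHERE relfilenode = 16384;

-- Or go the other way, from a suspected index to its on-disk path:
SELECT pg_relation_filepath('my_index');
```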

There are also UPDATE statements constantly running on this table in an
overlapping manner, so at any given moment there is at least one UPDATE
running on it. We are investigating why it is done this way, but could this be
the reason behind this strange vacuum behavior?
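
For reference, the check that no other process holds a conflicting heavyweight
lock can be sketched with a query like the one below (names are generic; note
that buffer pins would not show up in pg_locks at all):

```sql
-- Any heavyweight lock the autovacuum worker is waiting for would appear
-- here as an ungranted lock entry, together with the holder's activity.
SELECT l.pid, l.locktype, l.mode, l.relation::regclass, a.query
FROM pg_locks l
JOIN pg_stat_activity a ON a.pid = l.pid
WHERE NOT l.granted;
```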

Re: BUG #11444: autovacuum stuck for 5 days and waits on a lock

From
Andres Freund
Date:
Hi,

On 2014-09-17 16:02:16 +0000, alexk@hintbits.com wrote:
> We've got a problem with an autovacuum process that has been running for 5 days
> already, without being blocked by any particular process.
> The table in question is around 500 MB. Usually we don't see vacuums running
> there for more than a couple of hours.
> Other autovacuum processes (on different tables) start and finish without
> issues.
>
> The vacuum and autovacuum settings are left at their default values. We don't
> have debug symbols installed on this server, but here is the gdb backtrace we
> got after attaching to the autovacuum process:

> #0  0x00007f423cbf34f7 in semop () from /lib/x86_64-linux-gnu/libc.so.6
> #1  0x00007f423e8581f8 in PGSemaphoreLock ()
> #2  0x00007f423e889a45 in LockBufferForCleanup ()
> #3  0x00007f423e6d9cd6 in ?? ()
> #4  0x00007f423e6da08f in ?? ()
> #5  0x00007f423e6daaf6 in btbulkdelete ()

What is happening here is that vacuum (of a btree index) is waiting for a
so-called 'cleanup' lock on a page. That is, it is waiting until the page is
no longer pinned by any other backend.
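
This situation is easy to reproduce with a cursor; the following is a minimal
sketch using a hypothetical table t, not the reporter's actual workload:

```sql
-- Session 1: park a cursor mid-scan; its current page (heap or index
-- leaf) stays pinned for as long as the cursor sits inside the open
-- transaction.
BEGIN;
DECLARE c CURSOR FOR SELECT * FROM t;
FETCH 1 FROM c;

-- Session 2: vacuum now blocks in LockBufferForCleanup() if it needs a
-- cleanup lock on the pinned page, matching the backtrace above.
VACUUM t;
```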

> semop(2523174, {{12, -1, 0}}, 1
> semop(2523174, {{12, -1, 0}}, 1)        = 0
> lseek(21, 0, SEEK_END)                  = 32276480
> select(0, NULL, NULL, NULL, {0, 21000}) = 0 (Timeout)
> semop(2523174, {{12, -1, 0}}, 1
>
> Most of the time it is stuck in semop.

That indicates it is waiting most of the time, but it does make progress.

> There are also UPDATE statements constantly running on this table in an
> overlapping manner, so at any given moment there is at least one UPDATE
> running on it. We are investigating why it is done this way, but could this
> be the reason behind this strange vacuum behavior?

Quite possibly, yes: each of those overlapping scans pins pages, and vacuum
cannot take the cleanup lock on a page while it is pinned.


I've recently commented on -hackers that this behaviour is very hard to debug
and should be made more visible, but IIRC we didn't come to a
conclusion...

Greetings,

Andres Freund

--
 Andres Freund                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Re: BUG #11444: autovacuum stuck for 5 days and waits on a lock

From
Alexey Klyukin
Date:
On Wed, Sep 17, 2014 at 6:12 PM, Andres Freund <andres@2ndquadrant.com> wrote:

>> There are also UPDATE statements constantly running on this table in an
>> overlapping manner, so at any given moment there is at least one UPDATE
>> running on it. We are investigating why it is done this way, but could this
>> be the reason behind this strange vacuum behavior?
>
> Quite possibly, yes: each of those overlapping scans pins pages, and vacuum
> cannot take the cleanup lock on a page while it is pinned.
>
>
> I've recently commented on -hackers that this behaviour is very hard to debug
> and should be made more visible, but IIRC we didn't come to a
> conclusion...

Thank you; we've verified that the index in question is scanned as a
result of the sub-SELECT statement producing the data for the
resulting UPDATE. We are looking into simplifying it to avoid the
long-running scans; hopefully we'll pull this off before transaction
wraparound comes for this cluster :-)

--
Regards,
Alexey Klyukin