Re: barnacle (running CLOBBER_CACHE_RECURSIVELY) seems stuck since November - Mailing list pgsql-hackers

From Tomas Vondra
Subject Re: barnacle (running CLOBBER_CACHE_RECURSIVELY) seems stuck since November
Date
Msg-id 551040D8.4040104@2ndquadrant.com
Whole thread Raw
In response to Re: barnacle (running CLOBBER_CACHE_RECURSIVELY) seems stuck since November  (Robert Haas <robertmhaas@gmail.com>)
List pgsql-hackers
On 23.3.2015 16:02, Robert Haas wrote:
> On Sun, Mar 22, 2015 at 6:26 PM, Tomas Vondra
> <tomas.vondra@2ndquadrant.com> wrote:
>> This time I've however checked the log, and what caught my eye is that
>> the last log message is from November. There were regular messages until
>> then (a few messages per day), but since Nov 19 there's nothing.
>
> I'd try strace on the backends, to see if they are issuing any system calls.

Good idea. Both backends do a bunch of open/lseek/read/close - seems
like a pattern:

open("base/16384/pg_filenode.map", O_RDONLY) = 10
read(10,
"\27'Y\0\17\0\0\0\353\4\0\0\235m\0\0\341\4\0\0\3261\0\0\347\4\0\0\3321\0\0"...,
512) = 512
close(10)                               = 0
close(9)                                = 0
open("base/16384/28061", O_RDWR)        = 9
lseek(9, 0, SEEK_END)                   = 155648
lseek(9, 0, SEEK_END)                   = 155648
open("global/pg_filenode.map", O_RDONLY) = 10
read(10,
"\27'Y\0\33\0\0\0\356\4\0\0\243m\0\0\224\v\0\0T2\0\0\275\4\0\0Z2\0\0"..., 512)
= 512
close(10)                               = 0
open("base/16384/pg_filenode.map", O_RDONLY) = 10
read(10,
"\27'Y\0\17\0\0\0\353\4\0\0\235m\0\0\341\4\0\0\3261\0\0\347\4\0\0\3321\0\0"...,
512) = 512
close(10)                               = 0
close(9)                                = 0
open("base/16384/28061", O_RDWR)        = 9
lseek(9, 0, SEEK_END)                   = 155648
lseek(9, 0, SEEK_END)                   = 155648
open("base/16384/28056", O_RDWR)        = 10
lseek(10, 0, SEEK_END)                  = 8192
lseek(10, 0, SEEK_END)                  = 8192
lseek(10, 0, SEEK_END)                  = 8192

It seems to vary a bit over time (or maybe the pattern is longer). Seems
like a consequence of the recursive cache resets, I guess.

Attached a few seconds worth of strace from each backend.


> I'd also try attaching gdb and pulling a back trace, then continue,
> then hit ^C, and pull another backtrace, repeat about 10 times, and
> see if you can figure out where it's stuck.

Hmmm, this seems a bit more interesting - see the backtraces attached.

The first thing that caught my eye is that the back traces are quite
long, and the top part is a sequence of alternating heap_open() and
AcceptInvalidationMessages() messages. Moreover, it seems to be about
the same relation - judging by the relid:

==================================================================
#7  AcceptInvalidationMessages () at inval.c:691
#8  0x00000000007a8eae in LockRelationOid (relid=31196760, lockmode=1)
at lmgr.c:107
#10 heap_open (relationId=31196760, lockmode=1) at heapam.c:1195
#16 AcceptInvalidationMessages () at inval.c:691
#17 0x00000000007a8eae in LockRelationOid (relid=31196760, lockmode=1)
at lmgr.c:107
...
#26 AcceptInvalidationMessages () at inval.c:691
#27 0x00000000007a8eae in LockRelationOid (relid=31196760, lockmode=1)
at lmgr.c:107
...
#36 AcceptInvalidationMessages () at inval.c:691
#37 0x00000000007a8eae in LockRelationOid (relid=31196760, lockmode=1)
at lmgr.c:107
...
#47 AcceptInvalidationMessages () at inval.c:691
#48 0x00000000007a8eae in LockRelationOid (relid=31196760, lockmode=1)
at lmgr.c:107
...
#50 heap_open (relationId=31196760, lockmode=1) at heapam.c:1195
...
#56 AcceptInvalidationMessages () at inval.c:691
#57 0x00000000007a8eae in LockRelationOid (relid=31196760, lockmode=1)
at lmgr.c:107
...
#59 heap_open (relationId=31196760, lockmode=1) at heapam.c:1195
...
#65 AcceptInvalidationMessages () at inval.c:691
#66 0x00000000007a8eae in LockRelationOid (relid=31196760, lockmode=1)
at lmgr.c:107
...
#75 AcceptInvalidationMessages () at inval.c:691
#76 0x00000000007a8eae in LockRelationOid (relid=31196760, lockmode=1)
at lmgr.c:107
...
#78 heap_open (relationId=31196760, lockmode=1) at heapam.c:1195
...
#84 AcceptInvalidationMessages () at inval.c:691
#85 0x00000000007a8eae in LockRelationOid (relid=31196760, lockmode=1)
at lmgr.c:107
==================================================================

The other thing is that there's a fair amount of pointers resolved like
this:

#157 0x00000000007d00ed in PostgresMain (argc=31196760, argv=0x1,
dbname=0x1c86a40 "regression", username=0x3 <Address 0x3 out of bounds>)
at postgres.c:4010

while in the next backtrace from the same backend it looks like this:

#117 0x00000000007d00ed in PostgresMain (argc=12, argv=0xf32ae0,
dbname=0x1c86a40 "regression", username=0xffffffffffffffff <Address
0xffffffffffffffff out of bounds>) at postgres.c:4010

That seems a bit strange - I would not expect this to change once the
backend is started, but maybe I'm missing something.

And finally there's a fair amount of pointers resolved as 0x7f71e74fd770
which seems odd too.



--
Tomas Vondra                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment

pgsql-hackers by date:

Previous
From: Andres Freund
Date:
Subject: Re: PATCH: pgbench - merging transaction logs
Next
From: David Steele
Date:
Subject: Re: Auditing extension for PostgreSQL (Take 2)