On 23.3.2015 16:02, Robert Haas wrote:
> On Sun, Mar 22, 2015 at 6:26 PM, Tomas Vondra
> <tomas.vondra@2ndquadrant.com> wrote:
>> This time I've however checked the log, and what caught my eye is that
>> the last log message is from November. There were regular messages until
>> then (a few messages per day), but since Nov 19 there's nothing.
>
> I'd try strace on the backends, to see if they are issuing any system calls.
Good idea. Both backends do a bunch of open/lseek/read/close - seems
like a pattern:
open("base/16384/pg_filenode.map", O_RDONLY) = 10
read(10,
"\27'Y\0\17\0\0\0\353\4\0\0\235m\0\0\341\4\0\0\3261\0\0\347\4\0\0\3321\0\0"...,
512) = 512
close(10) = 0
close(9) = 0
open("base/16384/28061", O_RDWR) = 9
lseek(9, 0, SEEK_END) = 155648
lseek(9, 0, SEEK_END) = 155648
open("global/pg_filenode.map", O_RDONLY) = 10
read(10,
"\27'Y\0\33\0\0\0\356\4\0\0\243m\0\0\224\v\0\0T2\0\0\275\4\0\0Z2\0\0"..., 512)
= 512
close(10) = 0
open("base/16384/pg_filenode.map", O_RDONLY) = 10
read(10,
"\27'Y\0\17\0\0\0\353\4\0\0\235m\0\0\341\4\0\0\3261\0\0\347\4\0\0\3321\0\0"...,
512) = 512
close(10) = 0
close(9) = 0
open("base/16384/28061", O_RDWR) = 9
lseek(9, 0, SEEK_END) = 155648
lseek(9, 0, SEEK_END) = 155648
open("base/16384/28056", O_RDWR) = 10
lseek(10, 0, SEEK_END) = 8192
lseek(10, 0, SEEK_END) = 8192
lseek(10, 0, SEEK_END) = 8192
It seems to vary a bit over time (or maybe the pattern is longer). Seems
like a consequence of the recursive cache resets, I guess.
Attached a few seconds worth of strace from each backend.
> I'd also try attaching gdb and pulling a back trace, then continue,
> then hit ^C, and pull another backtrace, repeat about 10 times, and
> see if you can figure out where it's stuck.
Hmmm, this seems a bit more interesting - see the backtraces attached.
The first thing that caught my eye is that the back traces are quite
long, and the top part is a sequence of alternating heap_open() and
AcceptInvalidationMessages() messages. Moreover, it seems to be about
the same relation - judging by the relid:
==================================================================
#7 AcceptInvalidationMessages () at inval.c:691
#8 0x00000000007a8eae in LockRelationOid (relid=31196760, lockmode=1)
at lmgr.c:107
#10 heap_open (relationId=31196760, lockmode=1) at heapam.c:1195
#16 AcceptInvalidationMessages () at inval.c:691
#17 0x00000000007a8eae in LockRelationOid (relid=31196760, lockmode=1)
at lmgr.c:107
...
#26 AcceptInvalidationMessages () at inval.c:691
#27 0x00000000007a8eae in LockRelationOid (relid=31196760, lockmode=1)
at lmgr.c:107
...
#36 AcceptInvalidationMessages () at inval.c:691
#37 0x00000000007a8eae in LockRelationOid (relid=31196760, lockmode=1)
at lmgr.c:107
...
#47 AcceptInvalidationMessages () at inval.c:691
#48 0x00000000007a8eae in LockRelationOid (relid=31196760, lockmode=1)
at lmgr.c:107
...
#50 heap_open (relationId=31196760, lockmode=1) at heapam.c:1195
...
#56 AcceptInvalidationMessages () at inval.c:691
#57 0x00000000007a8eae in LockRelationOid (relid=31196760, lockmode=1)
at lmgr.c:107
...
#59 heap_open (relationId=31196760, lockmode=1) at heapam.c:1195
...
#65 AcceptInvalidationMessages () at inval.c:691
#66 0x00000000007a8eae in LockRelationOid (relid=31196760, lockmode=1)
at lmgr.c:107
...
#75 AcceptInvalidationMessages () at inval.c:691
#76 0x00000000007a8eae in LockRelationOid (relid=31196760, lockmode=1)
at lmgr.c:107
...
#78 heap_open (relationId=31196760, lockmode=1) at heapam.c:1195
...
#84 AcceptInvalidationMessages () at inval.c:691
#85 0x00000000007a8eae in LockRelationOid (relid=31196760, lockmode=1)
at lmgr.c:107
==================================================================
The other thing is that there's a fair amount of pointers resolved like
this:
#157 0x00000000007d00ed in PostgresMain (argc=31196760, argv=0x1,
dbname=0x1c86a40 "regression", username=0x3 <Address 0x3 out of bounds>)
at postgres.c:4010
while in the next backtrace from the same backend it looks like this:
#117 0x00000000007d00ed in PostgresMain (argc=12, argv=0xf32ae0,
dbname=0x1c86a40 "regression", username=0xffffffffffffffff <Address
0xffffffffffffffff out of bounds>) at postgres.c:4010
That seems a bit strange - I would not expect this to change once the
backend is started, but maybe I'm missing something.
And finally there's a fair amount of pointers resolved as 0x7f71e74fd770
which seems odd too.
--
Tomas Vondra http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services