Re: barnacle (running CLOBBER_CACHE_RECURSIVELY) seems stuck since November - Mailing list pgsql-hackers
From: Tomas Vondra
Subject: Re: barnacle (running CLOBBER_CACHE_RECURSIVELY) seems stuck since November
Msg-id: 551040D8.4040104@2ndquadrant.com
In response to: Re: barnacle (running CLOBBER_CACHE_RECURSIVELY) seems stuck since November (Robert Haas <robertmhaas@gmail.com>)
List: pgsql-hackers
On 23.3.2015 16:02, Robert Haas wrote:
> On Sun, Mar 22, 2015 at 6:26 PM, Tomas Vondra
> <tomas.vondra@2ndquadrant.com> wrote:
>> This time I've however checked the log, and what caught my eye is that
>> the last log message is from November. There were regular messages until
>> then (a few messages per day), but since Nov 19 there's nothing.
>
> I'd try strace on the backends, to see if they are issuing any system calls.

Good idea. Both backends do a bunch of open/lseek/read/close calls - it looks like a repeating pattern:

open("base/16384/pg_filenode.map", O_RDONLY) = 10
read(10, "\27'Y\0\17\0\0\0\353\4\0\0\235m\0\0\341\4\0\0\3261\0\0\347\4\0\0\3321\0\0"..., 512) = 512
close(10) = 0
close(9) = 0
open("base/16384/28061", O_RDWR) = 9
lseek(9, 0, SEEK_END) = 155648
lseek(9, 0, SEEK_END) = 155648
open("global/pg_filenode.map", O_RDONLY) = 10
read(10, "\27'Y\0\33\0\0\0\356\4\0\0\243m\0\0\224\v\0\0T2\0\0\275\4\0\0Z2\0\0"..., 512) = 512
close(10) = 0
open("base/16384/pg_filenode.map", O_RDONLY) = 10
read(10, "\27'Y\0\17\0\0\0\353\4\0\0\235m\0\0\341\4\0\0\3261\0\0\347\4\0\0\3321\0\0"..., 512) = 512
close(10) = 0
close(9) = 0
open("base/16384/28061", O_RDWR) = 9
lseek(9, 0, SEEK_END) = 155648
lseek(9, 0, SEEK_END) = 155648
open("base/16384/28056", O_RDWR) = 10
lseek(10, 0, SEEK_END) = 8192
lseek(10, 0, SEEK_END) = 8192
lseek(10, 0, SEEK_END) = 8192

It seems to vary a bit over time (or maybe the pattern is just longer than this excerpt). Seems like a consequence of the recursive cache resets, I guess. Attached a few seconds worth of strace from each backend.
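(For anyone who wants to watch one of these backends directly: a few seconds of timestamped trace is enough to see the loop, e.g. with 12345 standing in for the real backend PID:

timeout 5 strace -tt -p 12345 -o backend-12345.trace

The -tt timestamps make it easy to tell whether the loop is merely slow or completely stuck; timeout detaches strace after five seconds and leaves the backend running.)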
> I'd also try attaching gdb and pulling a back trace, then continue,
> then hit ^C, and pull another backtrace, repeat about 10 times, and
> see if you can figure out where it's stuck.

Hmmm, this seems a bit more interesting - see the backtraces attached. The first thing that caught my eye is that the backtraces are quite long, and the top part is a sequence of alternating heap_open() and AcceptInvalidationMessages() frames. Moreover, it all seems to involve the same relation, judging by the relid:

==================================================================
#7  AcceptInvalidationMessages () at inval.c:691
#8  0x00000000007a8eae in LockRelationOid (relid=31196760, lockmode=1) at lmgr.c:107
#10 heap_open (relationId=31196760, lockmode=1) at heapam.c:1195
#16 AcceptInvalidationMessages () at inval.c:691
#17 0x00000000007a8eae in LockRelationOid (relid=31196760, lockmode=1) at lmgr.c:107
...
#26 AcceptInvalidationMessages () at inval.c:691
#27 0x00000000007a8eae in LockRelationOid (relid=31196760, lockmode=1) at lmgr.c:107
...
#36 AcceptInvalidationMessages () at inval.c:691
#37 0x00000000007a8eae in LockRelationOid (relid=31196760, lockmode=1) at lmgr.c:107
...
#47 AcceptInvalidationMessages () at inval.c:691
#48 0x00000000007a8eae in LockRelationOid (relid=31196760, lockmode=1) at lmgr.c:107
...
#50 heap_open (relationId=31196760, lockmode=1) at heapam.c:1195
...
#56 AcceptInvalidationMessages () at inval.c:691
#57 0x00000000007a8eae in LockRelationOid (relid=31196760, lockmode=1) at lmgr.c:107
...
#59 heap_open (relationId=31196760, lockmode=1) at heapam.c:1195
...
#65 AcceptInvalidationMessages () at inval.c:691
#66 0x00000000007a8eae in LockRelationOid (relid=31196760, lockmode=1) at lmgr.c:107
...
#75 AcceptInvalidationMessages () at inval.c:691
#76 0x00000000007a8eae in LockRelationOid (relid=31196760, lockmode=1) at lmgr.c:107
...
#78 heap_open (relationId=31196760, lockmode=1) at heapam.c:1195
...
#84 AcceptInvalidationMessages () at inval.c:691
#85 0x00000000007a8eae in LockRelationOid (relid=31196760, lockmode=1) at lmgr.c:107
==================================================================
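If I understand the CLOBBER_CACHE_RECURSIVELY machinery correctly, that alternation is pretty much what the flag is designed to produce. Here is a toy model of the cycle - to be clear, this is not the actual PostgreSQL source, just the shape of the recursion the frames suggest:

#include <stdio.h>

/* Toy model of the cycle in the backtraces - NOT PostgreSQL source. */

static void heap_open(unsigned relid, int depth);

static void AcceptInvalidationMessages(int depth)
{
    /* CLOBBER_CACHE_RECURSIVELY resets the caches on every call, and
     * rebuilding the flushed catalog caches means opening (and hence
     * locking) catalog relations all over again */
    if (depth >= 4)                    /* the real recursion goes far deeper */
        return;
    heap_open(31196760u, depth + 1);   /* same relid as in the traces */
}

static void LockRelationOid(unsigned relid, int depth)
{
    printf("%*slock rel %u, absorb pending invalidations\n",
           depth * 2, "", relid);
    /* taking a lock we didn't already hold processes pending sinval
     * messages, so cached state can never be stale under the new lock */
    AcceptInvalidationMessages(depth);
}

static void heap_open(unsigned relid, int depth)
{
    LockRelationOid(relid, depth);     /* a relation is locked before use */
}

int main(void)
{
    heap_open(31196760u, 0);
    return 0;
}

If that's what is happening, the backend isn't deadlocked at all - it's making progress at a glacial pace, which would also fit the busy strace output above.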
The other thing is that a fair number of pointers are resolved like this:

#157 0x00000000007d00ed in PostgresMain (argc=31196760, argv=0x1, dbname=0x1c86a40 "regression", username=0x3 <Address 0x3 out of bounds>) at postgres.c:4010

while in the next backtrace from the same backend the same frame looks like this:

#117 0x00000000007d00ed in PostgresMain (argc=12, argv=0xf32ae0, dbname=0x1c86a40 "regression", username=0xffffffffffffffff <Address 0xffffffffffffffff out of bounds>) at postgres.c:4010

That seems a bit strange - I would not expect these arguments to change once the backend has started, but maybe I'm missing something. And finally, a fair number of pointers resolve to 0x7f71e74fd770, which seems odd too.

--
Tomas Vondra                      http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services