Thread: barnacle (running CLOBBER_CACHE_RECURSIVELY) seems stuck since November

barnacle (running CLOBBER_CACHE_RECURSIVELY) seems stuck since November

From

Tomas Vondra

Date:

22 March 2015, 22:27:18

Hi,

I've been checking progress on barnacle, one of the animals running with
CLOBBER_CACHE_RECURSIVELY. It's running for ~170 days (250k minutes).

This time I've however checked the log, and what caught my eye is that
the last log message is from November. There were regular messages until
then (a few messages per day), but since Nov 19 there's nothing.

The last few messages are these:

2014-11-28 22:19:24 CET 7953 [540097d4.1f11:532] LOG:  statement: CREATE
FUNCTION city_insert() RETURNS trigger LANGUAGE plpgsql AS $$
2014-11-28 22:19:27 CET 7953 [540097d4.1f11:533] LOG:  statement: CREATE
TRIGGER city_insert_trig INSTEAD OF INSERT ON city_view
2014-11-28 22:25:13 CET 7953 [540097d4.1f11:534] LOG:  statement: CREATE
FUNCTION city_delete() RETURNS trigger LANGUAGE plpgsql AS $$
2014-11-28 22:25:15 CET 7953 [540097d4.1f11:535] LOG:  statement: CREATE
TRIGGER city_delete_trig INSTEAD OF DELETE ON city_view
2014-11-29 03:10:01 CET 7953 [540097d4.1f11:536] LOG:  statement: CREATE
FUNCTION city_update() RETURNS trigger LANGUAGE plpgsql AS $$
2014-11-29 03:10:03 CET 7953 [540097d4.1f11:537] LOG:  statement: CREATE
TRIGGER city_update_trig INSTEAD OF UPDATE ON city_view
2014-11-29 07:44:50 CET 7953 [540097d4.1f11:538] LOG:  statement: INSERT
INTO city_view(city_name) VALUES('Tokyo') RETURNING *;

which matches commands halfway-through 'triggers' tests.

There are currently two queries running - one from 'triggers', one from
'updatable_views':

-[ RECORD 1 ]----+----------------------------------------------
 datid            | 16384
 datname          | regression
 pid              | 7953
 usesysid         | 10
 usename          | pgbuild
 application_name | pg_regress
 client_addr      |
 client_hostname  |
 client_port      | -1
 backend_start    | 2014-08-29 17:10:12.243979+02
 xact_start       | 2014-11-29 07:44:50.452678+01
 query_start      | 2014-11-29 07:44:50.452678+01
 state_change     | 2014-11-29 07:44:50.45268+01
 waiting          | f
 state            | active
 backend_xid      |
 backend_xmin     | 3989
 query            | INSERT INTO city_view(city_name) VALUES('Tokyo')
RETURNING *;
 -[ RECORD 2 ]----+----------------------------------------------
 datid            | 16384
 datname          | regression
 pid              | 7956
 usesysid         | 10
 usename          | pgbuild
 application_name | pg_regress
 client_addr      |
 client_hostname  |
 client_port      | -1
 backend_start    | 2014-08-29 17:10:12.272223+02
 xact_start       | 2014-10-06 15:35:25.822488+02
 query_start      | 2014-10-06 15:35:25.822488+02
 state_change     | 2014-10-06 15:35:25.822491+02
 waiting          | f
 state            | active
 backend_xid      |
 backend_xmin     | 3862
 query            | UPDATE rw_view2 SET b='Row three' WHERE a=3 RETURNING *;

Top currently looks like this:

  PID USER      PR  NI  VIRT  RES  SHR S %CPU %MEM    TIME+  COMMAND
 7956 pgbuild   20   0  242m  13m  10m R 98.8  0.2 251152:27 postgres:
pgbuild regression [local] UPDATE
 7953 pgbuild   20   0 1023m 785m  11m R 98.4 10.0 248806:18 postgres:
pgbuild regression [local] INSERT

Both backends started on August 29, but the 'updatable_views' query is
running since October 6. 5 months seems like a rather long runtime, even
with CLOBBER_CACHE_RECURSIVELY).

Not sure if it's relevant, but this is the machine with Intel C compiler
(2013).

Attached is a memory context info for the 7953 backend (with more than
1GB VIRT) - not sure if it's relevant, so just in case.


--
Tomas Vondra                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Attachment

memory.log

Re: barnacle (running CLOBBER_CACHE_RECURSIVELY) seems stuck since November

From

Robert Haas

Date:

23 March 2015, 15:02:25

On Sun, Mar 22, 2015 at 6:26 PM, Tomas Vondra
<tomas.vondra@2ndquadrant.com> wrote:
> This time I've however checked the log, and what caught my eye is that
> the last log message is from November. There were regular messages until
> then (a few messages per day), but since Nov 19 there's nothing.

I'd try strace on the backends, to see if they are issuing any system calls.

I'd also try attaching gdb and pulling a back trace, then continue,
then hit ^C, and pull another backtrace, repeat about 10 times, and
see if you can figure out where it's stuck.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: barnacle (running CLOBBER_CACHE_RECURSIVELY) seems stuck since November

From

Tomas Vondra

Date:

23 March 2015, 16:36:12

On 23.3.2015 16:02, Robert Haas wrote:
> On Sun, Mar 22, 2015 at 6:26 PM, Tomas Vondra
> <tomas.vondra@2ndquadrant.com> wrote:
>> This time I've however checked the log, and what caught my eye is that
>> the last log message is from November. There were regular messages until
>> then (a few messages per day), but since Nov 19 there's nothing.
>
> I'd try strace on the backends, to see if they are issuing any system calls.

Good idea. Both backends do a bunch of open/lseek/read/close - seems
like a pattern:

open("base/16384/pg_filenode.map", O_RDONLY) = 10
read(10,
"\27'Y\0\17\0\0\0\353\4\0\0\235m\0\0\341\4\0\0\3261\0\0\347\4\0\0\3321\0\0"...,
512) = 512
close(10)                               = 0
close(9)                                = 0
open("base/16384/28061", O_RDWR)        = 9
lseek(9, 0, SEEK_END)                   = 155648
lseek(9, 0, SEEK_END)                   = 155648
open("global/pg_filenode.map", O_RDONLY) = 10
read(10,
"\27'Y\0\33\0\0\0\356\4\0\0\243m\0\0\224\v\0\0T2\0\0\275\4\0\0Z2\0\0"..., 512)
= 512
close(10)                               = 0
open("base/16384/pg_filenode.map", O_RDONLY) = 10
read(10,
"\27'Y\0\17\0\0\0\353\4\0\0\235m\0\0\341\4\0\0\3261\0\0\347\4\0\0\3321\0\0"...,
512) = 512
close(10)                               = 0
close(9)                                = 0
open("base/16384/28061", O_RDWR)        = 9
lseek(9, 0, SEEK_END)                   = 155648
lseek(9, 0, SEEK_END)                   = 155648
open("base/16384/28056", O_RDWR)        = 10
lseek(10, 0, SEEK_END)                  = 8192
lseek(10, 0, SEEK_END)                  = 8192
lseek(10, 0, SEEK_END)                  = 8192

It seems to vary a bit over time (or maybe the pattern is longer). Seems
like a consequence of the recursive cache resets, I guess.

Attached a few seconds worth of strace from each backend.


> I'd also try attaching gdb and pulling a back trace, then continue,
> then hit ^C, and pull another backtrace, repeat about 10 times, and
> see if you can figure out where it's stuck.

Hmmm, this seems a bit more interesting - see the backtraces attached.

The first thing that caught my eye is that the back traces are quite
long, and the top part is a sequence of alternating heap_open() and
AcceptInvalidationMessages() messages. Moreover, it seems to be about
the same relation - judging by the relid:

==================================================================
#7  AcceptInvalidationMessages () at inval.c:691
#8  0x00000000007a8eae in LockRelationOid (relid=31196760, lockmode=1)
at lmgr.c:107
#10 heap_open (relationId=31196760, lockmode=1) at heapam.c:1195
#16 AcceptInvalidationMessages () at inval.c:691
#17 0x00000000007a8eae in LockRelationOid (relid=31196760, lockmode=1)
at lmgr.c:107
...
#26 AcceptInvalidationMessages () at inval.c:691
#27 0x00000000007a8eae in LockRelationOid (relid=31196760, lockmode=1)
at lmgr.c:107
...
#36 AcceptInvalidationMessages () at inval.c:691
#37 0x00000000007a8eae in LockRelationOid (relid=31196760, lockmode=1)
at lmgr.c:107
...
#47 AcceptInvalidationMessages () at inval.c:691
#48 0x00000000007a8eae in LockRelationOid (relid=31196760, lockmode=1)
at lmgr.c:107
...
#50 heap_open (relationId=31196760, lockmode=1) at heapam.c:1195
...
#56 AcceptInvalidationMessages () at inval.c:691
#57 0x00000000007a8eae in LockRelationOid (relid=31196760, lockmode=1)
at lmgr.c:107
...
#59 heap_open (relationId=31196760, lockmode=1) at heapam.c:1195
...
#65 AcceptInvalidationMessages () at inval.c:691
#66 0x00000000007a8eae in LockRelationOid (relid=31196760, lockmode=1)
at lmgr.c:107
...
#75 AcceptInvalidationMessages () at inval.c:691
#76 0x00000000007a8eae in LockRelationOid (relid=31196760, lockmode=1)
at lmgr.c:107
...
#78 heap_open (relationId=31196760, lockmode=1) at heapam.c:1195
...
#84 AcceptInvalidationMessages () at inval.c:691
#85 0x00000000007a8eae in LockRelationOid (relid=31196760, lockmode=1)
at lmgr.c:107
==================================================================

The other thing is that there's a fair amount of pointers resolved like
this:

#157 0x00000000007d00ed in PostgresMain (argc=31196760, argv=0x1,
dbname=0x1c86a40 "regression", username=0x3 <Address 0x3 out of bounds>)
at postgres.c:4010

while in the next backtrace from the same backend it looks like this:

#117 0x00000000007d00ed in PostgresMain (argc=12, argv=0xf32ae0,
dbname=0x1c86a40 "regression", username=0xffffffffffffffff <Address
0xffffffffffffffff out of bounds>) at postgres.c:4010

That seems a bit strange - I would not expect this to change once the
backend is started, but maybe I'm missing something.

And finally there's a fair amount of pointers resolved as 0x7f71e74fd770
which seems odd too.



--
Tomas Vondra                http://www.2ndQuadrant.com/
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services