Thread: [HACKERS] check failure with -DRELCACHE_FORCE_RELEASE -DCLOBBER_FREED_MEMORY

From: Andrew Dunstan
I have been setting up a buildfarm member with -DRELCACHE_FORCE_RELEASE
-DCLOBBER_FREED_MEMORY, settings which Alvaro suggested to me. I got core
dumps with these stack traces. The platform is Amazon Linux.


================== stack trace:
pgsql.build/src/test/regress/tmp_check/data/core.4149 ==================
[New LWP 4149]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by `postgres: ec2-user regression [local] VACUUM'.
Program terminated with signal 11, Segmentation fault.
#0  0x00000000005916bf in rebuild_relation (verbose=0 '\000',
indexOid=0, OldHeap=0x1dd7ae0) at cluster.c:576
576             OIDNewHeap = make_new_heap(tableOid, tableSpace,
#0  0x00000000005916bf in rebuild_relation (verbose=0 '\000',
indexOid=0, OldHeap=0x1dd7ae0) at cluster.c:576
#1  cluster_rel (tableOid=tableOid@entry=28441,
indexOid=indexOid@entry=0, recheck=recheck@entry=0 '\000',
verbose=verbose@entry=0 '\000') at cluster.c:404
#2  0x00000000005ef228 in vacuum_rel (relid=relid@entry=28441,
relation=relation@entry=0x1dab408, options=options@entry=17,
params=params@entry=0x7ffdd87d72a0) at vacuum.c:1441
#3  0x00000000005f0542 in vacuum (options=17, relation=0x1dab408,
relid=relid@entry=0, params=params@entry=0x7ffdd87d72a0, va_cols=0x0,
bstrategy=<optimized out>, bstrategy@entry=0x0, isTopLevel=1 '\001') at
vacuum.c:304
#4  0x00000000005f093e in ExecVacuum (vacstmt=vacstmt@entry=0x1dab460,
isTopLevel=isTopLevel@entry=1 '\001') at vacuum.c:122
#5  0x0000000000728925 in standard_ProcessUtility (pstmt=0x1dab7c0,
queryString=0x1daa9a8 "VACUUM FULL concur_heap;",
context=PROCESS_UTILITY_TOPLEVEL, params=0x0, dest=0x1dab8b8,
completionTag=0x7ffdd87d76a0 "") at utility.c:670
#6  0x0000000000725d82 in PortalRunUtility (portal=0x1d48a68,
pstmt=0x1dab7c0, isTopLevel=<optimized out>, setHoldSnapshot=<optimized
out>, dest=<optimized out>, completionTag=0x7ffdd87d76a0 "") at
pquery.c:1165
#7  0x0000000000726819 in PortalRunMulti (portal=portal@entry=0x1d48a68,
isTopLevel=isTopLevel@entry=1 '\001',
setHoldSnapshot=setHoldSnapshot@entry=0 '\000',
dest=dest@entry=0x1dab8b8, altdest=altdest@entry=0x1dab8b8,
completionTag=completionTag@entry=0x7ffdd87d76a0 "") at pquery.c:1315
#8  0x0000000000727488 in PortalRun (portal=portal@entry=0x1d48a68,
count=count@entry=9223372036854775807, isTopLevel=isTopLevel@entry=1
'\001', dest=dest@entry=0x1dab8b8, altdest=altdest@entry=0x1dab8b8,
completionTag=completionTag@entry=0x7ffdd87d76a0 "") at pquery.c:788
#9  0x000000000072500a in exec_simple_query (query_string=0x1daa9a8
"VACUUM FULL concur_heap;") at postgres.c:1101
#10 PostgresMain (argc=<optimized out>, argv=argv@entry=0x1d561e0,
dbname=0x1d55f30 "regression", username=<optimized out>) at postgres.c:4066
#11 0x00000000004765b4 in BackendRun (port=0x1d51420) at postmaster.c:4317
#12 BackendStartup (port=0x1d51420) at postmaster.c:3989
#13 ServerLoop () at postmaster.c:1729
#14 0x00000000006b9a0a in PostmasterMain (argc=argc@entry=8,
argv=argv@entry=0x1d2a260) at postmaster.c:1337
#15 0x00000000004775c2 in main (argc=8, argv=0x1d2a260) at main.c:228


================== stack trace:
pgsql.build/src/test/regress/tmp_check/data/core.4180 ==================
[New LWP 4180]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Core was generated by `postgres: ec2-user regression [local] VACUUM'.
Program terminated with signal 11, Segmentation fault.
#0  0x00000000005916bf in rebuild_relation (verbose=0 '\000',
indexOid=0, OldHeap=0x7f460d159930) at cluster.c:576
576             OIDNewHeap = make_new_heap(tableOid, tableSpace,
#0  0x00000000005916bf in rebuild_relation (verbose=0 '\000',
indexOid=0, OldHeap=0x7f460d159930) at cluster.c:576
#1  cluster_rel (tableOid=tableOid@entry=28479,
indexOid=indexOid@entry=0, recheck=recheck@entry=0 '\000',
verbose=verbose@entry=0 '\000') at cluster.c:404
#2  0x00000000005ef228 in vacuum_rel (relid=relid@entry=28479,
relation=relation@entry=0x1dab400, options=options@entry=17,
params=params@entry=0x7ffdd87d72a0) at vacuum.c:1441
#3  0x00000000005f0542 in vacuum (options=17, relation=0x1dab400,
relid=relid@entry=0, params=params@entry=0x7ffdd87d72a0, va_cols=0x0,
bstrategy=<optimized out>, bstrategy@entry=0x0, isTopLevel=1 '\001') at
vacuum.c:304
#4  0x00000000005f093e in ExecVacuum (vacstmt=vacstmt@entry=0x1dab458,
isTopLevel=isTopLevel@entry=1 '\001') at vacuum.c:122
#5  0x0000000000728925 in standard_ProcessUtility (pstmt=0x1dab7b8,
queryString=0x1daa9a8 "VACUUM FULL vactst;",
context=PROCESS_UTILITY_TOPLEVEL, params=0x0, dest=0x1dab8b0,
completionTag=0x7ffdd87d76a0 "") at utility.c:670
#6  0x0000000000725d82 in PortalRunUtility (portal=0x1d48a68,
pstmt=0x1dab7b8, isTopLevel=<optimized out>, setHoldSnapshot=<optimized
out>, dest=<optimized out>, completionTag=0x7ffdd87d76a0 "") at
pquery.c:1165
#7  0x0000000000726819 in PortalRunMulti (portal=portal@entry=0x1d48a68,
isTopLevel=isTopLevel@entry=1 '\001',
setHoldSnapshot=setHoldSnapshot@entry=0 '\000',
dest=dest@entry=0x1dab8b0, altdest=altdest@entry=0x1dab8b0,
completionTag=completionTag@entry=0x7ffdd87d76a0 "") at pquery.c:1315
#8  0x0000000000727488 in PortalRun (portal=portal@entry=0x1d48a68,
count=count@entry=9223372036854775807, isTopLevel=isTopLevel@entry=1
'\001', dest=dest@entry=0x1dab8b0, altdest=altdest@entry=0x1dab8b0,
completionTag=completionTag@entry=0x7ffdd87d76a0 "") at pquery.c:788
#9  0x000000000072500a in exec_simple_query (query_string=0x1daa9a8
"VACUUM FULL vactst;") at postgres.c:1101
#10 PostgresMain (argc=<optimized out>, argv=argv@entry=0x1d561e0,
dbname=0x1d55f30 "regression", username=<optimized out>) at postgres.c:4066
#11 0x00000000004765b4 in BackendRun (port=0x1d51420) at postmaster.c:4317
#12 BackendStartup (port=0x1d51420) at postmaster.c:3989
#13 ServerLoop () at postmaster.c:1729
#14 0x00000000006b9a0a in PostmasterMain (argc=argc@entry=8,
argv=argv@entry=0x1d2a260) at postmaster.c:1337
#15 0x00000000004775c2 in main (argc=8, argv=0x1d2a260) at main.c:228



cheers


andrew


-- 
Andrew Dunstan                https://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services




[HACKERS] Re: check failure with -DRELCACHE_FORCE_RELEASE -DCLOBBER_FREED_MEMORY

From: Andrew Dunstan

On 03/03/2017 02:24 PM, Andrew Dunstan wrote:
> I have been setting up a buildfarm member with -DRELCACHE_FORCE_RELEASE
> -DCLOBBER_FREED_MEMORY, settings which Alvaro suggested to me. I got core
> dumps with these stack traces. The platform is Amazon Linux.
>


I have replicated this on a couple of other platforms (Fedora, FreeBSD)
and back to 9.5. The same failure doesn't happen with buildfarm runs on
earlier branches, although possibly they don't have the same set of tests.

cheers

andrew





Andrew Dunstan <andrew.dunstan@2ndquadrant.com> writes:
> On 03/03/2017 02:24 PM, Andrew Dunstan wrote:
>> I have been setting up a buildfarm member with -DRELCACHE_FORCE_RELEASE
>> -DCLOBBER_FREED_MEMORY, settings which Alvaro suggested to me. I got core
>> dumps with these stack traces. The platform is Amazon Linux.

> I have replicated this on a couple of other platforms (Fedora, FreeBSD)
> and back to 9.5. The same failure doesn't happen with buildfarm runs on
> earlier branches, although possibly they don't have the same set of tests.

Well, the problem in rebuild_relation() seems pretty blatant:

    /* Close relcache entry, but keep lock until transaction commit */
    heap_close(OldHeap, NoLock);

    /* Create the transient table that will receive the re-ordered data */
    OIDNewHeap = make_new_heap(tableOid, tableSpace,
                               OldHeap->rd_rel->relpersistence,
                               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
                               AccessExclusiveLock);

There are two such references after the heap_close.  I don't know that
those are the only bugs, but this reference is certainly the proximate
cause of the crash I'm seeing.

Will push a fix in a little bit.
        regards, tom lane



I wrote:
> Andrew Dunstan <andrew.dunstan@2ndquadrant.com> writes:
>> On 03/03/2017 02:24 PM, Andrew Dunstan wrote:
>>> I have been setting up a buildfarm member with -DRELCACHE_FORCE_RELEASE
>>> -DCLOBBER_FREED_MEMORY, settings which Alvaro suggested to me. I got core
>>> dumps with these stack traces. The platform is Amazon Linux.

>> I have replicated this on a couple of other platforms (Fedora, FreeBSD)
>> and back to 9.5. The same failure doesn't happen with buildfarm runs on
>> earlier branches, although possibly they don't have the same set of tests.

> Well, the problem in rebuild_relation() seems pretty blatant:

I fixed that, and the basic regression tests no longer crash outright with
these settings, but I do see half a dozen errors that all seem to be in
RLS-related tests.  They all look like something is trying to access an
already-closed relcache entry, much like the problem in
rebuild_relation().  But I have no time to look closer for the next
several days.  Stephen, I think this is your turf anyway.
        regards, tom lane



I wrote:
> I fixed that, and the basic regression tests no longer crash outright with
> these settings, but I do see half a dozen errors that all seem to be in
> RLS-related tests.

Those turned out to all be the same bug in DoCopy.  "make check-world"
passes for me now with -DRELCACHE_FORCE_RELEASE, but I've only tried
HEAD not the back branches.
        regards, tom lane




On 03/06/2017 05:14 PM, Tom Lane wrote:
> I wrote:
>> I fixed that, and the basic regression tests no longer crash outright with
>> these settings, but I do see half a dozen errors that all seem to be in
>> RLS-related tests.
> Those turned out to all be the same bug in DoCopy.  "make check-world"
> passes for me now with -DRELCACHE_FORCE_RELEASE, but I've only tried
> HEAD not the back branches.



I have tried the back branches. They are good.

cheers

andrew

