Thread: Cache invalidation bug in RelationGetIndexAttrBitmap()

Cache invalidation bug in RelationGetIndexAttrBitmap()

From
Andres Freund
Date:
On 2014-05-14 15:17:39 +0200, Andres Freund wrote:
> On 2014-05-14 15:08:08 +0200, Tomas Vondra wrote:
> > Apparently there's something wrong with 'test-decoding-check':
>
> Man. I shouldn't have asked... My code. There's some output in there
> that's probably triggered by the extraordinarily long runtimes, but
> there's definitely something else wrong.
> My gut feeling says it's in RelationGetIndexList().

Nearly right. It's in RelationGetIndexAttrBitmap(). Fix attached.

Tomas, thanks for that. I've never (and probably will never) run
CLOBBER_CACHE_RECURSIVELY during development. Having a machine do that
regularly is really helpful. How long does a single testrun take? It
takes hundreds of seconds here to do a single UPDATE?

There were some more differences but those are all harmless and caused
by the extraordinarily long runtime (autovacuums). I think we need to
add a feature to test_decoding to suppress displaying transactions
without changes. Ick.

Greetings,

Andres Freund

--
 Andres Freund                       http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services

Attachment

Re: Cache invalidation bug in RelationGetIndexAttrBitmap()

From
Tom Lane
Date:
Andres Freund <andres@2ndquadrant.com> writes:
> On 2014-05-14 15:17:39 +0200, Andres Freund wrote:
>> My gut feeling says it's in RelationGetIndexList().

> Nearly right. It's in RelationGetIndexAttrBitmap(). Fix attached.

TBH, I don't believe this patch at all.  Where exactly is rd_replidindex
reset?  If it's supposed to have similar lifespan to, say, rd_oidindex,
why isn't it being handled like rd_oidindex?  And why does the header
comment for RelationGetIndexList make no mention of this new side-effect?
Somebody did a seriously poor job of adding this functionality to
relcache.
        regards, tom lane



Re: Cache invalidation bug in RelationGetIndexAttrBitmap()

From
Andres Freund
Date:
On 2014-05-14 12:15:27 -0400, Tom Lane wrote:
> Andres Freund <andres@2ndquadrant.com> writes:
> > On 2014-05-14 15:17:39 +0200, Andres Freund wrote:
> >> My gut feeling says it's in RelationGetIndexList().
> 
> > Nearly right. It's in RelationGetIndexAttrBitmap(). Fix attached.
> 
> TBH, I don't believe this patch at all.  Where exactly is rd_replidindex
> reset?  If it's supposed to have similar lifespan to, say, rd_oidindex,
> why isn't it being handled like rd_oidindex? 

I don't see why it'd have a different lifespan than rd_oidindex at all?
If the latter were used inside the loop it'd be a bug as well.

> And why does the header
> comment for RelationGetIndexList make no mention of this new side-effect?
> Somebody did a seriously poor job of adding this functionality to
> relcache.

It's not like it's not documented: There's a comment about it in struct
RelationData. I must have missed that rd_oidindex has a comment abou
it's lifetime over RelationGetIndexList().
I personally actually prefer the struct as the location for the
lifetime. I can send a patch to commonalize the location in either place
with the other location pointing to it.

Greetings,

Andres Freund

-- Andres Freund                       http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training &
Services



Re: Cache invalidation bug in RelationGetIndexAttrBitmap()

From
Tom Lane
Date:
Andres Freund <andres@2ndquadrant.com> writes:
> On 2014-05-14 12:15:27 -0400, Tom Lane wrote:
>> And why does the header
>> comment for RelationGetIndexList make no mention of this new side-effect?
>> Somebody did a seriously poor job of adding this functionality to
>> relcache.

> It's not like it's not documented: There's a comment about it in struct
> RelationData. I must have missed that rd_oidindex has a comment abou
> it's lifetime over RelationGetIndexList().

If rd_replidindex is being managed like rd_oidindex, then it should be
managed just like rd_oidindex, including getting reset in all the places
rd_oidindex is.  This might be just a matter of cleanliness but I think
it's important for readability and debuggability.

I notice also that rd_keyattr and rd_idattr have been implemented with
bad copies of the logic for rd_indexattr.  This is at least leading
to a permanent memory leak in CacheMemoryContext during every relcache
flush, and maybe worse things.  The bugs for rd_keyattr appear to predate
your patch though.

Working on a patch for this now.  One thing I'm wondering about is
RelationSetIndexList.  It's probably okay for it not to touch rd_keyattr
and rd_idattr, but I'm not too clear on what the use cases for those
attnum sets are.
        regards, tom lane



Re: Cache invalidation bug in RelationGetIndexAttrBitmap()

From
Andres Freund
Date:
On 2014-05-14 13:32:43 -0400, Tom Lane wrote:
> Andres Freund <andres@2ndquadrant.com> writes:
> > On 2014-05-14 12:15:27 -0400, Tom Lane wrote:
> >> And why does the header
> >> comment for RelationGetIndexList make no mention of this new side-effect?
> >> Somebody did a seriously poor job of adding this functionality to
> >> relcache.
> 
> > It's not like it's not documented: There's a comment about it in struct
> > RelationData. I must have missed that rd_oidindex has a comment abou
> > it's lifetime over RelationGetIndexList().
> 
> If rd_replidindex is being managed like rd_oidindex, then it should be
> managed just like rd_oidindex, including getting reset in all the places
> rd_oidindex is.  This might be just a matter of cleanliness but I think
> it's important for readability and debuggability.

Agreed. I am not against resetting it. I think I might not have been
aware of rd_oidindex when writing that code...

> I notice also that rd_keyattr and rd_idattr have been implemented with
> bad copies of the logic for rd_indexattr.  This is at least leading
> to a permanent memory leak in CacheMemoryContext during every relcache
> flush, and maybe worse things.  The bugs for rd_keyattr appear to predate
> your patch though.

Hm. Yes, the bitmapsets should be freed. I guess I copied the logic for
keyattr and didn't find any relevant places that touch it. rd_keyattr
should go back to 9.3.

> Working on a patch for this now.  One thing I'm wondering about is
> RelationSetIndexList.  It's probably okay for it not to touch rd_keyattr
> and rd_idattr, but I'm not too clear on what the use cases for those
> attnum sets are.

rd_keyattr is used to determine whether a heap_update() changed any keys
that could be referenced by a foreign key. That's then used to determine
which locklevel an update requires.
rd_idattr does something similar. It decides whether the configured
REPLICA IDENTITY key has changed so whether to log the old primary key
for logical decoding or not.

I can't see why either would need to care about forced index lists right
now, but will do a scan of the sources to see if I am wrong.

Greetings,

Andres Freund

-- Andres Freund                       http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training &
Services



Re: Cache invalidation bug in RelationGetIndexAttrBitmap()

From
Tomas Vondra
Date:
On 14.5.2014 17:52, Andres Freund wrote:
> On 2014-05-14 15:17:39 +0200, Andres Freund wrote:
>> On 2014-05-14 15:08:08 +0200, Tomas Vondra wrote:
>>> Apparently there's something wrong with 'test-decoding-check':
>>
>> Man. I shouldn't have asked... My code. There's some output in there
>> that's probably triggered by the extraordinarily long runtimes, but
>> there's definitely something else wrong.
>> My gut feeling says it's in RelationGetIndexList().
> 
> Nearly right. It's in RelationGetIndexAttrBitmap(). Fix attached.
> 
> Tomas, thanks for that. I've never (and probably will never) run
> CLOBBER_CACHE_RECURSIVELY during development. Having a machine do that
> regularly is really helpful. How long does a single testrun take? It
> takes hundreds of seconds here to do a single UPDATE?

Don't know yet, as it fails at the beginning. But I suppose it will be
tens or possibly hundreds of hours. For example these are the logs from
regular build (no clobber etc.)
   May 14 19:00 SCM-checkout.log   May 14 19:00 githead.log   May 14 19:00 configure.log   May 14 19:00 config.log
May14 19:05 make.log   May 14 19:05 check.log   May 14 19:06 make-contrib.log   May 14 19:06 make-install.log   May 14
19:06install-contrib.log   May 14 19:07 check-pg_upgrade.log   May 14 19:08 test-decoding-check.log
 

while these are the logs from recursive clobber:
   May 14 00:19 SCM-checkout.log   May 14 00:20 configure.log   May 14 00:20 config.log   May 14 00:26 make.log   May
1403:12 check.log   May 14 03:13 make-contrib.log   May 14 03:13 make-install.log   May 14 03:13 install-contrib.log
May14 08:25 check-pg_upgrade.log   May 14 09:07 test-decoding-check.log   May 14 09:07 web-txn.data
 


So with the regular build, it took <1 minute to do 'make check' and ~1
minute to test pg_upgrade, with recursive clobber it takes ~3 hours and
~5 hours. That's a factor of ~300, although it's a very rough estimate.

Without clobber the whole run (for a "C" locale) takes ~10 minutes, so
my estimate is ~50 hours for the recursive one. But I wouldn't be
surprised by 100 hours.

> 
> There were some more differences but those are all harmless and caused
> by the extraordinarily long runtime (autovacuums). I think we need to
> add a feature to test_decoding to suppress displaying transactions
> without changes. Ick.
> 

I expect to hit more timing-related issues with the recursive clobber
tests - not necessarily in the code/tests itself, but I guess the
buildfarm tooling doesn't really expect runs that long.

regards
Tomas



Re: Cache invalidation bug in RelationGetIndexAttrBitmap()

From
Andres Freund
Date:
Hi,

On 2014-05-14 21:04:41 +0200, Tomas Vondra wrote:
> On 14.5.2014 17:52, Andres Freund wrote:
> > On 2014-05-14 15:17:39 +0200, Andres Freund wrote:
> >> On 2014-05-14 15:08:08 +0200, Tomas Vondra wrote:
> >>> Apparently there's something wrong with 'test-decoding-check':
> >>
> >> Man. I shouldn't have asked... My code. There's some output in there
> >> that's probably triggered by the extraordinarily long runtimes, but
> >> there's definitely something else wrong.
> >> My gut feeling says it's in RelationGetIndexList().
> > 
> > Nearly right. It's in RelationGetIndexAttrBitmap(). Fix attached.
> > 
> > Tomas, thanks for that. I've never (and probably will never) run
> > CLOBBER_CACHE_RECURSIVELY during development. Having a machine do that
> > regularly is really helpful. How long does a single testrun take? It
> > takes hundreds of seconds here to do a single UPDATE?
> 
> Don't know yet, as it fails at the beginning.

test decoding is at the beginning? That's somewhat odd?

> But I suppose it will be
> tens or possibly hundreds of hours. For example these are the logs from
> regular build (no clobber etc.)

>     May 14 19:00 SCM-checkout.log
>     May 14 19:00 githead.log
>     May 14 19:00 configure.log
>     May 14 19:00 config.log
>     May 14 19:05 make.log
>     May 14 19:05 check.log
>     May 14 19:06 make-contrib.log
>     May 14 19:06 make-install.log
>     May 14 19:06 install-contrib.log
>     May 14 19:07 check-pg_upgrade.log
>     May 14 19:08 test-decoding-check.log
> 
> while these are the logs from recursive clobber:
> 
>     May 14 00:19 SCM-checkout.log
>     May 14 00:20 configure.log
>     May 14 00:20 config.log
>     May 14 00:26 make.log
>     May 14 03:12 check.log
>     May 14 03:13 make-contrib.log
>     May 14 03:13 make-install.log
>     May 14 03:13 install-contrib.log
>     May 14 08:25 check-pg_upgrade.log
>     May 14 09:07 test-decoding-check.log
>     May 14 09:07 web-txn.data
> 
> 
> So with the regular build, it took <1 minute to do 'make check' and ~1
> minute to test pg_upgrade, with recursive clobber it takes ~3 hours and
> ~5 hours. That's a factor of ~300, although it's a very rough
> estimate.

I seriously doubt that's recursive clobber. That should take *way* much
longer. And indeed you have:

> -DCLOBBER_CACHE_ALWAYS -DCLOBBER_FREED_MEMORY -DMEMORY_CONTEXT_CHECKING
> -DRANDOMIZE_ALLOCATED_MEMORY -DCLOBBER_CACHE_RECURSIVELY
> 
> it does not happen with
> 
> CPPFLAGS => '-DCLOBBER_CACHE_ALWAYS -DCLOBBER_FREED_MEMORY
> -DMEMORY_CONTEXT_CHECKING -DRANDOMIZE_ALLOCATED_MEMORY',

#if defined(CLOBBER_CACHE_ALWAYS){    static bool in_recursion = false;
    if (!in_recursion)    {        in_recursion = true;        InvalidateSystemCaches();        in_recursion = false;
}}
 
#elif defined(CLOBBER_CACHE_RECURSIVELY)InvalidateSystemCaches();
#endif

i.e. you can't specifiy -DCLOBBER_CACHE_ALWAYS and
-DCLOBBER_CACHE_RECURSIVELY together. The former will take precedence.

> Without clobber the whole run (for a "C" locale) takes ~10 minutes, so
> my estimate is ~50 hours for the recursive one. But I wouldn't be
> surprised by 100 hours.

I'm afraid it's more in the year range from what i've seen. I.e. not
practical.

Greetings,

Andres Freund

-- Andres Freund                       http://www.2ndQuadrant.com/PostgreSQL Development, 24x7 Support, Training &
Services



Re: Cache invalidation bug in RelationGetIndexAttrBitmap()

From
Tomas Vondra
Date:
On 14.5.2014 22:29, Andres Freund wrote:
> Hi,
> 
> On 2014-05-14 21:04:41 +0200, Tomas Vondra wrote:
>> On 14.5.2014 17:52, Andres Freund wrote:
>>> On 2014-05-14 15:17:39 +0200, Andres Freund wrote:
>>>> On 2014-05-14 15:08:08 +0200, Tomas Vondra wrote:
>>>>> Apparently there's something wrong with 'test-decoding-check':
>>>>
>>>> Man. I shouldn't have asked... My code. There's some output in there
>>>> that's probably triggered by the extraordinarily long runtimes, but
>>>> there's definitely something else wrong.
>>>> My gut feeling says it's in RelationGetIndexList().
>>>
>>> Nearly right. It's in RelationGetIndexAttrBitmap(). Fix attached.
>>>
>>> Tomas, thanks for that. I've never (and probably will never) run
>>> CLOBBER_CACHE_RECURSIVELY during development. Having a machine do that
>>> regularly is really helpful. How long does a single testrun take? It
>>> takes hundreds of seconds here to do a single UPDATE?
>>
>> Don't know yet, as it fails at the beginning.
> 
> test decoding is at the beginning? That's somewhat odd?

Judging from the timestamps of log files, it seems to be running after
pg_upgrade checks. Or maybe I'm reading that wrong.

>> But I suppose it will be
>> tens or possibly hundreds of hours. For example these are the logs from
>> regular build (no clobber etc.)
> 
>>     May 14 19:00 SCM-checkout.log
>>     May 14 19:00 githead.log
>>     May 14 19:00 configure.log
>>     May 14 19:00 config.log
>>     May 14 19:05 make.log
>>     May 14 19:05 check.log
>>     May 14 19:06 make-contrib.log
>>     May 14 19:06 make-install.log
>>     May 14 19:06 install-contrib.log
>>     May 14 19:07 check-pg_upgrade.log
>>     May 14 19:08 test-decoding-check.log
>>
>> while these are the logs from recursive clobber:
>>
>>     May 14 00:19 SCM-checkout.log
>>     May 14 00:20 configure.log
>>     May 14 00:20 config.log
>>     May 14 00:26 make.log
>>     May 14 03:12 check.log
>>     May 14 03:13 make-contrib.log
>>     May 14 03:13 make-install.log
>>     May 14 03:13 install-contrib.log
>>     May 14 08:25 check-pg_upgrade.log
>>     May 14 09:07 test-decoding-check.log
>>     May 14 09:07 web-txn.data
>>
>>
>> So with the regular build, it took <1 minute to do 'make check' and ~1
>> minute to test pg_upgrade, with recursive clobber it takes ~3 hours and
>> ~5 hours. That's a factor of ~300, although it's a very rough
>> estimate.
> 
> I seriously doubt that's recursive clobber. That should take *way* much
> longer. And indeed you have:
> 
>> -DCLOBBER_CACHE_ALWAYS -DCLOBBER_FREED_MEMORY -DMEMORY_CONTEXT_CHECKING
>> -DRANDOMIZE_ALLOCATED_MEMORY -DCLOBBER_CACHE_RECURSIVELY
>>
>> it does not happen with
>>
>> CPPFLAGS => '-DCLOBBER_CACHE_ALWAYS -DCLOBBER_FREED_MEMORY
>> -DMEMORY_CONTEXT_CHECKING -DRANDOMIZE_ALLOCATED_MEMORY',
> 
> #if defined(CLOBBER_CACHE_ALWAYS)
>     {
>         static bool in_recursion = false;
> 
>         if (!in_recursion)
>         {
>             in_recursion = true;
>             InvalidateSystemCaches();
>             in_recursion = false;
>         }
>     }
> #elif defined(CLOBBER_CACHE_RECURSIVELY)
>     InvalidateSystemCaches();
> #endif
> 
> i.e. you can't specifiy -DCLOBBER_CACHE_ALWAYS and
> -DCLOBBER_CACHE_RECURSIVELY together. The former will take precedence.

Oh, I see :-/

>> Without clobber the whole run (for a "C" locale) takes ~10 minutes, so
>> my estimate is ~50 hours for the recursive one. But I wouldn't be
>> surprised by 100 hours.
> 
> I'm afraid it's more in the year range from what i've seen. I.e. not
> practical.

Yeah, that wouldn't be very practical. I'll try to run it though and if
it'd run more than a few days, I'll switch it to CLOBBER_CACHE_ALWAYS.

Tomas



Re: Cache invalidation bug in RelationGetIndexAttrBitmap()

From
Tom Lane
Date:
Tomas Vondra <tv@fuzzy.cz> writes:
> On 14.5.2014 22:29, Andres Freund wrote:
>> I'm afraid it's more in the year range from what i've seen. I.e. not
>> practical.

> Yeah, that wouldn't be very practical. I'll try to run it though and if
> it'd run more than a few days, I'll switch it to CLOBBER_CACHE_ALWAYS.

I think it might be interesting to try running CLOBBER_CACHE_RECURSIVELY
on some subset of the regression tests.  But I'm not sure which subset,
nor whether the buildfarm scripts could readily be coaxed to do that.
Let's see how far you get with a full run before you get bored ...
        regards, tom lane