Thread: Excessive cost of OpClassCache flushes in CLOBBER_CACHE_ALWAYS mode

Excessive cost of OpClassCache flushes in CLOBBER_CACHE_ALWAYS mode

From
Tom Lane
Date:
Over in [1] it is demonstrated that with CLOBBER_CACHE_ALWAYS enabled,
initdb accounts for a full 50% of the runtime of "make check-world"
(well, actually of the buildfarm cycle, which is not quite exactly
that but close).  Since initdb certainly doesn't cost that much
normally, I wondered why it is so negatively affected by CCA.  Some
perf measuring led me to LookupOpclassInfo, and specifically this bit:

    /*
     * When testing for cache-flush hazards, we intentionally disable the
     * operator class cache and force reloading of the info on each call. This
     * is helpful because we want to test the case where a cache flush occurs
     * while we are loading the info, and it's very hard to provoke that if
     * this happens only once per opclass per backend.
     */
#ifdef CLOBBER_CACHE_ENABLED
    if (debug_invalidate_system_caches_always > 0)
        opcentry->valid = false;
#endif

Diking that out halves initdb's CCA runtime.  Turns out it also
roughly halves the runtime of the core regression tests under CCA,
so this doesn't explain why initdb seems so disproportionately
affected by CCA.

However, seeing that this single choice is accounting for half the
cost of CCA testing, we really have to ask whether it's worth that.
This code was added by my commit 03ffc4d6d of 2007-11-28, about which
I wrote:

    Improve test coverage of CLOBBER_CACHE_ALWAYS by having it also force
    reloading of operator class information on each use of LookupOpclassInfo.
    Had this been in place a year ago, it would have helped me find a bug
    in the then-new 'operator family' code.  Now that we have a build farm
    member testing CLOBBER_CACHE_ALWAYS on a regular basis, it seems worth
    expending a little bit of effort here.

I'm now a little dubious about my claim that this would have helped find
any bugs.  Invalidating a finished OpClassCache entry does not model any
real-world scenario, because as noted elsewhere in LookupOpclassInfo,
once such a cache entry is populated it is kept for the rest of the
session.  Also, the claim in the comment that we need this to test
a cache flush during load seems like nonsense: if we have
debug_invalidate_system_caches_always turned on, then we'll test
the effects of such flushes throughout the initial population of
a cache entry.  Doing it over and over again adds nothing.

So I'm now fairly strongly tempted to remove this code outright
(effectively reverting 03ffc4d6d).  Another possibility now that
we have debug_invalidate_system_caches_always is to increase the
threshold at which this happens, making it more like
CLOBBER_CACHE_RECURSIVE.

Thoughts?

            regards, tom lane

[1] https://www.postgresql.org/message-id/1289102.1625353189%40sss.pgh.pa.us



Re: Excessive cost of OpClassCache flushes in CLOBBER_CACHE_ALWAYS mode

From
Andrew Dunstan
Date:
On 7/4/21 3:57 PM, Tom Lane wrote:
> Over in [1] it is demonstrated that with CLOBBER_CACHE_ALWAYS enabled,
> initdb accounts for a full 50% of the runtime of "make check-world"
> (well, actually of the buildfarm cycle, which is not quite exactly
> that but close).  Since initdb certainly doesn't cost that much
> normally, I wondered why it is so negatively affected by CCA.  Some
> perf measuring led me to LookupOpclassInfo, and specifically this bit:
>
>     /*
>      * When testing for cache-flush hazards, we intentionally disable the
>      * operator class cache and force reloading of the info on each call. This
>      * is helpful because we want to test the case where a cache flush occurs
>      * while we are loading the info, and it's very hard to provoke that if
>      * this happens only once per opclass per backend.
>      */
> #ifdef CLOBBER_CACHE_ENABLED
>     if (debug_invalidate_system_caches_always > 0)
>         opcentry->valid = false;
> #endif
>
> Diking that out halves initdb's CCA runtime.  Turns out it also
> roughly halves the runtime of the core regression tests under CCA,
> so this doesn't explain why initdb seems so disproportionately
> affected by CCA.
>
> However, seeing that this single choice is accounting for half the
> cost of CCA testing, we really have to ask whether it's worth that.
> This code was added by my commit 03ffc4d6d of 2007-11-28, about which
> I wrote:
>
>     Improve test coverage of CLOBBER_CACHE_ALWAYS by having it also force
>     reloading of operator class information on each use of LookupOpclassInfo.
>     Had this been in place a year ago, it would have helped me find a bug
>     in the then-new 'operator family' code.  Now that we have a build farm
>     member testing CLOBBER_CACHE_ALWAYS on a regular basis, it seems worth
>     expending a little bit of effort here.
>
> I'm now a little dubious about my claim that this would have helped find
> any bugs.  Invalidating a finished OpClassCache entry does not model any
> real-world scenario, because as noted elsewhere in LookupOpclassInfo,
> once such a cache entry is populated it is kept for the rest of the
> session.  Also, the claim in the comment that we need this to test
> a cache flush during load seems like nonsense: if we have
> debug_invalidate_system_caches_always turned on, then we'll test
> the effects of such flushes throughout the initial population of
> a cache entry.  Doing it over and over again adds nothing.
>
> So I'm now fairly strongly tempted to remove this code outright
> (effectively reverting 03ffc4d6d).  Another possibility now that
> we have debug_invalidate_system_caches_always is to increase the
> threshold at which this happens, making it more like
> CLOBBER_CACHE_RECURSIVE.
>
> Thoughts?
>
>             


If we don't think it's adding anything useful just rip it out. We don't
generally keep code hanging around just on the off chance it might be
useful some day.


cheers


andrew


--
Andrew Dunstan
EDB: https://www.enterprisedb.com




Andrew Dunstan <andrew@dunslane.net> writes:
> On 7/4/21 3:57 PM, Tom Lane wrote:
>> I'm now a little dubious about my claim that this would have helped find
>> any bugs.  Invalidating a finished OpClassCache entry does not model any
>> real-world scenario, because as noted elsewhere in LookupOpclassInfo,
>> once such a cache entry is populated it is kept for the rest of the
>> session.  Also, the claim in the comment that we need this to test
>> a cache flush during load seems like nonsense: if we have
>> debug_invalidate_system_caches_always turned on, then we'll test
>> the effects of such flushes throughout the initial population of
>> a cache entry.  Doing it over and over again adds nothing.
>>
>> So I'm now fairly strongly tempted to remove this code outright
>> (effectively reverting 03ffc4d6d).  Another possibility now that
>> we have debug_invalidate_system_caches_always is to increase the
>> threshold at which this happens, making it more like
>> CLOBBER_CACHE_RECURSIVE.

> If we don't think it's adding anything useful just rip it out. We don't
> generally keep code hanging around just on the off chance it might be
> useful some day.

I did a little more research about the origins of 03ffc4d6d.
I found this thread:

https://www.postgresql.org/message-id/flat/2988.1196271930%40sss.pgh.pa.us#a7dd1ce92f5470ba5ad2e1be03d40802

That points back to commit 0b56be834, which fixed the oversight that
OpclassOidIndexId had not been marked as a critical system index,
and claims that making LookupOpclassInfo invalidate entries would
have helped identify that.

To test that, I tried removing OpclassOidIndexId from the list of
critical system indexes.  I found that (a) if I also remove the
invalidation code in LookupOpclassInfo, then things seem to proceed
normally even in CCA mode, but (b) with that code, we get into
infinite recursion, continually trying to rebuild that index's
relcache entry.  Once criticalRelcachesBuilt is set, LookupOpclassInfo
assumes it can use catalog index scans for all opclasses, even those
needed for indexes it's going to rely on, which is why the recursion
happens.  This might arguably be a bug in that CLOBBER_CACHE_ENABLED
code itself: maybe it needs to avoid invalidating the entries for
OID_BTREE_OPS_OID and INT2_BTREE_OPS_OID.  If I make it do so, the
infinite recursion disappears even without OpclassOidIndexId as a
critical index.

So on the one hand maybe that logic is too simplistic, but on the
other hand it might help identify missed critical indexes in future
too.  That puts it in the same category as the sorts of bugs that
"debug_invalidate_system_caches_always > 1" is meant to help with,
so now I'm content to change it that way.

            regards, tom lane