Thread: Latent cache flush hazard in RelationInitIndexAccessInfo

Latent cache flush hazard in RelationInitIndexAccessInfo

From
Tom Lane
Date:
Some stuff Salesforce has been doing turned up the fact that
RelationInitIndexAccessInfo is not safe against a relcache flush on
its target index.  Specifically, the failure goes like this:

* RelationInitIndexAccessInfo loads up relation->rd_indextuple.

* Within one of the subsequent catalog fetches, eg the pg_am read, a
relcache flush occurs (this can be forced with CLOBBER_CACHE_ALWAYS).

* RelationClearRelation keeps the relcache entry, since it has
positive refcount, but doesn't see a reason to keep the index-related
fields.

* RelationInitIndexAccessInfo dumps core when it tries to fetch
fields out of relation->rd_indextuple, which has been reset to NULL.

Now, it turns out this scenario cannot happen in the standard Postgres
code as it stands today (if it could, we'd have seen it in the buildfarm's
CLOBBER_CACHE_ALWAYS testing).  The reason for that is that
RelationInitIndexAccessInfo is actually only called on newly-minted
Relation structs that are not yet installed into the relcache hash table;
hence RelationCacheInvalidate can't find them to zap them.  It can happen
in Salesforce's modified code though, and it's not hard to imagine that
future changes in the community version might expose the same hazard.
(For one reason, the current technique implies that an error halfway
through relcache load results in leaking the Relation struct and
subsidiary data; we might eventually decide that's not acceptable.)

We could just ignore the problem until/unless it becomes real for us,
but now that I've figured it out I'm inclined to do something before
the knowledge disappears again.

The specific reason there's a problem is that there's a disconnect between
RelationClearRelation's test for whether a relcache entry has valid index
info (it checks whether rd_indexcxt != NULL) and the order of operations
in RelationInitIndexAccessInfo (creating the index context is not the
first thing it does).  Given that if RelationClearRelation thinks the
info is valid, what it does is call RelationReloadIndexInfo which needs
rd_index to be valid, I'm thinking the best fix is to leave the order of
operations in RelationInitIndexAccessInfo alone and instead make the
"relcache entry has index info" check be "rd_index != NULL".  There might
need to be some additional work to keep RelationReloadIndexInfo from
setting rd_isvalid = true too soon.

Thoughts?
        regards, tom lane



Re: Latent cache flush hazard in RelationInitIndexAccessInfo

From
Robert Haas
Date:
On Wed, Sep 9, 2015 at 3:00 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Some stuff Salesforce has been doing turned up the fact that
> RelationInitIndexAccessInfo is not safe against a relcache flush on
> its target index.  Specifically, the failure goes like this:
>
> * RelationInitIndexAccessInfo loads up relation->rd_indextuple.
>
> * Within one of the subsequent catalog fetches, eg the pg_am read, a
> relcache flush occurs (this can be forced with CLOBBER_CACHE_ALWAYS).
>
> * RelationClearRelation keeps the relcache entry, since it has
> positive refcount, but doesn't see a reason to keep the index-related
> fields.
>
> * RelationInitIndexAccessInfo dumps core when it tries to fetch
> fields out of relation->rd_indextuple, which has been reset to NULL.
>
> Now, it turns out this scenario cannot happen in the standard Postgres
> code as it stands today (if it could, we'd have seen it in the buildfarm's
> CLOBBER_CACHE_ALWAYS testing).  The reason for that is that
> RelationInitIndexAccessInfo is actually only called on newly-minted
> Relation structs that are not yet installed into the relcache hash table;
> hence RelationCacheInvalidate can't find them to zap them.  It can happen
> in Salesforce's modified code though, and it's not hard to imagine that
> future changes in the community version might expose the same hazard.
> (For one reason, the current technique implies that an error halfway
> through relcache load results in leaking the Relation struct and
> subsidiary data; we might eventually decide that's not acceptable.)
>
> We could just ignore the problem until/unless it becomes real for us,
> but now that I've figured it out I'm inclined to do something before
> the knowledge disappears again.
>
> The specific reason there's a problem is that there's a disconnect between
> RelationClearRelation's test for whether a relcache entry has valid index
> info (it checks whether rd_indexcxt != NULL) and the order of operations
> in RelationInitIndexAccessInfo (creating the index context is not the
> first thing it does).  Given that if RelationClearRelation thinks the
> info is valid, what it does is call RelationReloadIndexInfo which needs
> rd_index to be valid, I'm thinking the best fix is to leave the order of
> operations in RelationInitIndexAccessInfo alone and instead make the
> "relcache entry has index info" check be "rd_index != NULL".  There might
> need to be some additional work to keep RelationReloadIndexInfo from
> setting rd_isvalid = true too soon.
>
> Thoughts?

Doesn't seem like a bad thing to clean up.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company



Re: Latent cache flush hazard in RelationInitIndexAccessInfo

From
Michael Paquier
Date:
On Fri, Sep 11, 2015 at 11:29 AM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Wed, Sep 9, 2015 at 3:00 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> Some stuff Salesforce has been doing turned up the fact that
>> RelationInitIndexAccessInfo is not safe against a relcache flush on
>> its target index.  Specifically, the failure goes like this:
>>
>> * RelationInitIndexAccessInfo loads up relation->rd_indextuple.
>>
>> * Within one of the subsequent catalog fetches, eg the pg_am read, a
>> relcache flush occurs (this can be forced with CLOBBER_CACHE_ALWAYS).
>>
>> * RelationClearRelation keeps the relcache entry, since it has
>> positive refcount, but doesn't see a reason to keep the index-related
>> fields.
>>
>> * RelationInitIndexAccessInfo dumps core when it tries to fetch
>> fields out of relation->rd_indextuple, which has been reset to NULL.
>>
>> Now, it turns out this scenario cannot happen in the standard Postgres
>> code as it stands today (if it could, we'd have seen it in the buildfarm's
>> CLOBBER_CACHE_ALWAYS testing).  The reason for that is that
>> RelationInitIndexAccessInfo is actually only called on newly-minted
>> Relation structs that are not yet installed into the relcache hash table;
>> hence RelationCacheInvalidate can't find them to zap them.  It can happen
>> in Salesforce's modified code though, and it's not hard to imagine that
>> future changes in the community version might expose the same hazard.
>> (For one reason, the current technique implies that an error halfway
>> through relcache load results in leaking the Relation struct and
>> subsidiary data; we might eventually decide that's not acceptable.)
>>
>> We could just ignore the problem until/unless it becomes real for us,
>> but now that I've figured it out I'm inclined to do something before
>> the knowledge disappears again.
>>
>> The specific reason there's a problem is that there's a disconnect between
>> RelationClearRelation's test for whether a relcache entry has valid index
>> info (it checks whether rd_indexcxt != NULL) and the order of operations
>> in RelationInitIndexAccessInfo (creating the index context is not the
>> first thing it does).  Given that if RelationClearRelation thinks the
>> info is valid, what it does is call RelationReloadIndexInfo which needs
>> rd_index to be valid, I'm thinking the best fix is to leave the order of
>> operations in RelationInitIndexAccessInfo alone and instead make the
>> "relcache entry has index info" check be "rd_index != NULL".  There might
>> need to be some additional work to keep RelationReloadIndexInfo from
>> setting rd_isvalid = true too soon.
>>
>> Thoughts?
>
> Doesn't seem like a bad thing to clean up.

Doesn't sound bad to me either to reinforce things on HEAD... It's
been months since this has been posted, are you working on a patch?
-- 
Michael



Re: Latent cache flush hazard in RelationInitIndexAccessInfo

From
Tom Lane
Date:
Michael Paquier <michael.paquier@gmail.com> writes:
> On Fri, Sep 11, 2015 at 11:29 AM, Robert Haas <robertmhaas@gmail.com> wrote:
>> On Wed, Sep 9, 2015 at 3:00 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> The specific reason there's a problem is that there's a disconnect between
>>> RelationClearRelation's test for whether a relcache entry has valid index
>>> info (it checks whether rd_indexcxt != NULL) and the order of operations
>>> in RelationInitIndexAccessInfo (creating the index context is not the
>>> first thing it does).  Given that if RelationClearRelation thinks the
>>> info is valid, what it does is call RelationReloadIndexInfo which needs
>>> rd_index to be valid, I'm thinking the best fix is to leave the order of
>>> operations in RelationInitIndexAccessInfo alone and instead make the
>>> "relcache entry has index info" check be "rd_index != NULL".  There might
>>> need to be some additional work to keep RelationReloadIndexInfo from
>>> setting rd_isvalid = true too soon.

>> Doesn't seem like a bad thing to clean up.

> Doesn't sound bad to me either to reinforce things on HEAD... It's
> been months since this has been posted, are you working on a patch?

This dropped off my radar screen somehow.  (I have a feeling I fixed it in
Salesforce's code and forgot to transfer the fix to community.)

Given that this is just latent for the community's purposes, it seems
probably inappropriate to fix it post-beta1.  I'm inclined to let it
slide till 9.7^H^H^H10 development opens.  Thoughts?
        regards, tom lane