Thread: replication cleanup code incorrect way to use of HTAB HASH_REMOVE ?

replication cleanup code incorrect way to use of HTAB HASH_REMOVE ?

From

Peter Smith

Date:

20 March 2021, 07:24:20

Hi,

I found some dubious looking HTAB cleanup code for replication streams
(see file:worker.c, function:stream_cleanup_files).

viz.

----------
static void
stream_cleanup_files(Oid subid, TransactionId xid)
{
    char        path[MAXPGPATH];
    StreamXidHash *ent;

    /* Remove the xid entry from the stream xid hash */
    ent = (StreamXidHash *) hash_search(xidhash,
                                        (void *) &xid,
                                        HASH_REMOVE,
                                        NULL);
    /* By this time we must have created the transaction entry */
    Assert(ent != NULL);

    /* Delete the change file and release the stream fileset memory */
    changes_filename(path, subid, xid);
    SharedFileSetDeleteAll(ent->stream_fileset);
    pfree(ent->stream_fileset);
    ent->stream_fileset = NULL;

    /* Delete the subxact file and release the memory, if it exist */
    if (ent->subxact_fileset)
    {
        subxact_filename(path, subid, xid);
        SharedFileSetDeleteAll(ent->subxact_fileset);
        pfree(ent->subxact_fileset);
        ent->subxact_fileset = NULL;
    }
}
----------

Notice how the code calls hash_search(... HASH_REMOVE ...), but then
it deferences the same ent that was returned from that function.

IIUC that is a violation of the hash_search API, whose function
comment (dynahash.c) clearly says not to use the return value in such
a way:

----------
 * Return value is a pointer to the element found/entered/removed if any,
 * or NULL if no match was found.  (NB: in the case of the REMOVE action,
 * the result is a dangling pointer that shouldn't be dereferenced!)
----------

~~

PSA my patch to correct this by firstly doing a HASH_FIND, then only
HASH_REMOVE after we've finished using the ent.

----
Kind Regards,
Peter Smith.
Fujitsu Australia

Attachment

v1-0001-Fix-for-stream_cleanup_files-HASH_REMOVE.patch

Re: replication cleanup code incorrect way to use of HTAB HASH_REMOVE ?

From

Amit Kapila

Date:

21 March 2021, 09:54:44

On Sat, Mar 20, 2021 at 12:54 PM Peter Smith <smithpb2250@gmail.com> wrote:
>
> PSA my patch to correct this by firstly doing a HASH_FIND, then only
> HASH_REMOVE after we've finished using the ent.
>

Why can't we keep using HASH_REMOVE as it is but get the output (entry
found or not) in the last parameter of hash_search API and then
perform Assert based on that? See similar usage in reorderbuffer.c and
rewriteheap.c.

-- 
With Regards,
Amit Kapila.

Re: replication cleanup code incorrect way to use of HTAB HASH_REMOVE ?

From

Peter Smith

Date:

21 March 2021, 21:49:50

On Sun, Mar 21, 2021 at 8:54 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Sat, Mar 20, 2021 at 12:54 PM Peter Smith <smithpb2250@gmail.com> wrote:
> >
> > PSA my patch to correct this by firstly doing a HASH_FIND, then only
> > HASH_REMOVE after we've finished using the ent.
> >
>
> Why can't we keep using HASH_REMOVE as it is but get the output (entry
> found or not) in the last parameter of hash_search API and then
> perform Assert based on that? See similar usage in reorderbuffer.c and
> rewriteheap.c.
>

Changing the Assert doesn't do anything to fix the problem as
described, i.e. dereferencing of ent after the HASH_REMOVE.

The real problem isn't the Assert. It's all those other usages of ent
disobeying the API rule: "(NB: in the case of the REMOVE action, the
result is a dangling pointer that shouldn't be dereferenced!)"

e.g.
-  SharedFileSetDeleteAll(ent->stream_fileset);
-  pfree(ent->stream_fileset);
-  ent->stream_fileset = NULL;
-  if (ent->subxact_fileset)
-  SharedFileSetDeleteAll(ent->subxact_fileset);
-  pfree(ent->subxact_fileset);
-  ent->subxact_fileset = NULL;

------
Kind Regards,
Peter Smith
Fujitsu Australia.

Re: replication cleanup code incorrect way to use of HTAB HASH_REMOVE ?

From

Thomas Munro

Date:

21 March 2021, 22:20:37

On Mon, Mar 22, 2021 at 10:50 AM Peter Smith <smithpb2250@gmail.com> wrote:
> The real problem isn't the Assert. It's all those other usages of ent
> disobeying the API rule: "(NB: in the case of the REMOVE action, the
> result is a dangling pointer that shouldn't be dereferenced!)"

I suppose the HASH_REMOVE case could clobber the object with 0x7f if
CLOBBER_FREED_MEMORY is defined (typically assertion builds), or
alternatively return some other non-NULL but poisoned pointer, so that
problems of this ilk  blow up in early testing.

Re: replication cleanup code incorrect way to use of HTAB HASH_REMOVE ?

From

Peter Smith

Date:

21 March 2021, 22:44:15

On Mon, Mar 22, 2021 at 9:21 AM Thomas Munro <thomas.munro@gmail.com> wrote:
>
> On Mon, Mar 22, 2021 at 10:50 AM Peter Smith <smithpb2250@gmail.com> wrote:
> > The real problem isn't the Assert. It's all those other usages of ent
> > disobeying the API rule: "(NB: in the case of the REMOVE action, the
> > result is a dangling pointer that shouldn't be dereferenced!)"
>
> I suppose the HASH_REMOVE case could clobber the object with 0x7f if
> CLOBBER_FREED_MEMORY is defined (typically assertion builds), or
> alternatively return some other non-NULL but poisoned pointer, so that
> problems of this ilk  blow up in early testing.

+1, but not sure if the poisoned ptr alternative can work because some
code (e.g see RemoveTargetIfNoLongerUsed function) is asserting the
return ptr actual value, not just its NULL-ness.

------
Kind Regards,
Peter Smith.
Fujitsu Australia.

Re: replication cleanup code incorrect way to use of HTAB HASH_REMOVE ?

From

Andres Freund

Date:

21 March 2021, 22:56:15

Hi,

On 2021-03-22 11:20:37 +1300, Thomas Munro wrote:
> On Mon, Mar 22, 2021 at 10:50 AM Peter Smith <smithpb2250@gmail.com> wrote:
> > The real problem isn't the Assert. It's all those other usages of ent
> > disobeying the API rule: "(NB: in the case of the REMOVE action, the
> > result is a dangling pointer that shouldn't be dereferenced!)"

Right that's clearly not ok.

> I suppose the HASH_REMOVE case could clobber the object with 0x7f if
> CLOBBER_FREED_MEMORY is defined (typically assertion builds)

Yea. Plus VALGRIND_MAKE_MEM_NOACCESS() (requiring a
VALGRIND_MAKE_MEM_UNDEFINED() when reusing) or at least a
VALGRIND_MAKE_MEM_UNDEFINED().

>or alternatively return some other non-NULL but poisoned pointer, so
>that problems of this ilk blow up in early testing.

IMO it's just a bad API to combine the different use cases into
hash_search(). It's kinda defensible to have FIND/ENTER/ENTER_NULL go
through the same function (although I do think it makes code harder to
read), but having HASH_REMOVE is just wrong. The only reason for
returning a dangling pointer is that that's the obvious way to check if
something was found.

If they weren't combined we could tell newer compilers that the memory
shouldn't be accessed after the HASH_REMOVE anymore. And it'd remove
some unnecessary branches in performance critical code...

But I guess making dynahash not terrible from that POV basically means
replacing all of dynahash. Having all those branches for partitioned
hashes, actions are really not great.

Greetings,

Andres Freund

Re: replication cleanup code incorrect way to use of HTAB HASH_REMOVE ?

From

Amit Kapila

Date:

22 March 2021, 02:27:15

On Mon, Mar 22, 2021 at 3:20 AM Peter Smith <smithpb2250@gmail.com> wrote:
>
> On Sun, Mar 21, 2021 at 8:54 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> >
> > On Sat, Mar 20, 2021 at 12:54 PM Peter Smith <smithpb2250@gmail.com> wrote:
> > >
> > > PSA my patch to correct this by firstly doing a HASH_FIND, then only
> > > HASH_REMOVE after we've finished using the ent.
> > >
> >
> > Why can't we keep using HASH_REMOVE as it is but get the output (entry
> > found or not) in the last parameter of hash_search API and then
> > perform Assert based on that? See similar usage in reorderbuffer.c and
> > rewriteheap.c.
> >
>
> Changing the Assert doesn't do anything to fix the problem as
> described, i.e. dereferencing of ent after the HASH_REMOVE.
>
> The real problem isn't the Assert. It's all those other usages of ent
> disobeying the API rule: "(NB: in the case of the REMOVE action, the
> result is a dangling pointer that shouldn't be dereferenced!)"
>

Right, that is a problem. I see that your patch will fix it. Thanks.

-- 
With Regards,
Amit Kapila.

Re: replication cleanup code incorrect way to use of HTAB HASH_REMOVE ?

From

Amit Kapila

Date:

23 March 2021, 05:22:46

On Mon, Mar 22, 2021 at 7:57 AM Amit Kapila <amit.kapila16@gmail.com> wrote:
>
> On Mon, Mar 22, 2021 at 3:20 AM Peter Smith <smithpb2250@gmail.com> wrote:
> >
> > On Sun, Mar 21, 2021 at 8:54 PM Amit Kapila <amit.kapila16@gmail.com> wrote:
> > >
> > > On Sat, Mar 20, 2021 at 12:54 PM Peter Smith <smithpb2250@gmail.com> wrote:
> > > >
> > > > PSA my patch to correct this by firstly doing a HASH_FIND, then only
> > > > HASH_REMOVE after we've finished using the ent.
> > > >
> > >
> > > Why can't we keep using HASH_REMOVE as it is but get the output (entry
> > > found or not) in the last parameter of hash_search API and then
> > > perform Assert based on that? See similar usage in reorderbuffer.c and
> > > rewriteheap.c.
> > >
> >
> > Changing the Assert doesn't do anything to fix the problem as
> > described, i.e. dereferencing of ent after the HASH_REMOVE.
> >
> > The real problem isn't the Assert. It's all those other usages of ent
> > disobeying the API rule: "(NB: in the case of the REMOVE action, the
> > result is a dangling pointer that shouldn't be dereferenced!)"
> >
>
> Right, that is a problem. I see that your patch will fix it. Thanks.
>

Pushed your patch.

-- 
With Regards,
Amit Kapila.