Re: Parallelized polymorphic aggs, and aggtype vs aggoutputtype - Mailing list pgsql-hackers

From Tom Lane
Subject Re: Parallelized polymorphic aggs, and aggtype vs aggoutputtype
Date
Msg-id 27247.1466185504@sss.pgh.pa.us
Whole thread Raw
In response to Re: Parallelized polymorphic aggs, and aggtype vs aggoutputtype  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses Re: Parallelized polymorphic aggs, and aggtype vs aggoutputtype  (Robert Haas <robertmhaas@gmail.com>)
Re: Parallelized polymorphic aggs, and aggtype vs aggoutputtype  (David Rowley <david.rowley@2ndquadrant.com>)
List pgsql-hackers
I wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> I don't mind if you feel the need to refactor this.

> I'm not sure yet.  What I think we need to work out first is exactly
> how polymorphic parallel aggregates ought to identify which concrete
> data types they are using.  There were well-defined rules before,
> but how do we adapt those to two levels of aggregate evaluation?

After reviewing things a bit, it seems like the first design-level issue
there is what about polymorphic combine functions.  So far as the type
system is concerned, there's no issue: the declared signature is always
going to be "combine(foo, foo) returns foo" and it doesn't matter whether
foo is an ordinary type, a polymorphic one, or INTERNAL.  The other
question is whether the function itself knows what it's operating on in
the current invocation.  For regular polymorphic types this seems no
different from any other usage.  If the transtype is INTERNAL, then the
type system provides no help; but we could reasonably assume that the
internal representation itself contains as much information as the combine
func needs.  We could add extra arguments like the "finalfunc_extra"
option does for the finalfunc, but I don't really see a need --- that hack
is mainly to satisfy the type system that the finalfunc's signature is
sane, not to tell the finalfunc something it has no other way to find out.
So I think we're probably OK as far as the combine function is concerned.

The situation is much direr as far as serialize/deserialize are concerned.
These basically don't work at all for polymorphic transtypes: if you try
to declare "deserialize(bytea) returns anyarray", the type system won't
let you.  Perhaps that's not an issue because you shouldn't really need
serialize/deserialize for anything except INTERNAL transtype.  However,
there's a far bigger problem which is that "deserialize(bytea) returns
internal" is a blatant security hole, which I see you ignored the defense
against in opr_sanity.sql.  The claim in the comment there that it's okay
if we check for aggregate context is a joke --- or haven't you heard that
we allow users to create aggregates?  That's not nearly good enough to
prevent unsafe usage of such functions.  Not to mention that CREATE
FUNCTION won't allow creation of such functions, so extensions are locked
out of using this feature.

This has to be redesigned or else reverted entirely.  I'm not going to
take no for an answer.

A possible solution is to give deserialize an extra dummy argument, along
the lines of "deserialize(bytea, internal) returns internal", thereby
ensuring it can't be called in any non-system-originated contexts.  This
is still rather dangerous if the other argument is variable, as somebody
might be able to abuse an internal-taking function by naming it as the
deserialize function for a maliciously-designed aggregate.  What I'm
inclined to do to lock it down further is to drop the "serialtype"
argument to CREATE AGGREGATE, which seems rather pointless (what else
would you ever use besides bytea?).  Instead, insist that
serialize/deserialize apply *only* when the transtype is INTERNAL, and
their signatures are exactly "serialize(internal) returns bytea" and
"deserialize(bytea, internal) returns internal", never anything else.

A different way of locking things down, which might be cleaner in the
long run, is to invent a new pseudo-type for the sole purpose of being
the serialization type, that is we'd insist on the signatures being
"serialize(internal) returns serialized_internal" and
"deserialize(serialized_internal) returns internal", where
serialized_internal has the representation properties of bytea but is
usable for no other purpose than this.  Not sure it's worth the trouble
though.

Anyway, setting aside the security angle, it doesn't seem like there is
anything wrong with the high-level design for polymorphic cases.  I'm now
thinking the problem I saw is just a garden variety implementation bug
(or bugs).  I'm still not very happy with the confusion around aggtype
though, not least because I suspect it contributed directly to this bug.
I also feel that both the code comments and the user-facing documentation
for this feature are well below project standards, eg where's the
discussion in section 35.10?
        regards, tom lane



pgsql-hackers by date:

Previous
From: Robert Haas
Date:
Subject: Re: 10.0
Next
From: Aleksey Demakov
Date:
Subject: Re: Experimental dynamic memory allocation of postgresql shared memory