RE: Partial aggregates pushdown - Mailing list pgsql-hackers

From Fujii.Yuki@df.MitsubishiElectric.co.jp"
Subject RE: Partial aggregates pushdown
Date
Msg-id TYAPR01MB551411A7DDFF0B38C5AAEDFE9584A@TYAPR01MB5514.jpnprd01.prod.outlook.com
Whole thread Raw
In response to Re: Partial aggregates pushdown  (Stephen Frost <sfrost@snowman.net>)
Responses Re: Partial aggregates pushdown
List pgsql-hackers
Hi Mr.Haas, hackers.

> From: Robert Haas <robertmhaas@gmail.com>
> Sent: Tuesday, November 28, 2023 5:03 AM
> Also, I want to make one other point here about security and reliability. Right now, there is no way for a user to
feed
> arbitrary data to a deserialization function. Since serialization and deserialization functions are only used in the
contextof
 
> parallel query, we always know that the data fed to the deserialization function must have come from the
serialization
> function on the same machine. Nor can users call the deserialization function directly with arbitrary data of their
own
> choosing, because users cannot call functions that take or return internal. But with this system, it becomes possible
to
> feed arbitrary data to a deserialization function.
> The user could redefine the function on the remote side so that it produces arbitrary data of their choosing, and the
local
> deserialization function will ingest it.
> 
> That's potentially quite a significant problem. Consider for example that numericvar_deserialize() does no validity
> checking on any of the weight, sign, or dscale, but not all values for those fields are legal. Right now that doesn't
matter,
> but if you can feed arbitrary data to that function, then it is. I don't know exactly what the consequences are if
youcan get
 
> that function to spit out a NumericVar with values outside the normal legal range. What can you do then?
> Store a bogus numeric on disk? Crash the server? Worst case, some problem like this could be a security issue
allowingfor
 
> escalation to superuser; more likely, it would be a crash bug, corrupt your database, or lead to unexpected and
strange
> error messages.
> 
> Unfortunately, I have the unpleasant suspicion that most internal-type aggregates will be affected by this problem.
> Consider, for example, string_agg_deserialize(). Generally, strings are one of the least-constrained data types, so
you
> might hope that this function would be OK. But it doesn't look very promising. The first int4 in the serialized
representation
> is the cursor, which would have to be bounds-checked, lest someone provide a cursor that falls outside the bounds of
the
> StringInfo and, maybe, cause a reference to an arbitrary memory location. Then the rest of the representation is the
actual
> data, which could be anything. This function is used for both bytea and for text, and for bytea, letting the payload
be
> anything is OK.
> But for text, the supplied data shouldn't contain an embedded zero byte, or otherwise be invalid in the server
encoding.If
 
> it were, that would provide a vector to inject invalidly encoded data into the database. This feature can't be
allowedto do
 
> that.
I completely overlooked this issue. I should have considered the risks of sending raw state values or serialized state
data directly from remote to local. I apologize.

> What could be a solution to this class of problems? One option is to just give up on supporting this feature for
internal-type
> aggregates for now. That's easy enough to do, and just means we have less functionality, but it's sad because that's
> functionality we'd like to have. Another approach is to add necessary sanity checks to the relevant deserialization
> functions, but that seems a little hard to get right, and it would slow down parallel query cases which are probably
goingto
 
> be more common than the use of this feature. I think the slowdown might be significant, too. A third option is to
change
> those aggregates in some way, like giving them a transition function that operates on some data type other than
internal,
> but there again we have to be careful of slowdowns. A final option is to rethink the infrastructure in some way, like
having
> a way to serialize to something other than bytea, for which we already have input functions with adequate checks.
For
> instance, if string_agg_serialize() produced a record containing an integer column and a text or bytea column, we
could
> attempt to ingest that record on the other side and presumably the right things would happen in the case of any
invalid
> data. But I'm not quite sure what infrastructure would be required to make this kind of idea work.
Thank you very much for providing a direction towards resolving this issue.
As you have suggested as the last option, it seems that expanding the current mechanism of the aggregation
function is the only choice. It may take some time, but I will consider specific solutions.

> From: Robert Haas <robertmhaas@gmail.com>
> Sent: Tuesday, November 28, 2023 4:08 AM
> On Wed, Nov 22, 2023 at 1:32 AM Alexander Pyhalov <a.pyhalov@postgrespro.ru> wrote:
> > Hi. HAVING is also a problem. Consider the following query
> >
> > SELECT count(a) FROM t HAVING count(a) > 10 - we can't push it down to
> > foreign server as HAVING needs full aggregate result, but foreign
> > server don't know it.
> 
> I don't see it that way. What we would push to the foreign server would be something like SELECT count(a) FROM t.
Then,
> after we get the results back and combine the various partial counts locally, we would locally evaluate the HAVING
clause
> afterward. That is, partial aggregation is a barrier to pushing down HAVING clause itself, but it doesn't preclude
pushing
> down the aggregation.
I understand what the problem is. I will try to fix it in the next version.

> From: Robert Haas <robertmhaas@gmail.com>
> Sent: Tuesday, November 28, 2023 5:03 AM
> On Wed, Nov 22, 2023 at 5:16 AM Fujii.Yuki@df.MitsubishiElectric.co.jp
> <Fujii.Yuki@df.mitsubishielectric.co.jp> wrote:
> > I did not choose Approach2 because I was not confident that the
> > disadvantage mentioned in 2.(2)(a) would be accepted by the PostgreSQL development community.
> > If it is accepted, I think Approach 2 is smarter.
> > Could you please provide your opinion on which approach is preferable
> > after comparing these two approaches?
> > If we cannot say anything without comparing the amount of source code,
> > as Mr.Momjian mentioned, we need to estimate the amount of source code required to implement Approach2.
> 
> I've had the same concern, that approach #2 would draw objections, so I think you were right to be cautious about it.
I
> don't think it is a wonderful approach in all ways, but I do think that it is superior to approach #1. If we add
dedicated
> support to the grammar, it is mostly a one-time effort, and after that, there should not be much need for anyone to
be
> concerned about it. If we instead add extra aggregates, then that generates extra work every time someone writes a
patch
> that adds a new aggregate to core. I have a difficult time believing that anyone will prefer an approach that
involvesan
 
> ongoing maintenance effort of that type over one that doesn't.
> 
> One point that seems to me to be of particular importance is that if we add new aggregates, there is a risk that
some
> future aggregate might do that incorrectly, so that the main aggregate works, but the secondary aggregate created for
this
> feature does not work. That seems like it would be very frustrating for future code authors so I'd like to avoid the
riskas
 
> much as we can.
Are you concerned about the hassle and potential human errors of manually adding new partial
aggregation functions, rather than the catalog becoming bloated?
The process of creating partial aggregation functions from aggregation functions can be automated,
so I believe this issue can be resolved. However, automating it may increase the size of the patch
even more, so overall, approach#2 might be better.
To implement approach #2, it would be necessary to investigate how much additional code is required.

Sincerely yours,
Yuuki Fujii

--
Yuuki Fujii
Information Technology R&D Center Mitsubishi Electric Corporation

pgsql-hackers by date:

Previous
From: Peter Eisentraut
Date:
Subject: Re: tablecmds.c/MergeAttributes() cleanup
Next
From: Amit Kapila
Date:
Subject: Re: logical decoding and replication of sequences, take 2