Re: Partial aggregates pushdown - Mailing list pgsql-hackers
From | Jelte Fennema-Nio |
---|---|
Subject | Re: Partial aggregates pushdown |
Date | |
Msg-id | CAGECzQTnfUdRO21CkwZeMowFZC3zQ9+dhNp3EB-G52=BXJvyPg@mail.gmail.com Whole thread Raw |
In response to | RE: Partial aggregates pushdown ("Fujii.Yuki@df.MitsubishiElectric.co.jp" <Fujii.Yuki@df.MitsubishiElectric.co.jp>) |
Responses |
RE: Partial aggregates pushdown
|
List | pgsql-hackers |
On Mon, 24 Jun 2024 at 15:03, Fujii.Yuki@df.MitsubishiElectric.co.jp <Fujii.Yuki@df.mitsubishielectric.co.jp> wrote: > I see. I maybe got your proposal. > Refer to your proposal, for avg(int8), > I create a new native type like state_int8_avg > with the new typsend/typreceive functions > and use them to transmit the state value, right? Yes, that's roughly what I had in mind indeed. > That might seem to be a more fundamental solution > because I can avoid adding export/import functions of my proposal, > which are the new components of aggregate function. > I have never considered the solution. > I appreciate your proposal. Thank you :) > However, I still have the following two questions. > > 1. Not necessary components of new native types > Refer to pg_type.dat, typinput and typoutput are required. > I think that in your proposal they are not necessary, > so waste. I think that it is not acceptable. > How can I resolve the problem? I think requiring typinput/typoutput is a benefit personally, because that makes it possible to do PARTIAL_AGGREGATE pushdown to a different PG major version. Also it makes it easier to debug the partial aggregate values when using psql/pgregress. So yes, it requires implementing both binary (send/receive) and text (input/output) serialization, but it also has some benefits. And in theory you might be able to skip implementing the binary serialization, and rely purely on the text serialization to send partial aggregates between servers. > 2. Many new native types > I think that, basically, each aggregate function does need a new native type. > For example, > avg(int8), avg(numeric), and var_pop(int4) has the same transtype, PolyNumAggState. > You said that it is enough to add only one native type like state_poly_num_agg > for supporting them, right? Yes, correct. That's what I had in mind. > But the combine functions of them might have quite different expectation > on the data items of PolyNumAggState like > the range of N(means count) and the true/false of calcSumX2 > (the flag of calculating sum of squares). > The final functions of them have the similar expectation. > So, I think that, responded to your proposal, > each of them need a native data type > like state_int8_avg, state_numeric_avg, for safety. > > And, we do need a native type for an aggregate function > whose transtype is not internal and not pseudo. > For avg(int4), the current transtype is _int8. > However, I do need a validation check on the number of the array > And the positiveness of count(the first element of the array). > Responded to your proposal, > I do need a new native type like state_int4_avg. To help avoid confusion let me try to restate what I think you mean here: You're worried about someone passing in a bogus native type into the final/combine functions and then getting crashes and/or wrong results. With internal type people cannot do this because they cannot manually call the combinefunc/finalfunc because the argument type is internal. To solve this problem your suggestion is to make the type specific to the specific aggregate such that send/receive or input/output can validate the input as reasonable. But this would then mean that we have many native types (and also many deserialize/serialize functions). Assuming that's indeed what you meant, that's an interesting thought, I didn't consider that much indeed. My thinking was that we only need to implement send/receive & input/output functions for these types, and even though their meaning is very different we can serialize them in the same way. As you say though, something like that is already true for avg(int4) today. The way avg(int4) handles this issue is by doing some input validation for every call to its trans/final/combinefunc (see int4_avg_accum, int4_avg_combine, and int8_avg). It checks the length of the array there, but it doesn't check the positiveness of the count. I think that makes sense. IMHO these functions only need to protect against crashes (e.g. null pointer dereferences). But I don't think there is a good reason for them to protect the user against passing in weird data. These functions aren't really meant to be called manually in the first place anyway, so if the user does that and they pass in weird data then I'm fine with them getting a weird result back, even errors are fine (only crashes are not). So as long as our input/output & send/receive functions for state_poly_num_agg handle all the inconsistencies that could cause crashes later on (which I think is pretty simple to do for PolyNumAggState), then I don't think we need state_int8_avg, state_numeric_avg, etc. > > Not a full review but some initial notes: > Thank you. I don't have time today, so I'll answer after tomorrow. Sure, no rush.
pgsql-hackers by date: