Re: Partial aggregates pushdown - Mailing list pgsql-hackers
From | Robert Haas
---|---
Subject | Re: Partial aggregates pushdown
Msg-id | CA+TgmoYQKF-8vL8sctWuw_dskBiQWcspuNPJuH4ttusuiP529g@mail.gmail.com
In response to | RE: Partial aggregates pushdown ("Fujii.Yuki@df.MitsubishiElectric.co.jp" <Fujii.Yuki@df.MitsubishiElectric.co.jp>)
List | pgsql-hackers
On Wed, Nov 22, 2023 at 5:16 AM Fujii.Yuki@df.MitsubishiElectric.co.jp
<Fujii.Yuki@df.mitsubishielectric.co.jp> wrote:
> I did not choose Approach2 because I was not confident that the disadvantage mentioned in 2.(2)(a)
> would be accepted by the PostgreSQL development community.
> If it is accepted, I think Approach 2 is smarter.
> Could you please provide your opinion on which
> approach is preferable after comparing these two approaches?
> If we cannot say anything without comparing the amount of source code, as Mr.Momjian mentioned,
> we need to estimate the amount of source code required to implement Approach2.

I've had the same concern, namely that approach #2 would draw objections, so I think you were right to be cautious about it. I don't think it is a wonderful approach in all ways, but I do think it is superior to approach #1. If we add dedicated support to the grammar, that is mostly a one-time effort, and after that there should not be much need for anyone to be concerned about it. If we instead add extra aggregates, that generates extra work every time someone writes a patch that adds a new aggregate to core. I have a difficult time believing that anyone will prefer an approach that involves an ongoing maintenance effort of that type over one that doesn't.

One point that seems to me to be of particular importance is that if we add new aggregates, there is a risk that some future aggregate might do so incorrectly, so that the main aggregate works but the secondary aggregate created for this feature does not. That seems like it would be very frustrating for future code authors, so I'd like to avoid that risk as much as we can.

Also, I want to make one other point here about security and reliability. Right now, there is no way for a user to feed arbitrary data to a deserialization function.
Since serialization and deserialization functions are only used in the context of parallel query, we always know that the data fed to the deserialization function must have come from the serialization function on the same machine. Nor can users call the deserialization function directly with arbitrary data of their own choosing, because users cannot call functions that take or return internal.

But with this system, it becomes possible to feed arbitrary data to a deserialization function. The user could redefine the function on the remote side so that it produces arbitrary data of their choosing, and the local deserialization function will ingest it. That's potentially quite a significant problem. Consider, for example, that numericvar_deserialize() does no validity checking on any of the weight, sign, or dscale, but not all values for those fields are legal. Right now that doesn't matter, but if you can feed arbitrary data to that function, then it does. I don't know exactly what the consequences are if you can get that function to spit out a NumericVar with values outside the normal legal range. What can you do then? Store a bogus numeric on disk? Crash the server? Worst case, some problem like this could be a security issue allowing for escalation to superuser; more likely, it would be a crash bug, corrupt your database, or lead to unexpected and strange error messages.

Unfortunately, I have the unpleasant suspicion that most internal-type aggregates will be affected by this problem. Consider, for example, string_agg_deserialize(). Strings are generally one of the least-constrained data types, so you might hope that this function would be OK, but it doesn't look very promising. The first int4 in the serialized representation is the cursor, which would have to be bounds-checked, lest someone provide a cursor that falls outside the bounds of the StringInfo and, perhaps, cause a reference to an arbitrary memory location.
Then the rest of the representation is the actual data, which could be anything. This function is used for both bytea and text; for bytea, letting the payload be anything is OK. But for text, the supplied data shouldn't contain an embedded zero byte or otherwise be invalid in the server encoding. If it were allowed to, that would provide a vector to inject invalidly encoded data into the database. This feature can't be allowed to do that.

What could be a solution to this class of problems? One option is to just give up on supporting this feature for internal-type aggregates for now. That's easy enough to do and just means we have less functionality, but it's sad, because that's functionality we'd like to have. Another approach is to add the necessary sanity checks to the relevant deserialization functions, but that seems a little hard to get right, and it would slow down parallel-query cases, which are probably going to be more common than the use of this feature. I think the slowdown might be significant, too. A third option is to change those aggregates in some way, like giving them a transition function that operates on some data type other than internal, but there again we have to be careful of slowdowns. A final option is to rethink the infrastructure in some way, like having a way to serialize to something other than bytea, for which we already have input functions with adequate checks. For instance, if string_agg_serialize() produced a record containing an integer column and a text or bytea column, we could attempt to ingest that record on the other side, and presumably the right things would happen in the case of any invalid data. But I'm not quite sure what infrastructure would be required to make this kind of idea work.

--
Robert Haas
EDB: http://www.enterprisedb.com