Re: Partial aggregates pushdown - Mailing list pgsql-hackers
From | Jelte Fennema-Nio |
---|---|
Subject | Re: Partial aggregates pushdown |
Date | |
Msg-id | CAGECzQSGmJA8zx9BfvQtyPamuk6dujVdE4eZ_sqGuinVnJ7SzQ@mail.gmail.com Whole thread Raw |
In response to | RE: Partial aggregates pushdown ("Fujii.Yuki@df.MitsubishiElectric.co.jp" <Fujii.Yuki@df.MitsubishiElectric.co.jp>) |
List | pgsql-hackers |
On Sun, 7 Jul 2024 at 23:46, Fujii.Yuki@df.MitsubishiElectric.co.jp <Fujii.Yuki@df.mitsubishielectric.co.jp> wrote: > In my mind, Approach1 is superior. Therefore, if there are no objections this week, I plan to resume implementing Approach1next week. I would appreciate it if anyone could discuss the topic with me or ask questions. Honestly, the more I think about this, the more I like Approach2. Not because I disagree with you about some of the limitations of Approach2, but because I'd rather see those limitations fixed in CREATE TYPE, instead of working around these limitations in CREATE AGGREGATE. That way more usages can benefit. Detailed explanation and arguments below. > 1. Extendability > I believe it is crucial to support scenarios where the local and remote major versions may differ in the future (see thebelow). > > https://www.postgresql.org/message-id/4012625.1701120204%40sss.pgh.pa.us From my reading, Tom's concern is that different server versions cannot talk to each other anymore. So as long as this perf optimization is only enabled when server versions are the same, I don't think there is a huge problem if we never implement this. Honestly, I even think making this optimization opt-in at the FDW server creation level would already solve Tom's concert. I do agree that it would be good if we could have cross version partial aggregates though, so it's definitely something to consider. > Regarding this aspect, I consider Approach1 superior to Approach2. The reason is that: > ・The data type of an aggregate function's state value may change with each major version increment. > ・In Approach1, by extending the export/import functionalities to include the major version in which the state value wascreated (refer to p.16 and p.17 of [1]), I can handle such situations. > ・On the other hand, it appears that Approach2 fundamentally lacks the capability to support these scenarios. Approach 2 definitely has some cross-version capabilities, e.g. jsonb_send includes a version. Such an approach can be used to solve a newer coordinator talking to an older worker, if the transtypes are the same. I personally don't think it's worth supporting this optimization for an older coordinator talking to a newer worker. Using binary protocol to talk to from an older server to a newer server doesn't work either. Finally, based on p.16 & p.17 it's unclear to me how cross-version with different transtypes would work. That situation seems inherently incompatible to me. > 2. Amount of codes > Regarding this aspect, I find Approach1 to be better than Approach2. > In Approach1, developers only need to export/import functions and can use a standardized format for transmitting statevalues. > In Approach2, developers have two options: > Option1: Adding typinput/typoutput and typsend/typreceive. > Option2: Adding typinput/typoutput only. > Option1 requires more lines of code, which may be seen as cumbersome by some developers. > Option2 restricts developers to using only text representation for transmitting state values, which I consider limiting. In my opinion this is your strongest argument for Approach1. But you didn't answer my previous two questions yet: On Tue, 25 Jun 2024 at 11:28, Jelte Fennema-Nio <postgres@jeltef.nl> wrote: > So that leaves me two questions: > 1. Maybe CREATE TYPE should allow types without input/output functions > as long as send/receive are defined. For these types text > representation could fall back to the hex representation of bytea. > 2. If for some reason 1 is undesired, then why are transtypes so > special. Why is it fine for them to only have send/receive functions > and not for other types? Basically: I agree with this argument, but I feel like this being a problem for this usecase is probably a sign that we should take the solution a step further and solve this at the CREATE TYPE level instead of allowing people to hack around CREATE TYPE its limitations just for these partial aggregates. > 3. Catalog size > Regarding this point, I believe Approach1 is better than Approach2. > In Approach1, theoretically, it is necessary to add export/import functions to pg_proc for each aggregate. > In Approach2, theoretically, it is necessary to add typoutput/typinput functions (and typsend/typreceive if necessary)to pg_proc and add a native type to pg_type for each aggregate. > I would like to emphasize that we should consider user-defined functions in addition to built-in aggregate functions. > I think most developers prefer to avoid bloating catalogs, even if they may not be able to specify exact reasons. > In fact, in Robert's previous review, he expressed a similar concern (see below). > > https://www.postgresql.org/message-id/CA%2BTgmobvja%2Bjytj5zcEcYgqzOaeJiqrrJxgqDf1q%3D3k8FepuWQ%40mail.gmail.com So, to summarize the difference (assuming we change CREATE TYPE to allow only typsend/typreceive): "Approach 2 adds an additional pg_type entry per aggregate" IMHO this is fine, especially since these types can usually be shared across multiple aggregates. I think the main reason Robert expressed concern before was the level of catalog bloat that was added in that version of the proposal: It required a new function to be added to every existing aggregate. Both Approach1 and Approach2 only require new catalog entries to be added for aggregates with internal transtypes. That means both approaches introduce significantly less bloat than the proposal that Robert expressed concern about. > 4. Developer burden. > Regarding this aspect, I believe Approach1 is better than Approach2. > In Approach1, developers have the following additional tasks: > Task1-1: Create and define export/import functions. > > In Approach2, developers have the following additional tasks: > Task2-1: Create and define typoutput/input functions (and typesend/typreceive functions if necessary). > Task2-2: Define a native type. > > Approach1 requires fewer additional tasks, although the difference may be not substantial. I think the difference here is so small that this isn't really an argument. Especially since you're skipping over Task-1-2: Use new export/import functions in aggregate definition. Which means the effective difference in SQL that needs to be typed is really "CREATE TYPE mytype" (the arguments of the CREATE TYPE, would be part of the aggregate definition in Approach1). In a sense this same counter-argument applies to the catalog bloat section. The only extra data that gets to the catalog for Approach2 is the typename and typeoid, the export/import vs send/receive functions just move between pg_type and pg_aggregate. > 5. Additional columns to catalogs. > Regarding this aspect, Approach2 is better than Approach1. > Approach1 requires additional three columns in pg_aggregate, specifically the aggpartialpushdownsafe flag, export functionreference, and import function reference. > Approach2 does not require any additional columns in catalogs. > However, over the past four years of discussions, no one has expressed concerns about additional columns in catalogs. I agree that the columns aren't a big deal. The main thing I don't like is that now we have two ways of serializing types to be sent over the network, one way for actual types and one way for transtypes of type internal. FINALLY: I thought of one other advantage of using actual types is that at the protocol level these types will be visible too. This allows for some extra safety checks to be performed on the coordinator in case the transtypes are not the same across servers, which might happen with aggregates of extensions even if the PG server versions are the same, but the extension versions aren't. With Approach1 the only way to detect this is the importfunc complaining, but that can only fail with a vague error, like "cannot parse message". Instead of saying something more useful, like "expected type avg_int4_transtype, got type avg_int4_transtype_v2"
pgsql-hackers by date: