Re: Partial aggregates pushdown - Mailing list pgsql-hackers
From | Robert Haas |
---|---|
Subject | Re: Partial aggregates pushdown |
Date | |
Msg-id | CA+TgmoZKrq5+JfTp=RXDygYd_UZp_Z6ii7RFsKxy1430Ewn6rw@mail.gmail.com |
In response to | Re: Partial aggregates pushdown (Ashutosh Bapat <ashutosh.bapat.oss@gmail.com>) |
List | pgsql-hackers |
On Tue, Nov 28, 2023 at 5:24 AM Ashutosh Bapat
<ashutosh.bapat.oss@gmail.com> wrote:
> If my memory serves me right, PGXC implemented partial aggregation
> only when the output of partial aggregate was a SQL data type
> (non-Internal, non-Unknown). But I may be wrong. But at that time,
> JSONB wasn't there or wasn't that widespread.
>
> Problem with Internal is it's just a binary string whose content can
> change across version and which can be interpreted differently across
> different versions. There is no metadata in it to know how to
> interpret it. We can add that metadata to JSONB. The result of partial
> aggregate can be sent as a JSONB. If the local server finds the JSONB
> familiar it will construct the right partial aggregate value otherwise
> it will throw an error. If there's a way to even avoid that error (by
> looking at server version etc.) the error can be avoided too. But
> JSONB leaves very very less chance that the value will be interpreted
> wrong. Downside is we are tying PARTIAL's output to be JSONB thus
> tying SQL syntax with a data type.
>
> Does that look acceptable?

If somebody had gone to the trouble of making this work, and had done a
good job, I wouldn't vote against it, but in a vacuum, I'm not sure it's
the best design. The problem in my view is that working with JSON is not
actually very pleasant. It's super-easy to generate, and super-easy for
humans to read. But parsing and validating it is a pain. You basically
have to have two parsers: one to do syntactical validation, and then a
second one to ensure that the structure of the document and the contents
of each item are as expected. See parse_manifest.c for an example of what
I mean by that. Now, if we add new code, it can reuse the JSON parser
we've already got, so it's not that you need to write a new JSON parser
for every new application of JSON, but the semantic validator (a la
parse_manifest.c) isn't necessarily any less code than a whole new parser
for a bespoke format.

To make that a bit more concrete: for something like string_agg(), is it
easier to write a validator for the existing deserialization function
that accepts a bytea blob, or to write a validator for a JSON blob that
we could be passing instead? My suspicion is that the former is less work
and easier to verify, but it's possible I'm wrong about that and they're
more or less equal. I don't really see any way that the JSON thing is
straight-up better; at best it's a toss-up in terms of amount of code.
Now somebody could still argue that they would like JSON better because
it would be more useful for some purpose other than this feature, and
that is fine, but here I'm just thinking about this feature in
particular.

My personal suspicion is that the easiest way to support internal-type
aggregates here is to convert them to use an array or record type as a
transition state instead, or maybe to serialize the internal state to one
of those things instead of to bytea. I suspect that would allow us to
leverage more of our existing validation infrastructure than using JSON
or sticking with bytea. But I'm certainly amenable to other points of
view. I'm not trying to pretend that my gut feeling is necessarily
correct; I'm just explaining what I currently think.

--
Robert Haas
EDB: http://www.enterprisedb.com
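As a rough illustration of the code-size comparison discussed above: the sketch below is not PostgreSQL code, and the state layout (an avg()-style partial state holding a count and a sum) as well as every name in it is hypothetical. It only shows the shape of the two layers needed to validate a JSON state (syntactic parse plus semantic checks) next to a single parser for a bespoke binary format.

```python
# Toy sketch, not PostgreSQL code. The avg()-style partial state (a count
# and a sum) and all names here are hypothetical; the point is only to
# compare how much validation code each representation needs.

import json
import struct


def validate_json_state(text: str) -> tuple[int, int]:
    """Layer 1: syntactic parse. Layer 2: semantic checks on the document."""
    doc = json.loads(text)  # syntax only; structure is still unchecked
    if not isinstance(doc, dict):
        raise ValueError("state must be a JSON object")
    if set(doc) != {"count", "sum"}:
        raise ValueError("unexpected keys in state")
    count, total = doc["count"], doc["sum"]
    if not isinstance(count, int) or isinstance(count, bool) or count < 0:
        raise ValueError("count must be a non-negative integer")
    if not isinstance(total, int) or isinstance(total, bool):
        raise ValueError("sum must be an integer")
    return count, total


def validate_binary_state(blob: bytes) -> tuple[int, int]:
    """A bespoke format needs its own parser, but it is about the same
    amount of code as the semantic layer above."""
    if len(blob) != 16:
        raise ValueError("state must be exactly 16 bytes")
    count, total = struct.unpack("<qq", blob)  # two little-endian int64s
    if count < 0:
        raise ValueError("count must be non-negative")
    return count, total


if __name__ == "__main__":
    print(validate_json_state('{"count": 3, "sum": 42}'))
    print(validate_binary_state(struct.pack("<qq", 3, 42)))
```

In this toy version, the semantic layer of validate_json_state is roughly the same amount of code as the whole bespoke-format parser, which is the "toss-up in terms of amount of code" the message describes.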