Re: Partial aggregates pushdown - Mailing list pgsql-hackers
From | Fujii.Yuki@df.MitsubishiElectric.co.jp" |
---|---|
Subject | Re: Partial aggregates pushdown |
Date | |
Msg-id | TYAPR01MB5514F0CBD9CD4F84A261198195562@TYAPR01MB5514.jpnprd01.prod.outlook.com Whole thread Raw |
In response to | Re: [CAUTION!! freemail] Re: Partial aggregates pushdown (vignesh C <vignesh21@gmail.com>) |
Responses |
Re: Partial aggregates pushdown
|
List | pgsql-hackers |
Hi. Mr.Haas, hackers. I apologize for the significant delay since my last post. I have conducted investigations and considerations regarding the remaining tasks as follows. Would it be possible for you to review them? In particular, could you please confirm if the approach mentioned in 1. is acceptable? If there are no issues with the direction outlined in 1., I plan to make a simple prototype based on this approach. 1. Transmitting state value safely between machines > From: Robert Haas <robertmhaas@gmail.com> > Sent: Wednesday, December 6, 2023 10:25 PM > the problems around transmitting > serialized bytea blobs between machines that can't be assumed to fully trust each other will need to be addressed in some > way, which seems like it will require a good deal of design work, forming some kind of consensus, and then implementation > work to follow. I have considered methods for safely transmitting state values between different machines. I have taken into account the version policy of PostgreSQL (5 years of support) and the major version release cycle overthe past 10 years (1 year), and as a result, I have made the assumption that transmission is allowed only when the differencebetween the local version and the remote version is 5 or less. I believe that by adding new components, "export function" and "import function", to the aggregate functions, and furtherintroducing a new SQL keyword to the query syntax of aggregate expressions, we can address this issue. If the version of the local server is higher than or equal to the version of the remote server, the proposed method can besimplified. The export version mentioned later in (1) would not be necessary. Furthermore, if the version of the localserver matches the version of the remote server, the proposed method can be further simplified. I would appreciate your input on reasonable assumptions regarding the differences in versions between the local server andthe remote server. I will explain the specifications of the export function, import function, the new SQL keyword for aggregate expressions,and the behavior of query processing for partial aggregation separately. (1) Export Function Specification This function is another final function for partial aggregate. This function converts the state value that represents the result of partial aggregation into a format that can be read bythe local server. This function is called instead of the existing finalfunc during the final stage of aggregation when performing partial aggregation. The conversion process described above will be referred to as "export". The argument of an export function is the version of the server that will receive the return value. Hereafter, this version will be referred to as the export version. The concept of an export version is necessary to handle cases where the version of the local server is smaller than the versionof the remote server. The return value of the export function is the transformed state value, and its data type is bytea. For backward compatibility, the developer of the export function must ensure that the export can be performed for major versionsup to five versions prior to the major version of PostgreSQL that the export function is being developed for. For built-in functions, I believe it is necessary to allow for the possibility of not developing the export functionalityfor specific versions in the future (due to reasons such as development burden) after the export function isdeveloped for a certain version. To achieve this, for built-in functions, we will add a column to the pg_aggregate catalog that indicates the presence orabsence of export functionality for each major version, including the major version being developed and the previous fivemajor versions. This column will be named safety_export_versions and will have a data type of boolean[6]. For user-defined functions, we will refer to the extensions option and add an external server option called safety_export_extensions,which will maintain a list of extensions that include only the aggregate functions that can be exportedto the local server version. (2) Import Function Specification The import function is a function that performs validity checks on the exported data and converts it into a state value.The process of this conversion is referred to as importing. The import function is called from postgres_fdw in the local server. The arguments of the import function are the exported data and the export version. The return value of the import function is a state value that can be read on the local server. The import function will terminate with an error if the validity check determines that the exported result cannot be readon the local server. For backward compatibility, developers of the import function must ensure that it can be imported if the export version isup to five versions prior to their own version. (3) The new SQL keyword for aggregate expressions The local server passes the instructions for partial aggregation and the export version to the remote server using SQL keywords.The syntax specification is as follows: aggregate_function(PARTIAL_AGGREGATE(export_version) expr) Here, PARTIAL_AGGREGATE is a keyword that indicates partial aggregation, and export_version is a string constant that indicatesthe export version. (4) The behavior of query processing for partial aggregation I will illustrate the flow of query processing using the example query "select aggfunc(c) from t". In the following explanation, the major version of the remote server will be referred to as remote_version, and the majorversion of the local server will be referred to as local_version. STEP1. Checking the feasibility of partial aggregation pushdown on the local server (i) Retrieving the remote_version The postgres_fdw connects to the remote server and retrieves the remote_version. (ii) Checking the versions The postgres_fdw determines whether the difference between local_version and remote_version is within 5. If the differenceis 6 or more, it is determined that partial aggregation pushdown is not possible. (iii) Checking the import function The postgres_fdw checks the pg_aggregate catalog to see if there is an import function for aggfunc. If there is none, itis determined that partial aggregation pushdown is not possible. (iv) Checking the export function If aggfunc is a built-in function, the postgres_fdw checks the pg_aggregate catalog. It checks if there is a version numberexport_version that satisfies the conditions local_version >= export_version >= local_version-5 and if there is anexport function available for that version. If the version number export_version does not exist, it is determined thatpartial aggregation pushdown is not possible. This check is only performed if local_version >= remote_version. If aggfunc is a user-defined function, the postgres_fdw checks if the extension on which aggfunc depends is included in export_safety_extensions.If it is not included, it is determined that partial aggregation pushdown is not possible. STEP2. Sending a remote query on the local server The query containing the keyword indicating partial aggregation is sent to the remote server. The remote query for the samplequery would be as follows: "select aggfunc(PARTIAL_AGGREGATE(export_version) c) from t" STEP3. Executing the remote query on the remote server The remote server performs partial aggregation for aggfunc. Instead of calling the final function at the last stage of aggregation,the remote server calls the export function with export_version and generates the return value of the partialaggregation. STEP4. Receiving the result of the remote query on the local server The postgres_fdw passes the export_version and the return value of STEP3 to the import function of aggfunc and receives thestate value. The postgres_fdw then passes the received state value to the executor of the local server. 2. The approach of adding SQL keywords > From: Robert Haas <robertmhaas@gmail.com> > Sent: Tuesday, November 21, 2023 5:52 AM > I do have a concern about this, though. It adds a lot of bloat. It adds a whole lot of additional entries to pg_aggregate,and > every new aggregate we add in the future will require a bonus entry for this, and it needs a bunch of new pg_proc entriesas > well. One idea that I've had in the past is to instead introduce syntax that just does this, without requiring a separate > aggregate definition in each case. > For example, maybe instead of changing string_agg(whatever) to string_agg_p_text_text(whatever), you can say > PARTIAL_AGGREGATE > string_agg(whatever) or string_agg(PARTIAL_AGGREGATE whatever) or something. Then all aggregates could be treated > in a generic way. I'm not completely sure that's better, but I think it's worth considering. I have prototyped an approach using SQL keywords for the patch that does not include the functionality of Step 1. Pleasefind the prototype attached as a file. # I apologize for not including sufficient comments, documentation, and tests in the prototype. Please understand. Mainly, it seems that we can address this by adding handling for the new SQL keywords in the parser and making modificationsto the finalize process for aggregation in the executor. As pointed out by Mr.Haas, it has been realized that the code can be significantly simplified. The additional lines of code, excluding documentation and tests, are as follows. Adding new aggregate functions approach(approach #1): 1,069 Adding new SQL keyword approach(approach #2): 318 As mentioned in 1., I plan to modify the patch by adding SQL keywords in the future. 3. Fixing the behavior when the HAVING clause is present > From: Robert Haas <robertmhaas@gmail.com> > Sent: Tuesday, November 28, 2023 4:08 AM > > On Wed, Nov 22, 2023 at 1:32 AM Alexander Pyhalov <a.pyhalov@postgrespro.ru> wrote: > > Hi. HAVING is also a problem. Consider the following query > > > > SELECT count(a) FROM t HAVING count(a) > 10 - we can't push it down to > > foreign server as HAVING needs full aggregate result, but foreign > > server don't know it. > > I don't see it that way. What we would push to the foreign server would be something like SELECT count(a) FROM t. Then, > after we get the results back and combine the various partial counts locally, we would locally evaluate the HAVING clause > afterward. That is, partial aggregation is a barrier to pushing down HAVING clause itself, but it doesn't preclude pushing > down the aggregation. I have made modifications in the attached patch to ensure that when the HAVING clause is present, the HAVING clause is executedlocally while the partial aggregations are pushed down. Sincerely yours, Yuuki Fujii -- Yuuki Fujii Information Technology R&D Center Mitsubishi Electric Corporation
Attachment
pgsql-hackers by date: