Thread: array_agg() does not stop aggregating according to HAVING clause
Hello list,

I have a query that goes through *billions* of rows and, for the columns that have an infrequent "datatag" (HAVING count(test_datatag_n) < 10), selects all the IDs of the matching entries (array_agg(run_n)). Here is the full query:

INSERT INTO infrequent_datatags_in_this_chunk
  SELECT datatag, datatags.datatag_n, array_agg(run_n)
    FROM runs_raw
    JOIN datatags USING(datatag_n)
    WHERE workitem_n >= 295
      AND workitem_n <  714218
      AND datatag IS NOT NULL
    GROUP BY datatags.datatag_n
    HAVING count(datatag_n) < 10
       AND count(datatag_n) > 0   -- not really needed because of the JOIN above
;

The runs_raw table has run_n as its primary key and an index on workitem_n. The datatags table is a key-value store with datatag_n as its primary key.

The problem is that this query is extremely slow (5 hours), most likely because it creates tens of gigabytes of temporary files, as I see in the logs. I suspect it is writing to disk the array_agg(run_n) of *all* groups, not only those satisfying HAVING count(datatag_n) < 10. (I might be wrong, though; this is only an assumption based on the amount of data written, and I don't know of any way to examine the temporary files.) While this query goes through billions of rows, the ones with infrequent datatags are maybe 10M.

How do I tell Postgres to stop aggregating once count >= 10?

Thank you in advance,
Dimitris
Dimitrios Apostolou <jimis@gmx.net> writes:
> I have a query that goes through *billions* of rows and for the columns
> that have an infrequent "datatag" (HAVING count(test_datatag_n)<10) it
> selects all the IDs of the entries (array_agg(run_n)). Here is the full
> query:
>
> INSERT INTO infrequent_datatags_in_this_chunk
>   SELECT datatag, datatags.datatag_n, array_agg(run_n)
>     FROM runs_raw
>     JOIN datatags USING(datatag_n)
>     WHERE workitem_n >= 295
>       AND workitem_n <  714218
>       AND datatag IS NOT NULL
>     GROUP BY datatags.datatag_n
>     HAVING count(datatag_n) < 10
>        AND count(datatag_n) > 0 -- Not really needed because of the JOIN above
> ;
>
> The problem is that this is extremely slow (5 hours), most likely because
> it creates tens of gigabytes of temporary files as I see in the logs. I
> suspect that it is writing to disk the array_agg(run_n) of all entries and
> not only those HAVING count(datatag_n)<10.

Well, yes: the two aggregates (array_agg and count) are computed concurrently in a single Aggregate plan node scanning the output of the JOIN. There's no way to apply the HAVING filter until after the aggregation is finished.

I think this approach is basically forced by the SQL standard's semantics for grouping/aggregation.

> How do I tell postgres to stop aggregating when count>=10?

The only way to do this would be to do two separate passes of aggregation in separate sub-queries. Perhaps like

WITH rare AS (
  SELECT datatag_n
    FROM runs_raw
    WHERE workitem_n >= 295
      AND workitem_n <  714218
      AND datatag IS NOT NULL
    GROUP BY datatag_n
    HAVING count(datatag_n) < 10
       AND count(datatag_n) > 0
)
INSERT INTO infrequent_datatags_in_this_chunk
  SELECT datatag, datatags.datatag_n, array_agg(run_n)
    FROM runs_raw
    JOIN datatags USING(datatag_n)
    JOIN rare USING(datatag_n)
    GROUP BY datatags.datatag_n
;

I can't tell from what you said which level the workitem_n and datatag conditions go at, so this is just a draft-quality query.
But I think the structure is basically okay, given that you said datatag_n is unique in datatags (so there's no need to join it in the WITH step).

			regards, tom lane
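[Editorial sketch, not part of the original thread:] Tom's two-pass pattern — a cheap counting pass to find the rare keys first, then the expensive array aggregation over only those keys — can be demonstrated end to end on a toy dataset. The sketch below uses Python's built-in sqlite3, with SQLite's group_concat standing in for PostgreSQL's array_agg; the table and data are invented for illustration.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE runs_raw (run_n INTEGER PRIMARY KEY, datatag_n INTEGER)")
# Tag 1 appears 3 times (rare); tag 2 appears 12 times (frequent).
con.executemany("INSERT INTO runs_raw(datatag_n) VALUES (?)",
                [(1,)] * 3 + [(2,)] * 12)

rows = con.execute("""
    WITH rare AS (
        -- Pass 1: a count-only aggregate finds the rare tags cheaply,
        -- without building any per-group ID lists.
        SELECT datatag_n
        FROM runs_raw
        GROUP BY datatag_n
        HAVING count(*) < 10
    )
    -- Pass 2: build the ID list only for groups that survived pass 1.
    SELECT datatag_n, group_concat(run_n)   -- array_agg(run_n) in PostgreSQL
    FROM runs_raw
    JOIN rare USING (datatag_n)
    GROUP BY datatag_n
""").fetchall()
print(rows)  # only the rare tag (1) appears, with its three run_n values
```

The point of the structure is that the memory- and disk-hungry aggregate never sees rows belonging to frequent tags at all.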
On Sat, 17 Aug 2024, Tom Lane wrote:
> Well, yes: the two aggregates (array_agg and count) are computed
> concurrently in a single Aggregate plan node scanning the output
> of the JOIN. There's no way to apply the HAVING filter until
> after the aggregation is finished.
>
> I think this approach is basically forced by the SQL standard's
> semantics for grouping/aggregation.

FWIW I also tried:

HAVING array_length(array_agg(run_n), 1) < 10;

but I saw the same amount of temporary files, at least during the short duration of my test run.

Thank you, I will split this into two passes as you suggested. It's just that I'm already doing another three passes over this table for different calculations (different GROUP BY, different WHERE clauses), and I was hoping to minimize the total time spent. But avoiding the array_agg() over everything is my top priority at the moment, so I'll definitely try it.

Regards,
Dimitris
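[Editorial sketch, not proposed in the thread:] a related single-statement variant uses a window count to drop rows of frequent tags before the expensive aggregate runs. Note this is not free either — the window function itself must see every row — but the large per-group ID lists are never built for frequent tags. Again a toy sqlite3 sketch (window functions need SQLite >= 3.25), with group_concat standing in for array_agg:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE runs_raw (run_n INTEGER PRIMARY KEY, datatag_n INTEGER)")
# Tag 1 appears 3 times (rare); tag 2 appears 12 times (frequent).
con.executemany("INSERT INTO runs_raw(datatag_n) VALUES (?)",
                [(1,)] * 3 + [(2,)] * 12)

rows = con.execute("""
    SELECT datatag_n, group_concat(run_n)   -- array_agg(run_n) in PostgreSQL
    FROM (
        SELECT run_n, datatag_n,
               count(*) OVER (PARTITION BY datatag_n) AS n  -- per-tag row count
        FROM runs_raw
    ) sub
    WHERE n < 10      -- rows of frequent tags are discarded before aggregation
    GROUP BY datatag_n
""").fetchall()
print(rows)  # only the rare tag (1) survives the filter
```

Whether this beats the two-pass form in PostgreSQL depends on the plan it gets; EXPLAIN (ANALYZE, BUFFERS) on both is the way to compare temp-file behavior.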