O(n^2) aggregates - Mailing list pgsql-hackers

From Gregory Stark
Subject O(n^2) aggregates
Date
Msg-id 87tzmqfv4v.fsf@oxford.xeocode.com
Whole thread Raw
Responses Re: O(n^2) aggregates  (Tom Lane <tgl@sss.pgh.pa.us>)
Re: O(n^2) aggregates  ("Pavel Stehule" <pavel.stehule@gmail.com>)
List pgsql-hackers
I was trying to test my patch to do posix_fadvise to speed up bitmap heap
scans (with disappointing results so far) and ran into a bit of a gotcha. I'm
not sure where this should be documented but it probably should be somewhere.

In order to test bitmap heap scan I had to build an array and use the = ANY(a)
form. (The natural approach of building a table of values to search for
produces a hash join or nested loop -- it may make sense to add a new kind of
join which uses an outer bitmap heap scan.)

So I defined an aggregate to group up the random values in an array in the
usual fashion:

CREATE AGGREGATE arrayize(anyelement) (   SFUNC = array_append,   STYPE = anyarray,   INITCOND = '{}'
);

and ran queries like:

select count(*) from hugewhere h = any ((select arrayize( (1+random()*300000000)::integer )                 from
generate_series(1,1000)              )::integer[])
 

To test the behaviour for larger and larger samples I bumped the "1000" up
further and further and noticed a larger and large pause in disk access. When
I tried 40000 the query took over 47 minutes nearly all of which had one cpu
pegged at 100%.

What's going on is that arrayize() is actually a O(n^2) algorithm since each
transition requires creating a copy of the entire array.

The solution to this would analogous to what we did to count(). We would need
to add a field to ArrayMetaState which is stored in fn_extra to remember the
last array returned. Then if array_push notices it has been called from an
aggregate context it can store its result in there. The next time it would
extend that array in place (which is code which doesn't currently exist),
possibly repallocing it and return the same pointer.

It's a bit of a hack but I think this is going to be a pretty common use case
and I don't see any more general solution.

--  Gregory Stark EnterpriseDB          http://www.enterprisedb.com Ask me about EnterpriseDB's Slony Replication
support!


pgsql-hackers by date:

Previous
From: Andrew Dunstan
Date:
Subject: Re: [BUGS] BUG #3799: csvlog skips some logs
Next
From: Alvaro Herrera
Date:
Subject: Re: [BUGS] BUG #3799: csvlog skips some logs