Re: "pivot aggregation" with a patched intarray - Mailing list pgsql-hackers
From | Marc Mamin |
---|---|
Subject | Re: "pivot aggregation" with a patched intarray |
Date | |
Msg-id | B6F6FD62F2624C4C9916AC0175D56D8828A80FC4@jenmbs01.ad.intershop.net Whole thread Raw |
In response to | "pivot aggregation" with a patched intarray (Marc Mamin <M.Mamin@intershop.de>) |
List | pgsql-hackers |
(reposted, this time with attachment. sorry) Hello, (sorry for this long post) I have patched intarray with 3 additional functions in order to count[distinct] event IDs into arrays, whereas the array position correspond to the integer values. (mimic column oriented storage) I need to use an array for the event IDs as I don't know how many of them exist, and the list may increase slowly upon thetime. The first results are quite promising. Although this approach is only usable for a narrow set of use cases (*), I wonder if I should look at other places in the source code to achieve a better implementation and if there are discussions or plan to enhance Postgres with some support for such kind of storage (vectors of int). (*) The array sizes should ideally not exceed a few tens and the number of "events" should be unknown, otherwise using one column per event would be faster) use case ======== c: category s: session e: event int4 (small range, mostly less than 20) c s e - - - 1 1 1 1 1 1 1 1 3 1 2 1 2 2 3 3 1 5 WITH SES AS ( SELECT s, c, icount_to_array(e) ecount, iarray_flag (e) edistinct FROM foo GROUP BY s, c) SELECT c, iarray_sum(ecount) ecount, iarray_sum(edistinct)edistinct FROM SES GROUP BY c c ecount edistinct - ------- --------- 1 [3,0,1] [2,0,1] 2 [0,0,1] [0,0,1] 3 [0,0,0,0,1] [0,0,0,0,1] e.g.: the event 1 was met 3 times in the category 1, in 2 distinct sessions. source code =========== The 3 aggregates use following functions: isumv: isumv([1,1,1],[0,0,1,3]) = [1,1,2,3]) iarray_addat : iarray_addat([3,3,3],2) = [3,4,3] iarray_setat : iarray_setat([0,0],2) = [0,1] iarray_setat([0,1],2) = [0,1] I've added them at the end of _int_op.c from intarray (attached) missing in code: checks for integer overflow and that the array lower bound are all 1. These are my first C lines ever and I've never learned it. Hence I'm open for critics ;-) I've started with this great posting by Craig Ringer: http://stackoverflow.com/questions/16992339/why-is-postgresql-array-access-so-much-faster-in-c-than-in-pl-pgsql?answertab=votes#tab-top The rest is mostly copy and paste from other parts of intarray. iarray_setat is just to set a "bitmap position" to 1, possibly enlarging it when required. It is a trivial modification from iarray_addat It should be possible to implement this more efficiently with bit[] or varbit, but this lies beyond my C capbilities. performances & advantage of icount_to_array =========================================== As this aggregate allows to reduce the cardinality of GROUP BYs it often performs better than a vertical approach. With the horizontal storage, the result can be stored in smaller tables with much less rows, hence bringing more advantages when it comes to indices. e.g.: select icount_to_array(user_id) from foo {256655,0,0,1075,0,1,91154,36,0 (...) Aggregate (cost=36930.44..36930.45 rows=1 width=4) (actual time=333.378..333.378 rows=1 loops=1) -> Seq Scan on foo (cost=0.00..32279.95 rows=1860195 width=4) (actual time=0.010..131.117 rows=1860179 loops=1) Total runtime: 333.420 ms select user_id, count(*) from foo group by user_id order by 1 Sort (cost=41582.75..41582.87 rows=48 width=4) (actual time=492.638..492.638 rows=79 loops=1) Sort Key: user_id Sort Method: quicksort Memory: 28kB -> HashAggregate (cost=41580.93..41581.41 rows=48 width=4) (actual time=492.606..492.615 rows=79 loops=1) -> Seq Scan on foo (cost=0.00..32279.95 rows=1860195 width=4) (actual time=0.010..128.800 rows=1860179 loops=1) Total runtime: 492.699 ms 1;256656 4;1075 7;91157 8;36 17;1455583 (...) It will swap later on --------------------- ... which result in a significant difference in some cases. create temp table ev AS SELECT * FROM ( generate_series(1,1000)s cross join generate_series(1,500)c cross join generate_series(1,15)e cross join generate_series(1,5) repeat ) explain analyze select s, c, icount_to_array(e) from ev group by s,c HashAggregate (cost=830575.54..830975.54 rows=40000 width=12) (actual time=19188.384..19379.154 rows=500000 loops=1) -> Seq Scan on ev (cost=0.00..561487.31 rows=35878431 width=12) (actual time=0.046..4849.977 rows=37500000 loops=1) Total runtime: 19399.151 ms explain analyze select s, c, e, count(*) from ev group by s,c,e GroupAggregate (cost=5589186.88..6073545.71 rows=3587844 width=12) (actual time=63290.168..89336.265 rows=7500000 loops=1) -> Sort (cost=5589186.88..5678882.96 rows=35878431 width=12) (actual time=63290.156..83981.772 rows=37500000 loops=1) Sort Key: s, c, e Sort Method: external merge Disk: 806424kB -> Seq Scan on ev (cost=0.00..561487.31 rows=35878431 width=12) (actual time=0.039..4957.481 rows=37500000 loops=1) Total runtime: 89680.844 ms Aggregates definition: ====================== CREATE AGGREGATE icount_to_array(integer) ( SFUNC=iarray_addat, STYPE=int4[], INITCOND='{0}' ); CREATE AGGREGATE iarray_flag(integer) ( SFUNC=iarray_setat, STYPE=int4[], INITCOND='{0}' ); CREATE AGGREGATE iarray_sum(int[]) ( SFUNC=isumv, STYPE=int[], INITCOND='{0}' ); regards, Marc Mamin
Attachment
pgsql-hackers by date: