"pivot aggregation" with a patched intarray - Mailing list pgsql-hackers

From Marc Mamin
Subject "pivot aggregation" with a patched intarray
Date
Msg-id B6F6FD62F2624C4C9916AC0175D56D8828A80FA7@jenmbs01.ad.intershop.net
Whole thread Raw
Responses Re: "pivot aggregation" with a patched intarray
Re: "pivot aggregation" with a patched intarray
List pgsql-hackers
Hello,

(sorry for this long post)

I have patched intarray with 3 additional functions in order to count[distinct] event IDs
into arrays, whereas the array position correspond to the integer values. (mimic column oriented storage)

I need to use an array for the event IDs as I don't know how many of them exist, and the list may increase slowly upon
thetime. 

The first results are quite promising.

Although this approach is only usable for a narrow set of use cases (*),
I wonder if I should look at other places in the source code to achieve a better implementation
and if there are discussions or plan to enhance Postgres with some support for such kind of storage (vectors of int).


(*) The array sizes should ideally not exceed a few tens    and the number of "events" should be unknown, otherwise
usingone column per event would be faster) 


use case
========

c: category
s: session
e: event int4 (small range, mostly less than 20)

c   s   e
-   -   -
1   1   1
1   1   1
1   1   3
1   2   1
2   2   3
3   1   5

WITH SES AS ( SELECT s, c,         icount_to_array(e) ecount,        iarray_flag    (e) edistinct FROM foo GROUP BY s,
c)
SELECT  c,        iarray_sum(ecount) ecount,        iarray_sum(edistinct)edistinct
FROM SES GROUP BY c

c        ecount          edistinct
-        -------         ---------
1        [3,0,1]         [2,0,1]
2        [0,0,1]         [0,0,1]
3        [0,0,0,0,1]     [0,0,0,0,1]

e.g.: the event 1 was met 3 times in the category 1, in 2 distinct sessions.


source code
===========
The 3 aggregates use following functions:

isumv:          isumv([1,1,1],[0,0,1,3]) = [1,1,2,3])

iarray_addat :  iarray_addat([3,3,3],2) = [3,4,3]

iarray_setat :  iarray_setat([0,0],2) = [0,1]                iarray_setat([0,1],2) = [0,1]


I've added them at the end of _int_op.c from intarray (attached)

missing in code: checks for integer overflow and that the array lower bound are all 1.

These are my first C lines ever and I've never learned it.
Hence I'm open for critics ;-)
I've started with this great posting by Craig
Ringer:http://stackoverflow.com/questions/16992339/why-is-postgresql-array-access-so-much-faster-in-c-than-in-pl-pgsql?answertab=votes#tab-top
The rest is mostly copy and paste from other parts of intarray.

iarray_setat is just to set a "bitmap position" to 1, possibly enlarging it when required.
It is a trivial modification from iarray_addat
It should be possible to implement this more efficiently with bit[] or varbit, but this
lies beyond my C capbilities.

performances & advantage of icount_to_array
===========================================
As this aggregate allows to reduce the cardinality of GROUP BYs
it often performs better than a vertical approach.

With the horizontal storage, the result can be stored in smaller tables with much less rows,
hence bringing more advantages when it comes to indices.



e.g.:

select icount_to_array(user_id) from foo

{256655,0,0,1075,0,1,91154,36,0 (...)

Aggregate  (cost=36930.44..36930.45 rows=1 width=4) (actual time=333.378..333.378 rows=1 loops=1) ->  Seq Scan on foo
(cost=0.00..32279.95rows=1860195 width=4) (actual time=0.010..131.117 rows=1860179 loops=1) 
Total runtime: 333.420 ms


select user_id, count(*) from foo group by user_id order by 1

Sort  (cost=41582.75..41582.87 rows=48 width=4) (actual time=492.638..492.638 rows=79 loops=1) Sort Key: user_id Sort
Method:quicksort  Memory: 28kB ->  HashAggregate  (cost=41580.93..41581.41 rows=48 width=4) (actual
time=492.606..492.615rows=79 loops=1)       ->  Seq Scan on foo (cost=0.00..32279.95 rows=1860195 width=4) (actual
time=0.010..128.800rows=1860179 loops=1) 
Total runtime: 492.699 ms

1;256656
4;1075
7;91157
8;36
17;1455583
(...)

It will swap later on
---------------------
... which result in a significant difference in some cases.

create temp table ev AS SELECT * FROM (
generate_series(1,1000)s cross join
generate_series(1,500)c cross join
generate_series(1,15)e cross join
generate_series(1,5) repeat
)

explain analyze select s, c,  icount_to_array(e)  from ev group by s,c

HashAggregate  (cost=830575.54..830975.54 rows=40000 width=12) (actual time=19188.384..19379.154 rows=500000 loops=1)
-> Seq Scan on ev  (cost=0.00..561487.31 rows=35878431 width=12) (actual time=0.046..4849.977 rows=37500000 loops=1) 
Total runtime: 19399.151 ms

explain analyze select s, c, e, count(*) from ev group by s,c,e

GroupAggregate  (cost=5589186.88..6073545.71 rows=3587844 width=12) (actual time=63290.168..89336.265 rows=7500000
loops=1)->  Sort  (cost=5589186.88..5678882.96 rows=35878431 width=12) (actual time=63290.156..83981.772 rows=37500000
loops=1)      Sort Key: s, c, e       Sort Method: external merge  Disk: 806424kB       ->  Seq Scan on ev
(cost=0.00..561487.31rows=35878431 width=12) (actual time=0.039..4957.481 rows=37500000 loops=1) 
Total runtime: 89680.844 ms


Aggregates definition:
======================

  CREATE AGGREGATE icount_to_array(integer) (    SFUNC=iarray_addat,    STYPE=int4[],    INITCOND='{0}'  );
  CREATE AGGREGATE iarray_flag(integer) (    SFUNC=iarray_setat,    STYPE=int4[],    INITCOND='{0}'  );
  CREATE AGGREGATE iarray_sum(int[]) (    SFUNC=isumv,    STYPE=int[],    INITCOND='{0}'  );


regards,

Marc Mamin



pgsql-hackers by date:

Previous
From: Andres Freund
Date:
Subject: Re: Proposal for CSN based snapshots
Next
From: Teodor Sigaev
Date:
Subject: Re: SP-GiST bug.