Re: "pivot aggregation" with a patched intarray - Mailing list pgsql-hackers

From Marc Mamin
Subject Re: "pivot aggregation" with a patched intarray
Date
Msg-id B6F6FD62F2624C4C9916AC0175D56D8828A80FC4@jenmbs01.ad.intershop.net
Whole thread Raw
In response to "pivot aggregation" with a patched intarray  (Marc Mamin <M.Mamin@intershop.de>)
List pgsql-hackers
(reposted, this time with attachment. sorry)


Hello,


(sorry for this long post)

I have patched intarray with 3 additional functions in order to count[distinct] event IDs
into arrays, whereas the array position correspond to the integer values. (mimic column oriented storage)

I need to use an array for the event IDs as I don't know how many of them exist, and the list may increase slowly upon
thetime. 

The first results are quite promising.

Although this approach is only usable for a narrow set of use cases (*),
I wonder if I should look at other places in the source code to achieve a better implementation
and if there are discussions or plan to enhance Postgres with some support for such kind of storage (vectors of int).


(*) The array sizes should ideally not exceed a few tens
    and the number of "events" should be unknown, otherwise using one column per event would be faster)


use case
========

c: category
s: session
e: event int4 (small range, mostly less than 20)

c   s   e
-   -   -
1   1   1
1   1   1
1   1   3
1   2   1
2   2   3
3   1   5

WITH SES AS (
  SELECT s, c,
         icount_to_array(e) ecount,
         iarray_flag    (e) edistinct
  FROM foo
  GROUP BY s, c)
SELECT  c,
        iarray_sum(ecount) ecount,
        iarray_sum(edistinct)edistinct
FROM SES GROUP BY c


c        ecount          edistinct
-        -------         ---------
1        [3,0,1]         [2,0,1]
2        [0,0,1]         [0,0,1]
3        [0,0,0,0,1]     [0,0,0,0,1]

e.g.: the event 1 was met 3 times in the category 1, in 2 distinct sessions.


source code
===========
The 3 aggregates use following functions:

isumv:          isumv([1,1,1],[0,0,1,3]) = [1,1,2,3])

iarray_addat :  iarray_addat([3,3,3],2) = [3,4,3]

iarray_setat :  iarray_setat([0,0],2) = [0,1]
                iarray_setat([0,1],2) = [0,1]


I've added them at the end of _int_op.c from intarray (attached)

missing in code: checks for integer overflow and that the array lower bound are all 1.

These are my first C lines ever and I've never learned it.
Hence I'm open for critics ;-)
I've started with this great posting by Craig Ringer:

http://stackoverflow.com/questions/16992339/why-is-postgresql-array-access-so-much-faster-in-c-than-in-pl-pgsql?answertab=votes#tab-top
The rest is mostly copy and paste from other parts of intarray.


iarray_setat is just to set a "bitmap position" to 1, possibly enlarging it when required.
It is a trivial modification from iarray_addat
It should be possible to implement this more efficiently with bit[] or varbit, but this
lies beyond my C capbilities.

performances & advantage of icount_to_array
===========================================
As this aggregate allows to reduce the cardinality of GROUP BYs
it often performs better than a vertical approach.

With the horizontal storage, the result can be stored in smaller tables with much less rows,
hence bringing more advantages when it comes to indices.



e.g.:

select icount_to_array(user_id) from foo

{256655,0,0,1075,0,1,91154,36,0 (...)

Aggregate  (cost=36930.44..36930.45 rows=1 width=4) (actual time=333.378..333.378 rows=1 loops=1)
  ->  Seq Scan on foo (cost=0.00..32279.95 rows=1860195 width=4) (actual time=0.010..131.117 rows=1860179 loops=1)
Total runtime: 333.420 ms


select user_id, count(*) from foo group by user_id order by 1

Sort  (cost=41582.75..41582.87 rows=48 width=4) (actual time=492.638..492.638 rows=79 loops=1)
  Sort Key: user_id
  Sort Method: quicksort  Memory: 28kB
  ->  HashAggregate  (cost=41580.93..41581.41 rows=48 width=4) (actual time=492.606..492.615 rows=79 loops=1)
        ->  Seq Scan on foo (cost=0.00..32279.95 rows=1860195 width=4) (actual time=0.010..128.800 rows=1860179
loops=1)
Total runtime: 492.699 ms

1;256656
4;1075
7;91157
8;36
17;1455583
(...)

It will swap later on
---------------------
... which result in a significant difference in some cases.

create temp table ev AS SELECT * FROM (
generate_series(1,1000)s cross join
generate_series(1,500)c cross join
generate_series(1,15)e cross join
generate_series(1,5) repeat
)

explain analyze select s, c,  icount_to_array(e)  from ev group by s,c

HashAggregate  (cost=830575.54..830975.54 rows=40000 width=12) (actual time=19188.384..19379.154 rows=500000 loops=1)
  ->  Seq Scan on ev  (cost=0.00..561487.31 rows=35878431 width=12) (actual time=0.046..4849.977 rows=37500000 loops=1)
Total runtime: 19399.151 ms

explain analyze select s, c, e, count(*) from ev group by s,c,e

GroupAggregate  (cost=5589186.88..6073545.71 rows=3587844 width=12) (actual time=63290.168..89336.265 rows=7500000
loops=1)
  ->  Sort  (cost=5589186.88..5678882.96 rows=35878431 width=12) (actual time=63290.156..83981.772 rows=37500000
loops=1)
        Sort Key: s, c, e
        Sort Method: external merge  Disk: 806424kB
        ->  Seq Scan on ev  (cost=0.00..561487.31 rows=35878431 width=12) (actual time=0.039..4957.481 rows=37500000
loops=1)
Total runtime: 89680.844 ms


Aggregates definition:
======================


   CREATE AGGREGATE icount_to_array(integer) (
     SFUNC=iarray_addat,
     STYPE=int4[],
     INITCOND='{0}'
   );

   CREATE AGGREGATE iarray_flag(integer) (
     SFUNC=iarray_setat,
     STYPE=int4[],
     INITCOND='{0}'
   );

   CREATE AGGREGATE iarray_sum(int[]) (
     SFUNC=isumv,
     STYPE=int[],
     INITCOND='{0}'
   );




regards,

Marc Mamin

Attachment

pgsql-hackers by date:

Previous
From: Teodor Sigaev
Date:
Subject: Re: SP-GiST bug.
Next
From: Andrew Dunstan
Date:
Subject: Re: jsonb access operators inefficiency