Thread: "pivot aggregation" with a patched intarray

"pivot aggregation" with a patched intarray

From
Marc Mamin
Date:
Hello,

(sorry for this long post)

I have patched intarray with 3 additional functions in order to count [distinct] event IDs
into arrays, where the array positions correspond to the integer values (mimicking column-oriented storage).

I need to use an array for the event IDs as I don't know how many of them exist, and the list may grow slowly over
time.

The first results are quite promising.

Although this approach is only usable for a narrow set of use cases (*),
I wonder if I should look at other places in the source code to achieve a better implementation
and if there are discussions or plan to enhance Postgres with some support for such kind of storage (vectors of int).


(*) The array sizes should ideally not exceed a few tens, and the number of "events" should be unknown; otherwise
using one column per event would be faster.


use case
========

c: category
s: session
e: event int4 (small range, mostly less than 20)

c   s   e
-   -   -
1   1   1
1   1   1
1   1   3
1   2   1
2   2   3
3   1   5

WITH SES AS (
  SELECT s, c,
         icount_to_array(e) ecount,
         iarray_flag    (e) edistinct
  FROM foo
  GROUP BY s, c)
SELECT  c,
        iarray_sum(ecount) ecount,
        iarray_sum(edistinct) edistinct
FROM SES GROUP BY c

c        ecount          edistinct
-        -------         ---------
1        [3,0,1]         [2,0,1]
2        [0,0,1]         [0,0,1]
3        [0,0,0,0,1]     [0,0,0,0,1]

e.g.: event 1 was seen 3 times in category 1, in 2 distinct sessions.


source code
===========
The 3 aggregates use following functions:

isumv:          isumv([1,1,1],[0,0,1,3]) = [1,1,2,3]

iarray_addat :  iarray_addat([3,3,3],2) = [3,4,3]

iarray_setat :  iarray_setat([0,0],2) = [0,1]
                iarray_setat([0,1],2) = [0,1]
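The semantics of the three helpers can be sketched in plain C. This is only an illustrative model on malloc'd buffers with explicit lengths; the real patch operates on PostgreSQL varlena int4[] values with palloc, so the names below mirror the patched functions but are not the actual intarray sources:

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Return a copy of a, zero-filled up to at least newlen elements. */
static int *grow(const int *a, int len, int newlen)
{
    int n = newlen > len ? newlen : len;
    int *r = calloc(n, sizeof(int));
    memcpy(r, a, len * sizeof(int));
    return r;
}

/* isumv: element-wise sum; result length is the larger of the two. */
static int *isumv(const int *a, int alen, const int *b, int blen, int *rlen)
{
    int len = alen > blen ? alen : blen;
    int *r = grow(a, alen, len);
    for (int i = 0; i < blen; i++)
        r[i] += b[i];
    *rlen = len;
    return r;
}

/* iarray_addat: increment the counter at 1-based position pos,
 * enlarging the array with zeros if needed. */
static int *iarray_addat(const int *a, int alen, int pos, int *rlen)
{
    int len = pos > alen ? pos : alen;
    int *r = grow(a, alen, len);
    r[pos - 1] += 1;
    *rlen = len;
    return r;
}

/* iarray_setat: set the "bit" at 1-based position pos to 1 (idempotent). */
static int *iarray_setat(const int *a, int alen, int pos, int *rlen)
{
    int len = pos > alen ? pos : alen;
    int *r = grow(a, alen, len);
    r[pos - 1] = 1;
    *rlen = len;
    return r;
}
```

The examples above then read directly as assertions on these sketches.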


I've added them at the end of _int_op.c from intarray (attached).

Missing in the code: checks for integer overflow and that the array lower bounds are all 1.

These are my first lines of C ever, and I've never formally learned the language,
so I'm open to criticism ;-)
I started with this great posting by Craig Ringer:
http://stackoverflow.com/questions/16992339/why-is-postgresql-array-access-so-much-faster-in-c-than-in-pl-pgsql?answertab=votes#tab-top
The rest is mostly copy and paste from other parts of intarray.

iarray_setat just sets a "bitmap position" to 1, possibly enlarging the array when required.
It is a trivial modification of iarray_addat.
It should be possible to implement this more efficiently with bit[] or varbit, but that
lies beyond my C capabilities.

performance & advantages of icount_to_array
===========================================
As this aggregate reduces the cardinality of GROUP BYs,
it often performs better than a vertical approach.

With the horizontal storage, the result can be stored in smaller tables with far fewer rows,
which brings further advantages when it comes to indexes.



e.g.:

select icount_to_array(user_id) from foo

{256655,0,0,1075,0,1,91154,36,0 (...)

Aggregate  (cost=36930.44..36930.45 rows=1 width=4) (actual time=333.378..333.378 rows=1 loops=1)
  ->  Seq Scan on foo  (cost=0.00..32279.95 rows=1860195 width=4) (actual time=0.010..131.117 rows=1860179 loops=1)
Total runtime: 333.420 ms


select user_id, count(*) from foo group by user_id order by 1

Sort  (cost=41582.75..41582.87 rows=48 width=4) (actual time=492.638..492.638 rows=79 loops=1)
  Sort Key: user_id
  Sort Method: quicksort  Memory: 28kB
  ->  HashAggregate  (cost=41580.93..41581.41 rows=48 width=4) (actual time=492.606..492.615 rows=79 loops=1)
        ->  Seq Scan on foo  (cost=0.00..32279.95 rows=1860195 width=4) (actual time=0.010..128.800 rows=1860179 loops=1)
Total runtime: 492.699 ms

1;256656
4;1075
7;91157
8;36
17;1455583
(...)

It will swap later on
---------------------
... which results in a significant difference in some cases.

create temp table ev AS SELECT * FROM (
  generate_series(1,1000) s cross join
  generate_series(1,500)  c cross join
  generate_series(1,15)   e cross join
  generate_series(1,5)    repeat
)

explain analyze select s, c,  icount_to_array(e)  from ev group by s,c

HashAggregate  (cost=830575.54..830975.54 rows=40000 width=12) (actual time=19188.384..19379.154 rows=500000 loops=1)
  ->  Seq Scan on ev  (cost=0.00..561487.31 rows=35878431 width=12) (actual time=0.046..4849.977 rows=37500000 loops=1)
Total runtime: 19399.151 ms

explain analyze select s, c, e, count(*) from ev group by s,c,e

GroupAggregate  (cost=5589186.88..6073545.71 rows=3587844 width=12) (actual time=63290.168..89336.265 rows=7500000 loops=1)
  ->  Sort  (cost=5589186.88..5678882.96 rows=35878431 width=12) (actual time=63290.156..83981.772 rows=37500000 loops=1)
        Sort Key: s, c, e
        Sort Method: external merge  Disk: 806424kB
        ->  Seq Scan on ev  (cost=0.00..561487.31 rows=35878431 width=12) (actual time=0.039..4957.481 rows=37500000 loops=1)
Total runtime: 89680.844 ms


Aggregates definition:
======================

  CREATE AGGREGATE icount_to_array(integer) (
    SFUNC=iarray_addat,
    STYPE=int4[],
    INITCOND='{0}'
  );

  CREATE AGGREGATE iarray_flag(integer) (
    SFUNC=iarray_setat,
    STYPE=int4[],
    INITCOND='{0}'
  );

  CREATE AGGREGATE iarray_sum(int[]) (
    SFUNC=isumv,
    STYPE=int[],
    INITCOND='{0}'
  );


regards,

Marc Mamin



Re: "pivot aggregation" with a patched intarray

From
Marc Mamin
Date:
(reposted, this time with attachment. sorry)


Attachment

Re: "pivot aggregation" with a patched intarray

From
Michael Paquier
Date:
On Sat, May 31, 2014 at 12:31 AM, Marc Mamin <M.Mamin@intershop.de> wrote:
> I have patched intarray with 3 additional functions in order to count[distinct] event IDs
> into arrays, whereas the array position correspond to the integer values. (mimic column oriented storage)

I didn't look at the feature itself, but here are some comments about
the format of the patch:
- Be careful: the newlines in the file you posted use \r\n, which is
purely Windows stuff... This will generate unnecessary diffs with the
source code
- Here are some guidelines about the code convention:
http://www.postgresql.org/docs/devel/static/source.html
- And here are a couple of rules to respect when submitting a patch:
https://wiki.postgresql.org/wiki/Submitting_a_Patch
Following those rules will make patch review as well as the life of
reviewers easier.
Regards,
--
Michael



Re: "pivot aggregation" with a patched intarray

From
Marc Mamin
Date:
>On Sat, May 31, 2014 at 12:31 AM, Marc Mamin <M.Mamin@intershop.de> wrote:
>> I have patched intarray with 3 additional functions in order to count[distinct] event IDs
>> into arrays, whereas the array position correspond to the integer values. (mimic column oriented storage)
>
>I didn't look at the feature itself, but here are some comments about
>the format of the patch:
>- Be careful the newlines on the file you posted use \r\n, which is
>purely Windows stuff... This will generate unnecessary diffs with the
>source code

oops, will fix

>- Here are some guidelines about the code convention:
>http://www.postgresql.org/docs/devel/static/source.html
>- And here are a couple of rules to respect when submitting a patch:
>https://wiki.postgresql.org/wiki/Submitting_a_Patch
>Following those rules will make patch review as well as the life of
>reviewers easier.

thanks for your comments.
I don't mean to suggest this directly as a patch;
I'm first interested to see if there is some interest
in such an aggregation type.

regards,

Marc Mamin


Re: "pivot aggregation" with a patched intarray

From
Ali Akbar
Date:
2014-06-01 20:48 GMT+07:00 Marc Mamin <M.Mamin@intershop.de>:
>
> >On Sat, May 31, 2014 at 12:31 AM, Marc Mamin <M.Mamin@intershop.de> wrote:
> >> I have patched intarray with 3 additional functions in order to count[distinct] event IDs
> >> into arrays, whereas the array position correspond to the integer values. (mimic column oriented storage)
> >
> >I didn't look at the feature itself, but here are some comments about
> >the format of the patch:
> >- Be careful the newlines on the file you posted use \r\n, which is
> >purely Windows stuff... This will generate unnecessary diffs with the
> >source code
> I don't mean to suggest this directly as a patch;
> I'm first interested to see if there is some interest
> in such an aggregation type.

From what I see, icount_to_array is complementary to the standard
count() aggregate, but it produces an array. If the values are not
sparse, I think the performance and memory/storage benefits you
mentioned will hold. But if the values are sparse, there will be
many 0's; how will it perform?

I'm interested in benchmarking it with some use cases, to confirm
its performance benefits.

--
Ali Akbar



Re: "pivot aggregation" with a patched intarray

From
Marc Mamin
Date:
> -----Original Message-----
> From: Ali Akbar [mailto:the.apaan@gmail.com]
> Sent: Donnerstag, 5. Juni 2014 01:12
> To: Marc Mamin
> Cc: Michael Paquier; pgsql-hackers@postgresql.org
> Subject: Re: [HACKERS] "pivot aggregation" with a patched intarray
> 
> 2014-06-01 20:48 GMT+07:00 Marc Mamin <M.Mamin@intershop.de>:
> >
> > >On Sat, May 31, 2014 at 12:31 AM, Marc Mamin <M.Mamin@intershop.de>
> wrote:
> > >> I have patched intarray with 3 additional functions in order to
> > >> count[distinct] event IDs into arrays, whereas the array position
> > >> correspond to the integer values. (mimic column oriented storage)
> > >
> > >I didn't look at the feature itself, but here are some comments
> about
> > >the format of the patch:
> > >- Be careful the newlines on the file you posted use \r\n, which is
> > >purely Windows stuff... This will generate unnecessary diffs with
> the
> > >source code
> > I don't mean to suggest this directly as a patch; I'm first
> > interested to see if there is some interest for such an aggregation
> > type.
> 
> From what I see, icount_to_array is complementary to the standard
> count() aggregate, but it produces an array. If the values are not
> sparse, I think the performance and memory/storage benefits you
> mentioned will hold. But if the values are sparse, there will be
> many 0's; how will it perform?

I'm thinking about adding a final function to my aggregate that would replace zero values with nulls,
hence transforming the intarray into a standard int[], possibly with a null bitmap and a lower bound that can be > 1.
This will probably degrade the performance considerably, but may reduce the size of the end result for sparse data and
not too small integers...
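The proposed final function could look like this in plain C. In PostgreSQL it would build a regular int4[] (e.g. via construct_md_array with a null bitmap); here the three output pieces are only modeled explicitly, and the `sparse_result`/`finalize_sparse` names are illustrative, not from the patch:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* Turn a dense count vector into a trimmed representation with a
 * null bitmap and a lower bound that can be > 1. */
typedef struct {
    int   lbound;   /* 1-based position of the first non-zero entry */
    int   len;      /* number of kept entries */
    int  *values;   /* kept entries */
    bool *isnull;   /* true where the original entry was 0 */
} sparse_result;

static sparse_result finalize_sparse(const int *counts, int n)
{
    sparse_result r = {0, 0, NULL, NULL};
    int first = 0, last = -1;

    /* Trim leading and trailing zeros. */
    while (first < n && counts[first] == 0)
        first++;
    for (int i = n - 1; i >= first; i--)
        if (counts[i] != 0) { last = i; break; }
    if (last < first)
        return r;               /* all zero: empty result */

    r.lbound = first + 1;
    r.len    = last - first + 1;
    r.values = calloc(r.len, sizeof(int));
    r.isnull = calloc(r.len, sizeof(bool));
    for (int i = 0; i < r.len; i++) {
        r.values[i] = counts[first + i];
        r.isnull[i] = (counts[first + i] == 0);  /* interior zero -> null */
    }
    return r;
}
```

For sparse data this trims all leading/trailing zero slots and marks interior zeros as null, at the cost of one extra pass over the state array.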
 


> 
> I'm interested to benchmark it with some use cases, to confirm the
> performance benefits of it.


Performance should depend greatly on the data distribution and order, as they influence the number of palloc calls.
My first tests showed both better and poorer results.

My primary target is not better performance in the first place, but getting a pivot structure in an early aggregation
stage.
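The dependence on data order can be made concrete: every time a new, larger event ID arrives, the transition value must be enlarged, so the number of (re)allocations depends on the order in which IDs are seen. Growing with headroom bounds that cost. A sketch under those assumptions (the `state`/`addat` names are illustrative, and this models plain realloc, not palloc):

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Aggregate-like state that counts how often it had to reallocate. */
typedef struct {
    int *counts;
    int  len;       /* logical length */
    int  cap;       /* allocated capacity */
    int  nallocs;   /* how many (re)allocations happened */
} state;

/* Add 1 at 1-based pos, doubling capacity when enlargement is needed. */
static void addat(state *st, int pos)
{
    if (pos > st->cap) {
        int newcap = st->cap ? st->cap : 1;
        while (newcap < pos)
            newcap *= 2;
        st->counts = realloc(st->counts, newcap * sizeof(int));
        memset(st->counts + st->cap, 0, (newcap - st->cap) * sizeof(int));
        st->cap = newcap;
        st->nallocs++;
    }
    if (pos > st->len)
        st->len = pos;
    st->counts[pos - 1] += 1;
}
```

With strictly increasing IDs (the worst case for grow-exactly-to-fit) doubling keeps the allocation count logarithmic, and seeing the largest ID first reduces it to a single allocation.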


regards,

Marc Mamin


Attachment

Re: "pivot aggregation" with a patched intarray

From
Ali Akbar
Date:
2014-06-05 17:18 GMT+07:00 Marc Mamin <M.Mamin@intershop.de>:
> I'm thinking about adding a final function to my aggregate that would replace zero values with nulls,
> hence transforming the intarray into a standard int[], possibly with a null bitmap and a lower bound that can be > 1.
> This will probably degrade the performance considerably, but may reduce the size of the end result for sparse data
> and not too small integers...
 
> Performance should depend greatly on the data distribution and order, as they influence the number of palloc calls.
> My first tests showed both better and poorer results.
>
> My primary target is not better performance in the first place, but getting a pivot structure in an early
> aggregation stage.

Usually for pivots I use the crosstab function from tablefunc
(http://www.postgresql.org/docs/9.4/static/tablefunc.html#AEN158550).
If your patch doesn't perform better, it's easier to just use
crosstab. To store the result efficiently, it can be transformed
into an array manually.

PS: as Michael Paquier said above, it's better if you could send the
patch in the .patch file format (see:
https://wiki.postgresql.org/wiki/Working_with_GIT).

-- 
Ali Akbar



Re: "pivot aggregation" with a patched intarray

From
Marc Mamin
Date:
> From: Ali Akbar [mailto:the.apaan@gmail.com]
> Sent: Freitag, 6. Juni 2014 03:44
> Subject: Re: [HACKERS] "pivot aggregation" with a patched intarray
> 
> 2014-06-05 17:18 GMT+07:00 Marc Mamin <M.Mamin@intershop.de>:
> > I'm thinking about adding a final function to my aggregate that would
> > replace zero values with nulls, hence transforming the intarray into
> > a standard int[], possibly with a null bitmap and a lower bound that
> > can be > 1.
> > This will probably degrade the performance considerably, but may
> > reduce the size of the end result for sparse data and not too small
> > integers...
> > Performance should depend greatly on the data distribution and order,
> > as they influence the number of palloc calls.
> > My first tests showed both better and poorer results.
> >
> > My primary target is not better performance in the first place, but
> > getting a pivot structure in an early aggregation stage.
> 
> Usually for pivots I use the crosstab function from tablefunc
> (http://www.postgresql.org/docs/9.4/static/tablefunc.html#AEN158550).
> If your patch doesn't perform better, it's easier to just use
> crosstab. To store the result efficiently, it can be transformed
> into an array manually.


Hello,

crosstab is too restrictive for my use case:
it supports only one row_name column, and you have to know the number of returned columns ("categories") in advance.
What I'm doing is combining a count aggregate with a pivot; this is not the same as what crosstab offers.

> 
> PS: as Michael Paquier said above, it's better if you could send the
> patch in the .patch file format (see:
> https://wiki.postgresql.org/wiki/Working_with_GIT).

I'm first waiting for some positive feedback on the idea itself before eventually submitting this officially.

best regards, 

Marc Mamin