Re: Index only scan - Mailing list pgsql-general

From Ondrej Ivanič
Subject Re: Index only scan
Date
Msg-id CAM6mieKhkwhgkzsO2aqe51v_HOHGvB89g9BEsE_fypOBLhSm=w@mail.gmail.com
Whole thread Raw
In response to Index only scan  (Lars Helge Øverland <larshelge@gmail.com>)
List pgsql-general
Hi,

On 10 October 2012 23:03, Lars Helge Øverland <larshelge@gmail.com> wrote:
> We are now in the process of designing a new component for analytics
> and this feature got me thinking we could utilize postgres over other
> alternatives like column-oriented databases. Basically we will have a
> wide, denormalized table with 20+ columns with relatively low
> cardinality. Typically we will have queries which sums a fact column
> based on where/group by clauses on several dimension columns (standard
> data warehouse stuff). An example would be "select a, b, c, sum(d)
> from analytics where a=1 and b=2 group by a,b,c";
>
> Finally, is there anyone else who are using postgres for this purpose
> and have some good tips to share in order to achieve good performance,
> including index strategies, beyond the standard config best practices?

yes, we had fact table which has around 250 columns and 250mil rows.
The question is if you can partition your data set. For example,
monthly partition. This keeps indexes small but all queries must be
constrained by the same column as is used for partitioning (ie.
monthly partitions -> every query should have "datetime between ...
and ...")

From my experience postgres is not good with large group by queries.
For example, your query:
select a, b, c, sum(d) from analytics where a=1 and b=2 group by a,b,c

could be executed over multiple connections:
insert into t select select a, b, c, sum(d) as d from analytics where
c >= val1 and c < val2 and a=1 and b=2 group by a,b,c
insert into t select select a, b, c, sum(d) as d from analytics where
c >= val2 and c < val3 and a=1 and b=2 group by a,b,c
...
insert into t select select a, b, c, sum(d) as d from analytics where
c >= valN-1 and c < valN and a=1 and b=2 group by a,b,c

and then get the final result:
select a, b, c, sum(d) from t group by a,b,c

You can use pgpool-II parallel query feature instead of manual slicing.

--
Ondrej Ivanic
(ondrej.ivanic@gmail.com)
(http://www.linkedin.com/in/ondrejivanic)


pgsql-general by date:

Previous
From: Gavin Flower
Date:
Subject: Re: Index only scan
Next
From: Greg Sabino Mullane
Date:
Subject: Re: Planner chooses multi-column index in 9.2 when maybe it should not