Thread: Index only scan

Index only scan

From

Lars Helge Øverland

Date:

10 October 2012, 15:04:00

Hi all,

first of all thanks for the great new "index only scan" feature in
9.2. We have managed to adapt our app (dhis2.org) to take advantage of
it and it really speeds up several queries significantly.

We are now in the process of designing a new component for analytics
and this feature got me thinking we could utilize postgres over other
alternatives like column-oriented databases. Basically we will have a
wide, denormalized table with 20+ columns with relatively low
cardinality. Typically we will have queries which sums a fact column
based on where/group by clauses on several dimension columns (standard
data warehouse stuff). An example would be "select a, b, c, sum(d)
from analytics where a=1 and b=2 group by a,b,c";

Now my initial idea was to simply put a single index on all of those
columns, in the hope that "index only scans" would kick in. It seems
this is not the case, as strictly one (single or multi-column) index
is required to enable index only scans for a query.

My question is: Would it be feasible and/or possible to implement
index only scans in a way that it could take advantage of several,
single-column indexes? For example, a query spanning columns a, b, c
could take advantage of 3 single-column indexes put on columns a, b,
c.

Finally, is there anyone else who are using postgres for this purpose
and have some good tips to share in order to achieve good performance,
including index strategies, beyond the standard config best practices?


best regards,

Lars Helge Øverland

Re: Index only scan

From

Gavin Flower

Date:

11 October 2012, 02:26:12

On 11/10/12 01:03, Lars Helge Øverland wrote:

Hi all,

first of all thanks for the great new "index only scan" feature in
9.2. We have managed to adapt our app (dhis2.org) to take advantage of
it and it really speeds up several queries significantly.

We are now in the process of designing a new component for analytics
and this feature got me thinking we could utilize postgres over other
alternatives like column-oriented databases. Basically we will have a
wide, denormalized table with 20+ columns with relatively low
cardinality. Typically we will have queries which sums a fact column
based on where/group by clauses on several dimension columns (standard
data warehouse stuff). An example would be "select a, b, c, sum(d)
from analytics where a=1 and b=2 group by a,b,c";

Now my initial idea was to simply put a single index on all of those
columns, in the hope that "index only scans" would kick in. It seems
this is not the case, as strictly one (single or multi-column) index
is required to enable index only scans for a query.

My question is: Would it be feasible and/or possible to implement
index only scans in a way that it could take advantage of several,
single-column indexes? For example, a query spanning columns a, b, c
could take advantage of 3 single-column indexes put on columns a, b,
c.

Finally, is there anyone else who are using postgres for this purpose
and have some good tips to share in order to achieve good performance,
including index strategies, beyond the standard config best practices?


best regards,

Lars Helge Øverland

Index only scans do use multiple indexes of single fields where appropriate. Here the planner determined it only needed to scan 2 of the 3 relevant single field indexes.

Cheers,
Gavin

-- index_only_scan_001.sqlDROP TABLE IF EXISTS iostab;CREATE TABLE iostab( id int PRIMARY KEY, a int, b int, c int, d int, z text);INSERT INTO iostab (id, a, b, c, d, z) VALUES (generate_series(1, 1000000), 1000 * random(), 1000 * random(), 1000 * random(), 1000 * random(), 'qqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqqq' || random()); CREATE INDEX ON iostab (a);CREATE INDEX ON iostab (b);CREATE INDEX ON iostab (c);CREATE INDEX ON iostab (d); ANALYZE VERBOSE iostab;EXPLAINSELECT i.*FROM iostab iWHERE i.a = 2 AND i.b = 7 AND i.c = 4/**/;/**///////////////DROP TABLEpsql:index_only_scan_001.sql:14: NOTICE: CREATE TABLE / PRIMARY KEY will create implicit index "iostab_pkey" for table "iostab"CREATE TABLEINSERT 0 1000000CREATE INDEXCREATE INDEXCREATE INDEXCREATE INDEXpsql:index_only_scan_001.sql:30: INFO: analyzing "public.iostab"psql:index_only_scan_001.sql:30: INFO: "iostab": scanned 15385 of 15385 pages, containing 1000000 live rows and 0 dead rows; 30000 rows in sample, 1000000 estimated total rowsANALYZE QUERY PLAN ------------------------------------------------------------------------------------ Bitmap Heap Scan on iostab i (cost=41.37..45.39 rows=1 width=90) Recheck Cond: ((b = 7) AND (a = 2)) Filter: (c = 4) -> BitmapAnd (cost=41.37..41.37 rows=1 width=0) -> Bitmap Index Scan on iostab_b_idx (cost=0.00..20.55 rows=960 width=0) Index Cond: (b = 7) -> Bitmap Index Scan on iostab_a_idx (cost=0.00..20.57 rows=963 width=0) Index Cond: (a = 2)(8 rows)gavin=>

Re: Index only scan

From

Ondrej Ivanič

Date:

11 October 2012, 02:36:23

Hi,

On 10 October 2012 23:03, Lars Helge Øverland <larshelge@gmail.com> wrote:
> We are now in the process of designing a new component for analytics
> and this feature got me thinking we could utilize postgres over other
> alternatives like column-oriented databases. Basically we will have a
> wide, denormalized table with 20+ columns with relatively low
> cardinality. Typically we will have queries which sums a fact column
> based on where/group by clauses on several dimension columns (standard
> data warehouse stuff). An example would be "select a, b, c, sum(d)
> from analytics where a=1 and b=2 group by a,b,c";
>
> Finally, is there anyone else who are using postgres for this purpose
> and have some good tips to share in order to achieve good performance,
> including index strategies, beyond the standard config best practices?

yes, we had fact table which has around 250 columns and 250mil rows.
The question is if you can partition your data set. For example,
monthly partition. This keeps indexes small but all queries must be
constrained by the same column as is used for partitioning (ie.
monthly partitions -> every query should have "datetime between ...
and ...")

From my experience postgres is not good with large group by queries.
For example, your query:
select a, b, c, sum(d) from analytics where a=1 and b=2 group by a,b,c

could be executed over multiple connections:
insert into t select select a, b, c, sum(d) as d from analytics where
c >= val1 and c < val2 and a=1 and b=2 group by a,b,c
insert into t select select a, b, c, sum(d) as d from analytics where
c >= val2 and c < val3 and a=1 and b=2 group by a,b,c
...
insert into t select select a, b, c, sum(d) as d from analytics where
c >= valN-1 and c < valN and a=1 and b=2 group by a,b,c

and then get the final result:
select a, b, c, sum(d) from t group by a,b,c

You can use pgpool-II parallel query feature instead of manual slicing.

--
Ondrej Ivanic
(ondrej.ivanic@gmail.com)
(http://www.linkedin.com/in/ondrejivanic)

Re: Index only scan

From

Tom Lane

Date:

11 October 2012, 02:41:54

Gavin Flower <GavinFlower@archidevsys.co.nz> writes:
> On 11/10/12 01:03, Lars Helge �verland wrote:
>> My question is: Would it be feasible and/or possible to implement
>> index only scans in a way that it could take advantage of several,
>> single-column indexes? For example, a query spanning columns a, b, c
>> could take advantage of 3 single-column indexes put on columns a, b,
>> c.

> Index only scans do use multiple indexes of single fields where
> appropriate.  Here the planner determined it only needed to scan 2 of
> the 3 relevant single field indexes.

But your example isn't an index-only scan ... it's a plain old bitmap
scan, and so it does touch the heap.

The difficulty with what Lars proposes is that there's no way to scan
the different indexes "in sync" --- each one will be ordered according
to its own column order.  In principle I guess we could read out the
index data and do a join using the ctid's, but it's far from clear that
such a thing would be worth the trouble.

            regards, tom lane

Re: Index only scan

From

Gavin Flower

Date:

11 October 2012, 03:47:36

On 11/10/12 12:41, Tom Lane wrote:
> Gavin Flower <GavinFlower@archidevsys.co.nz> writes:
>> On 11/10/12 01:03, Lars Helge Øverland wrote:
>>> My question is: Would it be feasible and/or possible to implement
>>> index only scans in a way that it could take advantage of several,
>>> single-column indexes? For example, a query spanning columns a, b, c
>>> could take advantage of 3 single-column indexes put on columns a, b,
>>> c.
>> Index only scans do use multiple indexes of single fields where
>> appropriate.  Here the planner determined it only needed to scan 2 of
>> the 3 relevant single field indexes.
> But your example isn't an index-only scan ... it's a plain old bitmap
> scan, and so it does touch the heap.
>
> The difficulty with what Lars proposes is that there's no way to scan
> the different indexes "in sync" --- each one will be ordered according
> to its own column order.  In principle I guess we could read out the
> index data and do a join using the ctid's, but it's far from clear that
> such a thing would be worth the trouble.
>
>             regards, tom lane
Thanks for the correction!

Cheers,
Gavin

Re: Index only scan

From

Lars Helge Øverland

Date:

11 October 2012, 13:52:36

Thanks all for the help and insights. I will continue to read up on
the details of partitioning and pgpool.

best regards,

Lars



On Thu, Oct 11, 2012 at 2:47 AM, Gavin Flower
<GavinFlower@archidevsys.co.nz> wrote:
> On 11/10/12 12:41, Tom Lane wrote:
>>
>> Gavin Flower <GavinFlower@archidevsys.co.nz> writes:
>>>
>>> On 11/10/12 01:03, Lars Helge Øverland wrote:
>>>>
>>>> My question is: Would it be feasible and/or possible to implement
>>>> index only scans in a way that it could take advantage of several,
>>>> single-column indexes? For example, a query spanning columns a, b, c
>>>> could take advantage of 3 single-column indexes put on columns a, b,
>>>> c.
>>>
>>> Index only scans do use multiple indexes of single fields where
>>> appropriate.  Here the planner determined it only needed to scan 2 of
>>> the 3 relevant single field indexes.
>>
>> But your example isn't an index-only scan ... it's a plain old bitmap
>> scan, and so it does touch the heap.
>>
>> The difficulty with what Lars proposes is that there's no way to scan
>> the different indexes "in sync" --- each one will be ordered according
>> to its own column order.  In principle I guess we could read out the
>> index data and do a join using the ctid's, but it's far from clear that
>> such a thing would be worth the trouble.
>>
>>                         regards, tom lane
>
> Thanks for the correction!
>
> Cheers,
> Gavin