Home > mailing lists

Re: [GENERAL] Performance with high correlation in group by on PK - Mailing list pgsql-general

From	Jeff Janes
Subject	Re: [GENERAL] Performance with high correlation in group by on PK
Date	September 8, 2017 01:23:09
Msg-id	CAMkU=1zUOM5GOXZKvvFbtgU53s7U5vBAwKrkJTzzvZFcbzjh-g@mail.gmail.com Whole thread Raw
In response to	Re: [GENERAL] Performance with high correlation in group by on PK (Alban Hertroys <haramrae@gmail.com>)
Responses	Re: [GENERAL] Performance with high correlation in group by on PK
List	pgsql-general

Tree view

On Tue, Aug 29, 2017 at 1:20 AM, Alban Hertroys <haramrae@gmail.com> wrote:

On 28 August 2017 at 21:32, Jeff Janes <jeff.janes@gmail.com> wrote:
> On Mon, Aug 28, 2017 at 5:22 AM, Alban Hertroys <haramrae@gmail.com> wrote:
>>
>> Hi all,
>>
>> It's been a while since I actually got to use PG for anything serious,
>> but we're finally doing some experimentation @work now to see if it is
>> suitable for our datawarehouse. So far it's been doing well, but there
>> is a particular type of query I run into that I expect we will
>> frequently use and that's choosing a sequential scan - and I can't
>> fathom why.
>>
>> This is on:
>>
>>
>> The query in question is:
>> select "VBAK_MANDT", max("VBAK_VBELN")
>> from staging.etl00001_vbak
>> group by "VBAK_MANDT";
>>
>> This is the header-table for another detail table, and in this case
>> we're already seeing a seqscan. The thing is, there are 15M rows in
>> the table (disk usage is 15GB), while the PK is on ("VBAK_MANDT",
>> "VBAK_VBELN") with very few distinct values for "VBAK_MANDT" (in fact,
>> we only have 1 at the moment!).
>
>
> You need an "index skip-scan" or "loose index scan". PostgreSQL doesn't
> currently detect and implement them automatically, but you can use a
> recursive CTE to get it to work. There are some examples at
> https://wiki.postgresql.org/wiki/Loose_indexscan

Thanks Jeff, that's an interesting approach. It looks very similar to
correlated subqueries.

Unfortunately, it doesn't seem to help with my issue. The CTE is
indeed fast, but when querying the results from the 2nd level ov the
PK with the CTE results, I'm back at a seqscan on pdw00002_vbak again.

Something like this works:

create table foo as select trunc(random()*5) as col1, random() as col2 from generate_series(1,100000000);

create index on foo (col1, col2);

vacuum analyze foo;

with recursive t as (

select * from (select col1, col2 from foo order by col1 desc, col2 desc limit 1) asdfsaf

union all

select

(select col1 from foo where foo.col1 < t.col1 order by col1 desc, col2 desc limit 1) as col1,

(select col2 from foo where foo.col1 < t.col1 order by col1 desc, col2 desc limit 1) as col2

from t where t.col1 is not null

)

select * from t where t is not null;

It is pretty ugly that you need one subquery in the select list for each column to be returned. Maybe someone can find a way to avoid that part. I tried using lateral joins to get around it, but couldn't make that work.

Cheers,

Jeff

pgsql-general by date:

From: Ron Johnson
Date: 08 September 2017, 01:10:13
Subject: Re: [GENERAL] WAL & ready files retained after turning off logshipping

From: Michael Paquier
Date: 08 September 2017, 02:04:30
Subject: Re: [GENERAL] pg_rewind issue

Re: [GENERAL] Performance with high correlation in group by on PK - Mailing list pgsql-general

Previous

Next