Thread: Strange query plan

Strange query plan

From

"Sorbara, Giorgio (CIOK)"

Date:

28 October 2011, 14:37:42

Dear all,

I am new to PG but I have a solid background on tuning in Oracle and MSSQL.

I have a query coming out from a piece of software from our SW-Stack (I can’t change it) and of course it takes a large amount of time.

The table I am query are inherited (partitioned) and the query is the following (names are changed for policy):

[SELECT]LOG: duration: 427514.480 ms execute <unnamed>: select

"f_table"."fk_column" as "c0",

"f_table"."fk_column" as "c1"

from

"s_schema"."f_table" as "f_table"

where

"f_table"."fk_column" = 'somevalue'

group by

"f_table"."fk_column"

order by

"f_table"."fk_column" ASC

the fk_column/somevalue is the “partition key” and the planner correctly purge the inherited table accordingly.

Records in partitions vary from a min of 30M to max of 160M rows.

'Group (cost=0.00..4674965.80 rows=200 width=17)'

' -> Append (cost=0.00..4360975.94 rows=125595945 width=17)'

' -> Index Scan using f_table_pkey on f_table (cost=0.00..5.64 rows=1 width=58)'

' Index Cond: ((fk_column)::text = 'somevalue'::text)'

' -> Seq Scan on f_table _scxc f_table (cost=0.00..4360970.30 rows=125595944 width=17)'

' Filter: ((fk_column)::text = 'somevalue'::text)'

disabling the seq_scan do not help it forces the index but it takes ages.

In each partition the value of fk_column is just one (being the partition key) and I am expecting that this is checked on the constraint by the planner.

Furthermore I have put an index on fk_column (tried both btree and hash) however the plan is always a seq_scan on the partition, even if the index has only one value?

Regardless the constraint (which I think it should be taken into consideration here) I am expecting that through Index Scan would easily figure out that the value.

In theory there should be no need to access the table here but perform everything on the index object (and of course in the “father” table).

Furthemore I don’t understand why on the main table is using an index scan (on 0 rows).

Yes: Analyzed.

I fear I am missing something on Index usage in Postgres.

cheers,

4 CPU (on VMWare) + 8G of RAM

seq_page_cost = 1.0 # measured on an arbitrary scale

random_page_cost = 2.5 # same scale as above

cpu_tuple_cost = 0.01 # same scale as above

cpu_index_tuple_cost = 0.005 # same scale as above

cpu_operator_cost = 0.0025 # same scale as above

default_statistics_target = 100 # range 1-10000

constraint_exclusion = partition # on, off, or partition

shared_buffers = 960MB # min 128kB

temp_buffers = 256MB # min 800kB

max_prepared_transactions = 100 # zero disables the feature

work_mem = 192MB # min 64kB

maintenance_work_mem = 480MB # min 1MB

Re: Strange query plan

From

"Tomas Vondra"

Date:

28 October 2011, 15:10:14

Hi,

On 28 Říjen 2011, 19:27, Sorbara, Giorgio (CIOK) wrote:
> Dear all,
>
> I am new to PG but I have a solid background on tuning in Oracle and
> MSSQL.
> I have a query coming out from a piece of software from our SW-Stack (I
> can't change it) and of course it takes a large amount of time.
>
> The table I am query are inherited (partitioned) and the query is the
> following (names are changed for policy):

That's a bit ridiculous policy, especially as you've used the same fake
column name (fk_column) for all columns. Does that mean you're reading
just one column, or that there are actually more columns? I'd guess the
first option, as fk_column is referenced twice in the select list ...

> [SELECT]LOG:  duration: 427514.480 ms  execute <unnamed>: select
>             "f_table"."fk_column" as "c0",
>             "f_table"."fk_column" as "c1"
>         from
>             "s_schema"."f_table" as "f_table"
>        where
>             "f_table"."fk_column" = 'somevalue'
>         group by
>             "f_table"."fk_column"
>         order by
>             "f_table"."fk_column" ASC
>
> the fk_column/somevalue is the "partition key" and the planner correctly
> purge the inherited table accordingly.
> Records in partitions vary from a min of 30M to max of 160M rows.
>
> 'Group  (cost=0.00..4674965.80 rows=200 width=17)'
> '  ->  Append  (cost=0.00..4360975.94 rows=125595945 width=17)'
> '        ->  Index Scan using f_table_pkey on f_table  (cost=0.00..5.64
> rows=1 width=58)'
> '              Index Cond: ((fk_column)::text = 'somevalue'::text)'
> '        ->  Seq Scan on f_table _scxc f_table  (cost=0.00..4360970.30
> rows=125595944 width=17)'
> '              Filter: ((fk_column)::text = 'somevalue'::text)'
>
> disabling the seq_scan do not help it forces the index but it takes ages.
>
> In each partition the value of fk_column is just one (being the partition
> key) and I am expecting that this is checked on the constraint by the
> planner.
> Furthermore I have put an index on fk_column (tried both btree and hash)
> however the plan is always a seq_scan on the partition, even if the index
> has only one value?

I'm a bit confused right now. The fk_column is used for partitioning, so
"fk_column = somevalue" actually means "give me all data from exactly one
partition, right?

In that case the above behaviour is expected, because index scan would
mean a lot of random I/O. MVCC in PostgreSQL works very differently,
compared to Oracle for example - the indexes do not contain necessary
visibility info (which transactions can see those records), so whenever
you read a tuple from index, you have to check the data in the actual
table.

So an index scan of the whole table means "read the whole index and the
whole table" and the table is accessed randomly (which kinda defeats the
db cache). So the sequential scan is the expected and perfectly sane.

BTW this should change in 9.2, as there is an index-only scan implementation.

> Regardless the constraint (which I think it should be taken into
> consideration here) I am expecting that through Index Scan would easily
> figure out that the value.
> In theory there should be no need to access the table here but perform
> everything on the index object (and of course in the "father" table).
> Furthemore I don't understand why on the main table is using an index scan
> (on 0 rows).

Not true. PostgreSQL MVCC does not work that - see explanation above.

> I fear I am missing something on Index usage in Postgres.

Yup, seems like that.

Anyway, a few recommendations / questions:

1) Don't post EXPLAIN output, post EXPLAIN ANALYZE if possible.

2) How many live tuples are there? Are the really 125.595.945 live rows,
or is the table bloated? When you do a "select count(*)" from the table,
what number you get?

3) What amount of data are we talking here? 125 million rows could be 10MB
or 10GB, hard to guess.

Tomas

Re: Strange query plan

From

"Sorbara, Giorgio (CIOK)"

Date:

31 October 2011, 10:52:40

Hi Tomas, 

and thank you for your reply.
Inline my comments

> -----Original Message-----
> From: Tomas Vondra [mailto:tv@fuzzy.cz]
> Sent: 28 October 2011 8:10 PM
> To: Sorbara, Giorgio (CIOK)
> Cc: pgsql-performance@postgresql.org
> Subject: Re: [PERFORM] Strange query plan
> 
> Hi,
> 
> On 28 Říjen 2011, 19:27, Sorbara, Giorgio (CIOK) wrote:
> > Dear all,
> >
> > I am new to PG but I have a solid background on tuning in Oracle and
> > MSSQL.
> > I have a query coming out from a piece of software from our SW-Stack
> (I
> > can't change it) and of course it takes a large amount of time.
> >
> > The table I am query are inherited (partitioned) and the query is the
> > following (names are changed for policy):
> 
> That's a bit ridiculous policy, especially as you've used the same fake
> column name (fk_column) for all columns. Does that mean you're reading
> just one column, or that there are actually more columns? I'd guess the
> first option, as fk_column is referenced twice in the select list ...

Sorry but that is the exact query (I'll ignore the policy and post the exact columns from now on).
Just to be clear is a query generated by Mondrian (ROLAP engine) for a degenerated dimension and it looks like this:

select
"f_suipy"."fk_theme" as "c0",
"f_suipy"."fk_theme" as "c1"
from
"gaez"."f_suipy" as "f_suipy"
where
"f_suipy"."fk_theme" = 'main_py_six_scxc'
group by
"f_suipy"."fk_theme"
order by
"f_suipy"."fk_theme" ASC;

we have a total of 18 partitions.

> >
> > the fk_column/somevalue is the "partition key" and the planner
> correctly
> > purge the inherited table accordingly.
> > Records in partitions vary from a min of 30M to max of 160M rows.
> >
> > 'Group  (cost=0.00..4674965.80 rows=200 width=17)'
> > '  ->  Append  (cost=0.00..4360975.94 rows=125595945 width=17)'
> > '        ->  Index Scan using f_table_pkey on f_table
> (cost=0.00..5.64
> > rows=1 width=58)'
> > '              Index Cond: ((fk_column)::text = 'somevalue'::text)'
> > '        ->  Seq Scan on f_table _scxc f_table
> (cost=0.00..4360970.30
> > rows=125595944 width=17)'
> > '              Filter: ((fk_column)::text = 'somevalue'::text)'
> >
> > disabling the seq_scan do not help it forces the index but it takes
> ages.
> >
> > In each partition the value of fk_column is just one (being the
> partition
> > key) and I am expecting that this is checked on the constraint by the
> > planner.
> > Furthermore I have put an index on fk_column (tried both btree and
> hash)
> > however the plan is always a seq_scan on the partition, even if the
> index
> > has only one value?
> 
> I'm a bit confused right now. The fk_column is used for partitioning,
> so
> "fk_column = somevalue" actually means "give me all data from exactly
> one
> partition, right?

Yes, but there is an enforced constraint telling me that column can host only one value.

> 
> In that case the above behaviour is expected, because index scan would
> mean a lot of random I/O. MVCC in PostgreSQL works very differently,
> compared to Oracle for example - the indexes do not contain necessary
> visibility info (which transactions can see those records), so whenever
> you read a tuple from index, you have to check the data in the actual
> table.
> 
> So an index scan of the whole table means "read the whole index and the
> whole table" and the table is accessed randomly (which kinda defeats
> the
> db cache). So the sequential scan is the expected and perfectly sane.
> 
> BTW this should change in 9.2, as there is an index-only scan
> implementation.
> 
> > Regardless the constraint (which I think it should be taken into
> > consideration here) I am expecting that through Index Scan would
> easily
> > figure out that the value.
> > In theory there should be no need to access the table here but
> perform
> > everything on the index object (and of course in the "father" table).
> > Furthemore I don't understand why on the main table is using an index
> scan
> > (on 0 rows).
> 
> Not true. PostgreSQL MVCC does not work that - see explanation above.
> 
> > I fear I am missing something on Index usage in Postgres.
> 
> Yup, seems like that.

Ok... so since the index is not version aware I have to check the version in the "data" segment to be sure I am
pointingat the right value.
 
I can see now there is no point at using this partitioning scheme... it was sort of perfect to me as I could drive the
partitioneasily with a degenerated dimension. Except for this small issue (waiting more than 10 min is not an option).
 

Furthermore I am afraid that even partial indexes won't work.

> 
> Anyway, a few recommendations / questions:
> 
> 1) Don't post EXPLAIN output, post EXPLAIN ANALYZE if possible.

Group  (cost=0.00..4674965.80 rows=200 width=17) (actual time=13.375..550943.592 rows=1 loops=1)
   ->  Append  (cost=0.00..4360975.94 rows=125595945 width=17) (actual time=13.373..524324.817 rows=125595932 loops=1)
         ->  Index Scan using f_suipy_pkey on f_suipy  (cost=0.00..5.64 rows=1 width=58) (actual time=0.019..0.019
rows=0loops=1)
 
               Index Cond: ((fk_theme)::text = 'main_py_six_scxc'::text)
         ->  Seq Scan on f_suipy_main_py_six_scxc f_suipy  (cost=0.00..4360970.30 rows=125595944 width=17) (actual
time=13.352..495259.117rows=125595932 loops=1)
 
               Filter: ((fk_theme)::text = 'main_py_six_scxc'::text)
 Total runtime: 550943.699 ms
> 
> 2) How many live tuples are there? Are the really 125.595.945 live
> rows,
> or is the table bloated? When you do a "select count(*)" from the
> table,
> what number you get?

SELECT count(0)
  FROM gaez.f_suipy_main_py_six_scxc;

125595932

> 
> 3) What amount of data are we talking here? 125 million rows could be
> 10MB
> or 10GB, hard to guess.

pg_size_pretty tells me: 'f_suipy_main_py_six_scxc';'21 GB'

> 
> Tomas

g

ps: excellent benchmarking work on your site :)

Re: Strange query plan

From

Robert Haas

Date:

04 November 2011, 13:07:00

On Mon, Oct 31, 2011 at 9:52 AM, Sorbara, Giorgio (CIOK)
<Giorgio.Sorbara@fao.org> wrote:
> Group  (cost=0.00..4674965.80 rows=200 width=17) (actual time=13.375..550943.592 rows=1 loops=1)
>   ->  Append  (cost=0.00..4360975.94 rows=125595945 width=17) (actual time=13.373..524324.817 rows=125595932 loops=1)
>         ->  Index Scan using f_suipy_pkey on f_suipy  (cost=0.00..5.64 rows=1 width=58) (actual time=0.019..0.019
rows=0loops=1) 
>               Index Cond: ((fk_theme)::text = 'main_py_six_scxc'::text)
>         ->  Seq Scan on f_suipy_main_py_six_scxc f_suipy  (cost=0.00..4360970.30 rows=125595944 width=17) (actual
time=13.352..495259.117rows=125595932 loops=1) 
>               Filter: ((fk_theme)::text = 'main_py_six_scxc'::text)
>  Total runtime: 550943.699 ms

How fast do you expect this to run?  It's aggregating 125 million
rows, so that's going to take some time no matter how you slice it.
Unless I'm misreading this, it's actually taking only about 4
microseconds per row, which does not obviously suck.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company

Re: Strange query plan

From

"Sorbara, Giorgio (CIOK)"

Date:

04 November 2011, 13:15:09

> -----Original Message-----
> From: Robert Haas [mailto:robertmhaas@gmail.com]
> Sent: 04 November 2011 5:07 PM
> To: Sorbara, Giorgio (CIOK)
> Cc: Tomas Vondra; pgsql-performance@postgresql.org
> Subject: Re: [PERFORM] Strange query plan
>
> On Mon, Oct 31, 2011 at 9:52 AM, Sorbara, Giorgio (CIOK)
> <Giorgio.Sorbara@fao.org> wrote:
> > Group  (cost=0.00..4674965.80 rows=200 width=17) (actual
> time=13.375..550943.592 rows=1 loops=1)
> >   ->  Append  (cost=0.00..4360975.94 rows=125595945 width=17) (actual
> time=13.373..524324.817 rows=125595932 loops=1)
> >         ->  Index Scan using f_suipy_pkey on f_suipy
>  (cost=0.00..5.64 rows=1 width=58) (actual time=0.019..0.019 rows=0
> loops=1)
> >               Index Cond: ((fk_theme)::text =
> 'main_py_six_scxc'::text)
> >         ->  Seq Scan on f_suipy_main_py_six_scxc f_suipy
>  (cost=0.00..4360970.30 rows=125595944 width=17) (actual
> time=13.352..495259.117 rows=125595932 loops=1)
> >               Filter: ((fk_theme)::text = 'main_py_six_scxc'::text)
> >  Total runtime: 550943.699 ms
>
> How fast do you expect this to run?  It's aggregating 125 million
> rows, so that's going to take some time no matter how you slice it.
> Unless I'm misreading this, it's actually taking only about 4
> microseconds per row, which does not obviously suck.

Well, the problem is not how fast it takes to process one row rather the best query plan I am supposed to get. I don't
meanthe planer is wrong but I was expecting a feature is not there (yet). 
We don't have pure index scan. Fair enough. so I have approached the problem in a different way: getting rid of the
degenerateddimensions and exploiting "useless" dimension table. 
It's a workaround but it actually seems to work :) now I have a ~350 millions fact table and no partition but I am
happyto get the data I want in 1 sec or less. 


>
> --
> Robert Haas
> EnterpriseDB: http://www.enterprisedb.com
> The Enterprise PostgreSQL Company

Re: Strange query plan

From

Robert Haas

Date:

04 November 2011, 15:50:06

On Fri, Nov 4, 2011 at 12:14 PM, Sorbara, Giorgio (CIOK)
<Giorgio.Sorbara@fao.org> wrote:
>> How fast do you expect this to run?  It's aggregating 125 million
>> rows, so that's going to take some time no matter how you slice it.
>> Unless I'm misreading this, it's actually taking only about 4
>> microseconds per row, which does not obviously suck.
>
> Well, the problem is not how fast it takes to process one row rather the best query plan I am supposed to get. I
don'tmean the planer is wrong but I was expecting a feature is not there (yet). 
> We don't have pure index scan. Fair enough. so I have approached the problem in a different way: getting rid of the
degenerateddimensions and exploiting "useless" dimension table. 
> It's a workaround but it actually seems to work :) now I have a ~350 millions fact table and no partition but I am
happyto get the data I want in 1 sec or less. 

Am I misreading the EXPLAIN ANALYZE output?  I'm reading that to say
that there were 125 million rows in the table that matched your filter
condition.  If that's correct, I don't think index-only scans (which
will be in 9.2) are going to help you much - it might be faster, but
it's definitely not going to be anything like instantaneous.

On the flip side, if I *am* misreading the output and the number of
rows needed to compute the aggregate is actually some very small
number, then you ought to be getting an index scan even in older
versions.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company