Re: Strange query plan - Mailing list pgsql-performance

From Tomas Vondra
Subject Re: Strange query plan
Date
Msg-id 31fb57c820a9a0f997fc8ed456f5f109.squirrel@sq.gransy.com
Whole thread Raw
In response to Strange query plan  ("Sorbara, Giorgio (CIOK)" <Giorgio.Sorbara@fao.org>)
Responses Re: Strange query plan  ("Sorbara, Giorgio (CIOK)" <Giorgio.Sorbara@fao.org>)
List pgsql-performance
Hi,

On 28 Říjen 2011, 19:27, Sorbara, Giorgio (CIOK) wrote:
> Dear all,
>
> I am new to PG but I have a solid background on tuning in Oracle and
> MSSQL.
> I have a query coming out from a piece of software from our SW-Stack (I
> can't change it) and of course it takes a large amount of time.
>
> The table I am query are inherited (partitioned) and the query is the
> following (names are changed for policy):

That's a bit ridiculous policy, especially as you've used the same fake
column name (fk_column) for all columns. Does that mean you're reading
just one column, or that there are actually more columns? I'd guess the
first option, as fk_column is referenced twice in the select list ...

> [SELECT]LOG:  duration: 427514.480 ms  execute <unnamed>: select
>             "f_table"."fk_column" as "c0",
>             "f_table"."fk_column" as "c1"
>         from
>             "s_schema"."f_table" as "f_table"
>        where
>             "f_table"."fk_column" = 'somevalue'
>         group by
>             "f_table"."fk_column"
>         order by
>             "f_table"."fk_column" ASC
>
> the fk_column/somevalue is the "partition key" and the planner correctly
> purge the inherited table accordingly.
> Records in partitions vary from a min of 30M to max of 160M rows.
>
> 'Group  (cost=0.00..4674965.80 rows=200 width=17)'
> '  ->  Append  (cost=0.00..4360975.94 rows=125595945 width=17)'
> '        ->  Index Scan using f_table_pkey on f_table  (cost=0.00..5.64
> rows=1 width=58)'
> '              Index Cond: ((fk_column)::text = 'somevalue'::text)'
> '        ->  Seq Scan on f_table _scxc f_table  (cost=0.00..4360970.30
> rows=125595944 width=17)'
> '              Filter: ((fk_column)::text = 'somevalue'::text)'
>
> disabling the seq_scan do not help it forces the index but it takes ages.
>
> In each partition the value of fk_column is just one (being the partition
> key) and I am expecting that this is checked on the constraint by the
> planner.
> Furthermore I have put an index on fk_column (tried both btree and hash)
> however the plan is always a seq_scan on the partition, even if the index
> has only one value?

I'm a bit confused right now. The fk_column is used for partitioning, so
"fk_column = somevalue" actually means "give me all data from exactly one
partition, right?

In that case the above behaviour is expected, because index scan would
mean a lot of random I/O. MVCC in PostgreSQL works very differently,
compared to Oracle for example - the indexes do not contain necessary
visibility info (which transactions can see those records), so whenever
you read a tuple from index, you have to check the data in the actual
table.

So an index scan of the whole table means "read the whole index and the
whole table" and the table is accessed randomly (which kinda defeats the
db cache). So the sequential scan is the expected and perfectly sane.

BTW this should change in 9.2, as there is an index-only scan implementation.

> Regardless the constraint (which I think it should be taken into
> consideration here) I am expecting that through Index Scan would easily
> figure out that the value.
> In theory there should be no need to access the table here but perform
> everything on the index object (and of course in the "father" table).
> Furthemore I don't understand why on the main table is using an index scan
> (on 0 rows).

Not true. PostgreSQL MVCC does not work that - see explanation above.

> I fear I am missing something on Index usage in Postgres.

Yup, seems like that.

Anyway, a few recommendations / questions:

1) Don't post EXPLAIN output, post EXPLAIN ANALYZE if possible.

2) How many live tuples are there? Are the really 125.595.945 live rows,
or is the table bloated? When you do a "select count(*)" from the table,
what number you get?

3) What amount of data are we talking here? 125 million rows could be 10MB
or 10GB, hard to guess.

Tomas


pgsql-performance by date:

Previous
From: "Sorbara, Giorgio (CIOK)"
Date:
Subject: Strange query plan
Next
From: "Tomas Vondra"
Date:
Subject: Re: WAL in RAM