Thread: Query planner question

Query planner question

From
Soni M
Date:
Hi Everyone,

I have this query :

select t.ticket_id ,
tb.transmission_id
from ticket t,
transmission_base tb
where t.latest_transmission_id = tb.transmission_id
and t.ticket_number = tb.ticket_number
and tb.parse_date > ('2014-07-31');


Indexes on ticket :
    "ticket_pkey" PRIMARY KEY, btree (ticket_id) CLUSTER
    "ticket_by_latest_transmission" btree (latest_transmission_id)
    "ticket_by_ticket_number" btree (ticket_number)

This query only returns some portions of rows from ticket table.
The question is, Why does postgres need to get all the rows from ticket table in order to complete this query?
Can't postgres use indexes to get only needed rows on ticket table?

I try set seqscan to off, but still index scan try to get all rows on ticket table.
Here's the execution plan : http://explain.depesz.com/s/abH2

Thanks

--
Regards,

Soni Maula Harriz

Re: Query planner question

From
David G Johnston
Date:
Soni M wrote
> Hi Everyone,
>
> I have this query :
>
> select t.ticket_id ,
> tb.transmission_id
> from ticket t,
> transmission_base tb
> where t.latest_transmission_id = tb.transmission_id
> and t.ticket_number = tb.ticket_number
> and tb.parse_date > ('2014-07-31');
>
> Execution plan: http://explain.depesz.com/s/YAak
>
> Indexes on ticket :
>     "ticket_pkey" PRIMARY KEY, btree (ticket_id) CLUSTER
>     "ticket_by_latest_transmission" btree (latest_transmission_id)
>     "ticket_by_ticket_number" btree (ticket_number)
>
> This query only returns some portions of rows from ticket table.
> The question is, Why does postgres need to get all the rows from ticket
> table in order to complete this query?
> Can't postgres use indexes to get only needed rows on ticket table?
>
> I try set seqscan to off, but still index scan try to get all rows on
> ticket table.
> Here's the execution plan : http://explain.depesz.com/s/abH2

Short answer: you haven't defined "(latest_transmission_id, ticket_number)"
as being a foreign key onto the transmission_base table yet you seem to want
it to act like one.

Because of this failure the planner considers the following:

Nested Looping over 380,000 records is going to suck so it tries some
advanced "merge/join" techniques to try and speed things up.  In any such
alternative the entire ticket table needs to be considered since there is no
constraint provided for that table - the only constraint in on
transmission_base and it rightly is using an index to find records matching
the where clause.

Since ticket_number and latest_transmission_id are found in separate indexes
I do not believe the planner can make use of an Index Only scan to fulfill
the join so each index lookup would require a corresponding heap lookup
which means extra work compared to just sequentially scanning the heap in
the first place.  Since it is going to hit the entire thing in either case
the sequential scan is the logical choice for it to make.

Others will correct any factual mistakes I may have made - I am theorizing
here and do not understand the planner sufficient well to be 100% certain
that an FK definition will solve the problem.

David J.





--
View this message in context: http://postgresql.1045698.n5.nabble.com/Query-planner-question-tp5815659p5815661.html
Sent from the PostgreSQL - general mailing list archive at Nabble.com.


Re: Query planner question

From
Soni M
Date:



On Thu, Aug 21, 2014 at 9:26 AM, David G Johnston <david.g.johnston@gmail.com> wrote:
Soni M wrote
> Hi Everyone,
>
> I have this query :
>
> select t.ticket_id ,
> tb.transmission_id
> from ticket t,
> transmission_base tb
> where t.latest_transmission_id = tb.transmission_id
> and t.ticket_number = tb.ticket_number
> and tb.parse_date > ('2014-07-31');
>
> Execution plan: http://explain.depesz.com/s/YAak
>
> Indexes on ticket :
>     "ticket_pkey" PRIMARY KEY, btree (ticket_id) CLUSTER
>     "ticket_by_latest_transmission" btree (latest_transmission_id)
>     "ticket_by_ticket_number" btree (ticket_number)
>
> This query only returns some portions of rows from ticket table.
> The question is, Why does postgres need to get all the rows from ticket
> table in order to complete this query?
> Can't postgres use indexes to get only needed rows on ticket table?
>
> I try set seqscan to off, but still index scan try to get all rows on
> ticket table.
> Here's the execution plan : http://explain.depesz.com/s/abH2

Short answer: you haven't defined "(latest_transmission_id, ticket_number)"
as being a foreign key onto the transmission_base table yet you seem to want
it to act like one.

Currently we have only latest_transmission_id as FK, described here :
TABLE "ticket" CONSTRAINT "fkcbe86b0c6ddac9e" FOREIGN KEY (latest_transmission_id) REFERENCES transmission_base(transmission_id)

Change the query to include only FK still result the same:
explain select t.ticket_id ,
tb.transmission_id
from ticket t,
transmission_base tb
where t.latest_transmission_id = tb.transmission_id
and tb.parse_date > ('2014-07-31');
                                                            QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------
 Hash Join  (cost=113928.06..2583606.96 rows=200338 width=8)
   Hash Cond: (t.latest_transmission_id = tb.transmission_id)
   ->  Seq Scan on ticket t  (cost=0.00..1767767.26 rows=69990826 width=8)
   ->  Hash  (cost=108923.38..108923.38 rows=400374 width=4)
         ->  Index Scan using transmission_base_by_parse_date on transmission_base tb  (cost=0.00..108923.38 rows=400374 width=4)
               Index Cond: (parse_date > '2014-07-31 00:00:00'::timestamp without time zone)
(6 rows)

I've googling this one, it seems that's how hash join works. For hash join operation, the join predicate cannot be used for the index scan, only independent predicate can be used in index scan. http://use-the-index-luke.com/sql/join/hash-join-partial-objects

 
Because of this failure the planner considers the following:

Nested Looping over 380,000 records is going to suck so it tries some
advanced "merge/join" techniques to try and speed things up.  In any such
alternative the entire ticket table needs to be considered since there is no
constraint provided for that table - the only constraint in on
transmission_base and it rightly is using an index to find records matching
the where clause.

Since ticket_number and latest_transmission_id are found in separate indexes
I do not believe the planner can make use of an Index Only scan to fulfill
the join so each index lookup would require a corresponding heap lookup
which means extra work compared to just sequentially scanning the heap in
the first place.  Since it is going to hit the entire thing in either case
the sequential scan is the logical choice for it to make.

Others will correct any factual mistakes I may have made - I am theorizing
here and do not understand the planner sufficient well to be 100% certain
that an FK definition will solve the problem.

David J.





--
View this message in context: http://postgresql.1045698.n5.nabble.com/Query-planner-question-tp5815659p5815661.html
Sent from the PostgreSQL - general mailing list archive at Nabble.com.


--
Sent via pgsql-general mailing list (pgsql-general@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-general



--
Regards,

Soni Maula Harriz

Re: Query planner question

From
Alban Hertroys
Date:
On 22 August 2014 14:26, Soni M <diptatapa@gmail.com> wrote:
> Currently we have only latest_transmission_id as FK, described here :
> TABLE "ticket" CONSTRAINT "fkcbe86b0c6ddac9e" FOREIGN KEY
> (latest_transmission_id) REFERENCES transmission_base(transmission_id)
>
> Change the query to include only FK still result the same:
> explain select t.ticket_id ,
> tb.transmission_id
> from ticket t,
> transmission_base tb
> where t.latest_transmission_id = tb.transmission_id
> and tb.parse_date > ('2014-07-31');
>                                                             QUERY PLAN
>
----------------------------------------------------------------------------------------------------------------------------------
>  Hash Join  (cost=113928.06..2583606.96 rows=200338 width=8)
>    Hash Cond: (t.latest_transmission_id = tb.transmission_id)
>    ->  Seq Scan on ticket t  (cost=0.00..1767767.26 rows=69990826 width=8)
>    ->  Hash  (cost=108923.38..108923.38 rows=400374 width=4)
>          ->  Index Scan using transmission_base_by_parse_date on
> transmission_base tb  (cost=0.00..108923.38 rows=400374 width=4)
>                Index Cond: (parse_date > '2014-07-31 00:00:00'::timestamp
> without time zone)
> (6 rows)

Do you have an index on ticket (latest_transmission_id)?

--
If you can't see the forest for the trees,
Cut the trees and you'll see there is no forest.


Re: Query planner question

From
Soni M
Date:



On Fri, Aug 22, 2014 at 9:10 PM, Alban Hertroys <haramrae@gmail.com> wrote:
On 22 August 2014 14:26, Soni M <diptatapa@gmail.com> wrote:
> Currently we have only latest_transmission_id as FK, described here :
> TABLE "ticket" CONSTRAINT "fkcbe86b0c6ddac9e" FOREIGN KEY
> (latest_transmission_id) REFERENCES transmission_base(transmission_id)
>
> Change the query to include only FK still result the same:
> explain select t.ticket_id ,
> tb.transmission_id
> from ticket t,
> transmission_base tb
> where t.latest_transmission_id = tb.transmission_id
> and tb.parse_date > ('2014-07-31');
>                                                             QUERY PLAN
> ----------------------------------------------------------------------------------------------------------------------------------
>  Hash Join  (cost=113928.06..2583606.96 rows=200338 width=8)
>    Hash Cond: (t.latest_transmission_id = tb.transmission_id)
>    ->  Seq Scan on ticket t  (cost=0.00..1767767.26 rows=69990826 width=8)
>    ->  Hash  (cost=108923.38..108923.38 rows=400374 width=4)
>          ->  Index Scan using transmission_base_by_parse_date on
> transmission_base tb  (cost=0.00..108923.38 rows=400374 width=4)
>                Index Cond: (parse_date > '2014-07-31 00:00:00'::timestamp
> without time zone)
> (6 rows)

Do you have an index on ticket (latest_transmission_id)?

Yes, both t.latest_transmission_id and tb.transmission_id is indexed.

Indexes:
    "transmission_base_pkey" PRIMARY KEY, btree (transmission_id) CLUSTER
Indexes:
    "ticket_by_latest_transmission" btree (latest_transmission_id)

 
--
If you can't see the forest for the trees,
Cut the trees and you'll see there is no forest.



--
Regards,

Soni Maula Harriz

Re: Query planner question

From
David G Johnston
Date:
Soni M wrote
> On Fri, Aug 22, 2014 at 9:10 PM, Alban Hertroys <

> haramrae@

> > wrote:
>
>> On 22 August 2014 14:26, Soni M <

> diptatapa@

> > wrote:
>> > Currently we have only latest_transmission_id as FK, described here :
>> > TABLE "ticket" CONSTRAINT "fkcbe86b0c6ddac9e" FOREIGN KEY
>> > (latest_transmission_id) REFERENCES transmission_base(transmission_id)
>> >
>> > Change the query to include only FK still result the same:
>> > explain select t.ticket_id ,
>> > tb.transmission_id
>> > from ticket t,
>> > transmission_base tb
>> > where t.latest_transmission_id = tb.transmission_id
>> > and tb.parse_date > ('2014-07-31');
>> >                                                             QUERY PLAN
>> >
>>
----------------------------------------------------------------------------------------------------------------------------------
>> >  Hash Join  (cost=113928.06..2583606.96 rows=200338 width=8)
>> >    Hash Cond: (t.latest_transmission_id = tb.transmission_id)
>> >    ->  Seq Scan on ticket t  (cost=0.00..1767767.26 rows=69990826
>> width=8)
>> >    ->  Hash  (cost=108923.38..108923.38 rows=400374 width=4)
>> >          ->  Index Scan using transmission_base_by_parse_date on
>> > transmission_base tb  (cost=0.00..108923.38 rows=400374 width=4)
>> >                Index Cond: (parse_date > '2014-07-31
>> 00:00:00'::timestamp
>> > without time zone)
>> > (6 rows)
>>
>> Do you have an index on ticket (latest_transmission_id)?
>>
>> Yes, both t.latest_transmission_id and tb.transmission_id is indexed.
>
> Indexes:
>     "transmission_base_pkey" PRIMARY KEY, btree (transmission_id) CLUSTER
> Indexes:
>     "ticket_by_latest_transmission" btree (latest_transmission_id)

Can you provide EXPLAIN ANALYZE for all three queries?

David J.




--
View this message in context: http://postgresql.1045698.n5.nabble.com/Query-planner-question-tp5815659p5815981.html
Sent from the PostgreSQL - general mailing list archive at Nabble.com.


Re: Query planner question

From
Jeff Janes
Date:
On Wednesday, August 20, 2014, Soni M <diptatapa@gmail.com> wrote:
Hi Everyone,

I have this query :

select t.ticket_id ,
tb.transmission_id
from ticket t,
transmission_base tb
where t.latest_transmission_id = tb.transmission_id
and t.ticket_number = tb.ticket_number
and tb.parse_date > ('2014-07-31');


Indexes on ticket :
    "ticket_pkey" PRIMARY KEY, btree (ticket_id) CLUSTER
    "ticket_by_latest_transmission" btree (latest_transmission_id)
    "ticket_by_ticket_number" btree (ticket_number)

This query only returns some portions of rows from ticket table.
The question is, Why does postgres need to get all the rows from ticket table in order to complete this query?
Can't postgres use indexes to get only needed rows on ticket table?


It can, but having separate indexes on latest_transmission_id and ticket_number is not going to work.

You need a joint index on both columns.
 
 Cheers,

Jeff

Re: Query planner question

From
Alban Hertroys
Date:
On 23 Aug 2014, at 4:34, Soni M <diptatapa@gmail.com> wrote:
> On Fri, Aug 22, 2014 at 9:10 PM, Alban Hertroys <haramrae@gmail.com> wrote:
> On 22 August 2014 14:26, Soni M <diptatapa@gmail.com> wrote:
> > Currently we have only latest_transmission_id as FK, described here :
> > TABLE "ticket" CONSTRAINT "fkcbe86b0c6ddac9e" FOREIGN KEY
> > (latest_transmission_id) REFERENCES transmission_base(transmission_id)
> >
> > Change the query to include only FK still result the same:
> > explain select t.ticket_id ,
> > tb.transmission_id
> > from ticket t,
> > transmission_base tb
> > where t.latest_transmission_id = tb.transmission_id
> > and tb.parse_date > ('2014-07-31');
> >                                                             QUERY PLAN
> >
----------------------------------------------------------------------------------------------------------------------------------
> >  Hash Join  (cost=113928.06..2583606.96 rows=200338 width=8)
> >    Hash Cond: (t.latest_transmission_id = tb.transmission_id)
> >    ->  Seq Scan on ticket t  (cost=0.00..1767767.26 rows=69990826 width=8)
> >    ->  Hash  (cost=108923.38..108923.38 rows=400374 width=4)
> >          ->  Index Scan using transmission_base_by_parse_date on
> > transmission_base tb  (cost=0.00..108923.38 rows=400374 width=4)
> >                Index Cond: (parse_date > '2014-07-31 00:00:00'::timestamp
> > without time zone)
> > (6 rows)
>
> Do you have an index on ticket (latest_transmission_id)?
>
> Yes, both t.latest_transmission_id and tb.transmission_id is indexed.
>
> Indexes:
>     "transmission_base_pkey" PRIMARY KEY, btree (transmission_id) CLUSTER
> Indexes:
>     "ticket_by_latest_transmission" btree (latest_transmission_id)

Okay, so we got those indexes. So much for the low-hanging fruit.

From the above plan we learn that the database estimates[1] that 400k rows from transmission match your condition
(parse_date> '2014-07-31’). The ticket table has a foreign key to that table, which suggests a 1:n relationship. It
alsohas around 70M rows, or at least the database seems to think that about that amount will match those 400k
transmissions.

That means that if on average 175 (=70M/400k) ticket ID’s match a transmission ID, the database would be needing all
those70M rows anyway - and even if it only needs every 175th row, a sequential scan is not a particularly inefficient
wayto go about this. 
The alternative is a whole lot of index lookups, probably not in the same order as either the index or the rows on
disk,meaning quite a bit of random disk I/O. 

I’m suspecting that the cost estimates for this query with seq-scans disabled aren’t very different, provided doing so
comesup with a comparable plan. 

Things you might want to verify/try:
* Are those estimated numbers of rows accurate? If not, is autovacuum (or scheduled vacuum) keeping up with the amount
ofdata churn on these tables? Do you collect a sufficiently large sample for the statistics? 
* How much bloat is in these tables/indexes?
* Did you change planner settings (such as disabling bitmap scans; I kind of expected one here) or did you change cost
estimateparameters? 
* Does it help to put an index on transmission (parse_date, transmission_id)?
* If none of that helps, we’re going to need the output of explain analyze - that will probably take long to create, so
youmight as well start with that and do the other stuff at the side. 

What kind of hardware are these disks on? Is it possible that disk I/O on this particular machine is relatively slow
(relativeto the seq/random cost factor for disk access as specified in your postgresql.conf)? 

Cheers,

Alban Hertroys

[1] You did not provide explain analyse output, so we only have estimates to work with.
--
If you can't see the forest for the trees,
cut the trees and you'll find there is no forest.



Re: Query planner question

From
Soni M
Date:
here's the explain analyze result : http://explain.depesz.com/s/Mvv and http://explain.depesz.com/s/xxF9

it seems that i need to dig more on query planner parameter.

BTW, thanks all for the helps.


On Sat, Aug 23, 2014 at 4:33 PM, Alban Hertroys <haramrae@gmail.com> wrote:
On 23 Aug 2014, at 4:34, Soni M <diptatapa@gmail.com> wrote:
> On Fri, Aug 22, 2014 at 9:10 PM, Alban Hertroys <haramrae@gmail.com> wrote:
> On 22 August 2014 14:26, Soni M <diptatapa@gmail.com> wrote:
> > Currently we have only latest_transmission_id as FK, described here :
> > TABLE "ticket" CONSTRAINT "fkcbe86b0c6ddac9e" FOREIGN KEY
> > (latest_transmission_id) REFERENCES transmission_base(transmission_id)
> >
> > Change the query to include only FK still result the same:
> > explain select t.ticket_id ,
> > tb.transmission_id
> > from ticket t,
> > transmission_base tb
> > where t.latest_transmission_id = tb.transmission_id
> > and tb.parse_date > ('2014-07-31');
> >                                                             QUERY PLAN
> > ----------------------------------------------------------------------------------------------------------------------------------
> >  Hash Join  (cost=113928.06..2583606.96 rows=200338 width=8)
> >    Hash Cond: (t.latest_transmission_id = tb.transmission_id)
> >    ->  Seq Scan on ticket t  (cost=0.00..1767767.26 rows=69990826 width=8)
> >    ->  Hash  (cost=108923.38..108923.38 rows=400374 width=4)
> >          ->  Index Scan using transmission_base_by_parse_date on
> > transmission_base tb  (cost=0.00..108923.38 rows=400374 width=4)
> >                Index Cond: (parse_date > '2014-07-31 00:00:00'::timestamp
> > without time zone)
> > (6 rows)
>
> Do you have an index on ticket (latest_transmission_id)?
>
> Yes, both t.latest_transmission_id and tb.transmission_id is indexed.
>
> Indexes:
>     "transmission_base_pkey" PRIMARY KEY, btree (transmission_id) CLUSTER
> Indexes:
>     "ticket_by_latest_transmission" btree (latest_transmission_id)

Okay, so we got those indexes. So much for the low-hanging fruit.

From the above plan we learn that the database estimates[1] that 400k rows from transmission match your condition (parse_date > '2014-07-31’). The ticket table has a foreign key to that table, which suggests a 1:n relationship. It also has around 70M rows, or at least the database seems to think that about that amount will match those 400k transmissions.

That means that if on average 175 (=70M/400k) ticket ID’s match a transmission ID, the database would be needing all those 70M rows anyway - and even if it only needs every 175th row, a sequential scan is not a particularly inefficient way to go about this.
The alternative is a whole lot of index lookups, probably not in the same order as either the index or the rows on disk, meaning quite a bit of random disk I/O.

I’m suspecting that the cost estimates for this query with seq-scans disabled aren’t very different, provided doing so comes up with a comparable plan.

Things you might want to verify/try:
* Are those estimated numbers of rows accurate? If not, is autovacuum (or scheduled vacuum) keeping up with the amount of data churn on these tables? Do you collect a sufficiently large sample for the statistics?
* How much bloat is in these tables/indexes?
* Did you change planner settings (such as disabling bitmap scans; I kind of expected one here) or did you change cost estimate parameters?
* Does it help to put an index on transmission (parse_date, transmission_id)?
* If none of that helps, we’re going to need the output of explain analyze - that will probably take long to create, so you might as well start with that and do the other stuff at the side.

What kind of hardware are these disks on? Is it possible that disk I/O on this particular machine is relatively slow (relative to the seq/random cost factor for disk access as specified in your postgresql.conf)?

Cheers,

Alban Hertroys

[1] You did not provide explain analyse output, so we only have estimates to work with.
--
If you can't see the forest for the trees,
cut the trees and you'll find there is no forest.




--
Regards,

Soni Maula Harriz

Re: Query planner question

From
Jeff Janes
Date:
On Wed, Aug 20, 2014 at 6:16 PM, Soni M <diptatapa@gmail.com> wrote:
Hi Everyone,

I have this query :

select t.ticket_id ,
tb.transmission_id
from ticket t,
transmission_base tb
where t.latest_transmission_id = tb.transmission_id
and t.ticket_number = tb.ticket_number
and tb.parse_date > ('2014-07-31');


Indexes on ticket :
    "ticket_pkey" PRIMARY KEY, btree (ticket_id) CLUSTER
    "ticket_by_latest_transmission" btree (latest_transmission_id)
    "ticket_by_ticket_number" btree (ticket_number)

This query only returns some portions of rows from ticket table.
The question is, Why does postgres need to get all the rows from ticket table in order to complete this query?
Can't postgres use indexes to get only needed rows on ticket table?

I try set seqscan to off, but still index scan try to get all rows on ticket table.
Here's the execution plan : http://explain.depesz.com/s/abH2

If you want to force a nested loop, you probably need to disable the mergejoin as well, and maybe the hashjoin.  Forcing the planner to do things the way you want can be difficult.

Cheers,

Jeff

Re: Query planner question

From
Bill Moran
Date:
On Mon, 25 Aug 2014 09:09:07 -0700
Jeff Janes <jeff.janes@gmail.com> wrote:

> On Wed, Aug 20, 2014 at 6:16 PM, Soni M <diptatapa@gmail.com> wrote:
>
> > Hi Everyone,
> >
> > I have this query :
> >
> > select t.ticket_id ,
> > tb.transmission_id
> > from ticket t,
> > transmission_base tb
> > where t.latest_transmission_id = tb.transmission_id
> > and t.ticket_number = tb.ticket_number
> > and tb.parse_date > ('2014-07-31');
> >
> > Execution plan: http://explain.depesz.com/s/YAak
> >
> > Indexes on ticket :
> >     "ticket_pkey" PRIMARY KEY, btree (ticket_id) CLUSTER
> >     "ticket_by_latest_transmission" btree (latest_transmission_id)
> >     "ticket_by_ticket_number" btree (ticket_number)
> >
> > This query only returns some portions of rows from ticket table.
> > The question is, Why does postgres need to get all the rows from ticket
> > table in order to complete this query?
> > Can't postgres use indexes to get only needed rows on ticket table?
> >
> > I try set seqscan to off, but still index scan try to get all rows on
> > ticket table.
> > Here's the execution plan : http://explain.depesz.com/s/abH2

That's probably not the best approach, it's likely that something is feeding
the planner wrong information.  An EXPLAIN ANALYZE might reveal if that's the
case.

Some other things to check: are these two tables being analyzed frequently
enough that their statistics are up to date? (EXPLAIN ANALYZE will generally
show if that's a problem too).  It would seem that the planner thinks that
the distribution of tb.ticket_number is large enough that it will probably
have to fetch most of the rows from ticket anyway, which is a logical reason
for it to skip the index and just do a seq scan.  Can you confirm/deny whether
that's the case?  If not, and you're analyzing the tables often enough, you
may need to raise your statistics target on those tables.

--
Bill Moran
I need your help to succeed:
http://gamesbybill.com