Re: Query planner question - Mailing list pgsql-general

From Soni M
Subject Re: Query planner question
Date
Msg-id CAAMgDXktLYem+sye3ny6fO=SDOn-7T2McecmSwUrSb0NHjb0vQ@mail.gmail.com
Whole thread Raw
In response to Re: Query planner question  (Alban Hertroys <haramrae@gmail.com>)
List pgsql-general
here's the explain analyze result : http://explain.depesz.com/s/Mvv and http://explain.depesz.com/s/xxF9

it seems that i need to dig more on query planner parameter.

BTW, thanks all for the helps.


On Sat, Aug 23, 2014 at 4:33 PM, Alban Hertroys <haramrae@gmail.com> wrote:
On 23 Aug 2014, at 4:34, Soni M <diptatapa@gmail.com> wrote:
> On Fri, Aug 22, 2014 at 9:10 PM, Alban Hertroys <haramrae@gmail.com> wrote:
> On 22 August 2014 14:26, Soni M <diptatapa@gmail.com> wrote:
> > Currently we have only latest_transmission_id as FK, described here :
> > TABLE "ticket" CONSTRAINT "fkcbe86b0c6ddac9e" FOREIGN KEY
> > (latest_transmission_id) REFERENCES transmission_base(transmission_id)
> >
> > Change the query to include only FK still result the same:
> > explain select t.ticket_id ,
> > tb.transmission_id
> > from ticket t,
> > transmission_base tb
> > where t.latest_transmission_id = tb.transmission_id
> > and tb.parse_date > ('2014-07-31');
> >                                                             QUERY PLAN
> > ----------------------------------------------------------------------------------------------------------------------------------
> >  Hash Join  (cost=113928.06..2583606.96 rows=200338 width=8)
> >    Hash Cond: (t.latest_transmission_id = tb.transmission_id)
> >    ->  Seq Scan on ticket t  (cost=0.00..1767767.26 rows=69990826 width=8)
> >    ->  Hash  (cost=108923.38..108923.38 rows=400374 width=4)
> >          ->  Index Scan using transmission_base_by_parse_date on
> > transmission_base tb  (cost=0.00..108923.38 rows=400374 width=4)
> >                Index Cond: (parse_date > '2014-07-31 00:00:00'::timestamp
> > without time zone)
> > (6 rows)
>
> Do you have an index on ticket (latest_transmission_id)?
>
> Yes, both t.latest_transmission_id and tb.transmission_id is indexed.
>
> Indexes:
>     "transmission_base_pkey" PRIMARY KEY, btree (transmission_id) CLUSTER
> Indexes:
>     "ticket_by_latest_transmission" btree (latest_transmission_id)

Okay, so we got those indexes. So much for the low-hanging fruit.

From the above plan we learn that the database estimates[1] that 400k rows from transmission match your condition (parse_date > '2014-07-31’). The ticket table has a foreign key to that table, which suggests a 1:n relationship. It also has around 70M rows, or at least the database seems to think that about that amount will match those 400k transmissions.

That means that if on average 175 (=70M/400k) ticket ID’s match a transmission ID, the database would be needing all those 70M rows anyway - and even if it only needs every 175th row, a sequential scan is not a particularly inefficient way to go about this.
The alternative is a whole lot of index lookups, probably not in the same order as either the index or the rows on disk, meaning quite a bit of random disk I/O.

I’m suspecting that the cost estimates for this query with seq-scans disabled aren’t very different, provided doing so comes up with a comparable plan.

Things you might want to verify/try:
* Are those estimated numbers of rows accurate? If not, is autovacuum (or scheduled vacuum) keeping up with the amount of data churn on these tables? Do you collect a sufficiently large sample for the statistics?
* How much bloat is in these tables/indexes?
* Did you change planner settings (such as disabling bitmap scans; I kind of expected one here) or did you change cost estimate parameters?
* Does it help to put an index on transmission (parse_date, transmission_id)?
* If none of that helps, we’re going to need the output of explain analyze - that will probably take long to create, so you might as well start with that and do the other stuff at the side.

What kind of hardware are these disks on? Is it possible that disk I/O on this particular machine is relatively slow (relative to the seq/random cost factor for disk access as specified in your postgresql.conf)?

Cheers,

Alban Hertroys

[1] You did not provide explain analyse output, so we only have estimates to work with.
--
If you can't see the forest for the trees,
cut the trees and you'll find there is no forest.




--
Regards,

Soni Maula Harriz

pgsql-general by date:

Previous
From: Marc Mamin
Date:
Subject: Way to identify the current session's temp tables within pg_class ?
Next
From: hubert depesz lubaczewski
Date:
Subject: Re: deadlock in single-row select-for-update + update scenario? How could it happen?