Re: Help with tuning this query - Mailing list pgsql-performance
From | Ken Egervari |
---|---|
Subject | Re: Help with tuning this query |
Date | |
Msg-id | 000901c51f55$aa4b7400$6a01a8c0@KEN Whole thread Raw |
In response to | Help with tuning this query ("Ken Egervari" <ken@upfactor.com>) |
Responses |
Re: Help with tuning this query
|
List | pgsql-performance |
>> select s.* >> from shipment s >> inner join carrier_code cc on s.carrier_code_id = cc.id >> inner join carrier c on cc.carrier_id = c.id >> inner join carrier_to_person ctp on ctp.carrier_id = c.id >> inner join person p on p.id = ctp.person_id >> inner join shipment_status cs on s.current_status_id = cs.id >> inner join release_code rc on cs.release_code_id = rc.id >> left join shipment_status ss on ss.shipment_id = s.id >> where >> p.id = :personId and >> s.is_purged = false and >> rc.number = '9' and >> cs is not null and >> cs.date >= current_date - 31 >> order by cs.date desc >> ... >> shipment contains 40,000 rows >> shipment_status contains 80,000 rows > > I may be missing something, but it looks like the second join > on shipment_status (the left join) is not adding anything to your > results, except more work. ss is not used for output, nor in the where > clause, so what is its purpose ? It does look strange doesn't it? I would think the same thing if it were the first time I looked at it. But rest assured, it's done by design. A shipment relates to many shipment_status rows, but only 1 is the current shipment_status for the shipment. The first does queries on the current status only and doesn't analyze the rest of the related items. The second left join is for eager loading so that I don't have to run a seperate query to fetch the children for each shipment. This really does improve performance because otherwise you'll have to make N+1 queries to the database, and that's just too much overhead. Since I need all the shipment_status children along with the shipment for the domain logic to work on them, I have to load them all. On average, a shipment will have 2 shipment_status rows. So if the query selects 100 shipments, the query returns 200 rows. Hibernate is intelligent enough to map the shipment_status children to the appropriate shipment automatically. > if cs.date has an upper limit, it might be helpful to change the > condition to a BETWEEN Well, I could create an upper limit. It would be the current date. Would adding in this redundant condition improve performance? I've clustered the shipment table so that the dates are together, which has improved performance. I'm not sure adding in this implicit condition will speed up anything, but I will definately try it. > in any case, i would think you might need an index on > shipment(carrier_code_id) > shipment(current_status_id) > shipment_status(id) Unfortunately, I have indexes on all three (Postgres implicitly creates indexes for unique keys). Here are the other 2 that are already created: CREATE INDEX shipment_carrier_code_id_idx ON shipment USING btree (carrier_code_id); CREATE INDEX shipment_current_status_id_idx ON shipment USING btree (current_status_id); So I guess we've been thinking the same thing. Don't get me wrong. These indexes speed up the query from 1.6 seconds to 250 milliseconds. I just need to be around 30 milliseconds. Another idea that had occured to me was trying to force postgres to driver on the person table because that filter ratio is so great compared to everything else, but I do remember looking at the explain days ago and it was one of the last tables being filtered/joined. Is there anyway to force postgres to pick person? The reason I ask is because this would really reduce the number of rows it pulls out from the shipment table. Thanks for comments. I'll try making that date explicit and change the query to use between to see if that does anything. Regards and many thanks, Ken
pgsql-performance by date: