Re: Help with tuning this query (more musings) - Mailing list pgsql-performance
From | Ken Egervari |
---|---|
Subject | Re: Help with tuning this query (more musings) |
Date | |
Msg-id | 001b01c51fbb$39377b90$6a01a8c0@KEN Whole thread Raw |
In response to | Help with tuning this query ("Ken Egervari" <ken@upfactor.com>) |
Responses |
Re: Help with tuning this query (more musings)
|
List | pgsql-performance |
John, >Why are you now left joining driver and carrier code, but inner joining >shipment_status? I assume this is the *real* query that you are executing. Well, the old and new versions are real queries. I changed the query a bit because I noticed for some users, the listing was pulling out many different drivers. Each separate query on the driver took about 10 milliseconds. For a listing of 39 results, that's a possible 390 milliseconds assuming all the drivers are different and none of them are cached. So, I just left joined the driver and it added about 5 milliseconds of overhead to this query. I apoligize for not communicating this change, but I had to make it to speed this stuff up during the day until I could fix the root of the problem. One thing that I learned is that left joining and including lots of columns rarely slows the query. The same was done for the carrier_code, although this only saved 15 milliseconds. The end result is still high because the query we are talking about is very expensive, but at least the following queries that appeared after are eliminated altogether. The overhead and separate queries really places a hamper on overall performance. For the person 355, the overhead was about 300 milliseconds since 10 of the drivers were null. I hope this makes sense. From the earlier explain analyze, and your statements, the initial >person p should be the heavily selective portion. I totally agree. I just never really figured out how to tell postgres my intentions. >You are also double joining against carrier code, once as a left outer >join, and once in the inner join. Yes, that was my mistake since Hibernate didn't generate that - I manually put in those sub-selects. >This query doesn't seem quite right. Are you sure it is generating the >rows you are expecting? Yes, the results are the same with the left joins. I didn't include d.* and cc.* in the select, which again, is my mistake. The main problem is when I make changes to the query, I don't think about it in terms of how SQL does it. I think about Hibernate does it. Earger loading rows is different from selecting the main row at the top of the query. I bet this comes as very strange, but in Hibernate they are two-different things. I've been using Hibernate for so long that working with SQL is not so natural for me. This is my mistake and I apologize. >You might want to post the explain analyze of this query to have a point >of reference, but what about something like this: >select s.*, ss.* Okay. Here is syntax-corrected version of your very creative query. I wouldn't have thought of doing something like this at all. It makes perfect sense that you are commanding the database to do what it should be doing, which is something I really like since the concept of a planner picking stuff for me makes me unsettled (even if it is doing it right). select i.*, ss.* from shipment_status ss inner join release_code rc on ss.release_code_id=rc.id, ( select s.* from shipment s where s.current_status_id is not null and s.is_purged=false and s.carrier_code_id in ( select cc.id from person p inner join carrier_to_person ctp on p.id=ctp.person_id inner join carrier c on ctp.carrier_id=c.id inner join carrier_code cc on cc.carrier_id = c.id where p.id = 355 ) ) as i where (rc.number='9' ) and(i.current_status_id = ss.id) and(ss.date between current_date-31 and current_date); When running this on my production database, the speed is 265 milliseconds on average running it 20 times (lowest was 250, highest was 281). Not quite what we want, but I'm sure the tuning of this new query hasn't really started. Here is the EXPLAIN ANALYZE. It seems very similiar to the one postgres picked out but it's a bit shorter. Hash IN Join (cost=676.15..1943.11 rows=14 width=91) (actual time=250.000..328.000 rows=39 loops=1) Hash Cond: ("outer".carrier_code_id = "inner".id) -> Merge Join (cost=661.65..1926.51 rows=392 width=91) (actual time=250.000..328.000 rows=310 loops=1) Merge Cond: ("outer".current_status_id = "inner".id) -> Index Scan using shipment_current_status_id_idx on shipment s (cost=0.00..2702.56 rows=27257 width=66) (actual time=0.000..110.000 rows=27711 loops=1) Filter: ((current_status_id IS NOT NULL) AND (is_purged = false)) -> Sort (cost=661.65..666.46 rows=1922 width=25) (actual time=140.000..172.000 rows=6902 loops=1) Sort Key: ss.id -> Hash Join (cost=1.11..556.82 rows=1922 width=25) (actual time=0.000..94.000 rows=6902 loops=1) Hash Cond: ("outer".release_code_id = "inner".id) -> Index Scan using current_status_date_idx on shipment_status ss (cost=0.01..459.64 rows=15372 width=25) (actual time=0.000..94.000 rows=14925 loops=1) Index Cond: ((date >= (('now'::text)::date - 31)) AND (date <= ('now'::text)::date)) -> Hash (cost=1.10..1.10 rows=1 width=4) (actual time=0.000..0.000 rows=0 loops=1) -> Seq Scan on release_code rc (cost=0.00..1.10 rows=1 width=4) (actual time=0.000..0.000 rows=1 loops=1) Filter: ((number)::text = '9'::text) -> Hash (cost=14.49..14.49 rows=2 width=4) (actual time=0.000..0.000 rows=0 loops=1) -> Nested Loop (cost=6.87..14.49 rows=2 width=4) (actual time=0.000..0.000 rows=2 loops=1) -> Index Scan using person_pkey on person p (cost=0.00..5.73 rows=1 width=4) (actual time=0.000..0.000 rows=1 loops=1) Index Cond: (id = 355) -> Hash Join (cost=6.87..8.74 rows=2 width=8) (actual time=0.000..0.000 rows=2 loops=1) Hash Cond: ("outer".carrier_id = "inner".carrier_id) -> Seq Scan on carrier_code cc (cost=0.00..1.57 rows=57 width=8) (actual time=0.000..0.000 rows=57 loops=1) -> Hash (cost=6.86..6.86 rows=1 width=12) (actual time=0.000..0.000 rows=0 loops=1) -> Hash Join (cost=3.04..6.86 rows=1 width=12) (actual time=0.000..0.000 rows=1 loops=1) Hash Cond: ("outer".id = "inner".carrier_id) -> Seq Scan on carrier c (cost=0.00..3.54 rows=54 width=4) (actual time=0.000..0.000 rows=54 loops=1) -> Hash (cost=3.04..3.04 rows=1 width=8) (actual time=0.000..0.000 rows=0 loops=1) -> Index Scan using carrier_to_person_person_id_idx on carrier_to_person ctp (cost=0.00..3.04 rows=1 width=8) (actual time=0.000..0.000 rows=1 loops=1) Index Cond: (355 = person_id) Total runtime: 344.000 ms >My idea with this query is to minimize the number of shipment rows that >need to be generated before joining with the other rows. My syntax is >probably a little bit off, since I can't actually run it against real >tables. Yes, I tried adding redundant 'from clauses' with the shipment or shipment_status tables and each caluse adds 100 milliseconds. I wish they weren't so expensive. >But looking at your *original* query, you were getting 15000 rows out of >shipment_status, and then 27700 rows out of shipment, which was then >being merge-joined down to only 300 rows, and then hash-joined down to 39. >I'm just trying to think of ways to prevent it from blossoming into 27k >rows to start with. Yes, nothing has changed from the original query. By the looks of things, the sub-select version returns slightly less rows but not much unfortunately. I'm trying to figure out how to minimize the rows traversals. Maybe I should explain a bit about the app so you can get an idea on why the shipment rows are so big? You see, the app keeps track of custom status for shipments. Either the status comes in, so the shipment row is created along with 1 or more shipment_status rows, or the shipments are prepared in advance (so no shipment_status rows are assigned to them immediately). In the case of p.id = 355, there are ~27000 shipments. But most of these are prepared in advance, which don't concern this query at all and should be filtered out. That's why the "s.current_status is not null" is important. This filter will reduce the rows from 27000 to about 3500, which is all the real shipments with customs status. The others will gain rows in shipment_status over time, but new shipment rows will be created in advance as well. At some point, it will probably balance out, but since the features to prepare shipments in advance are new, only some carriers will have more shipments than shipment_status rows. In some cases, there are no prepared shipments. When this happens, there is usually a 1:2 ratio between shipment and shipment_status. I think this weird distribution makes queries like this kind of hard to predict the performance of. Anyway, I think it's better to assume that previous case where shipment rows > shipment_status will tend to be the norm over time. If the query won't perform properly, I'm wondering if the requirements should really change. For example, there is another table called release_office that is also associated with shipment. I could filter by that too. I could then offer a screen to select the release office first and only show the shipments with that release office. The will reduce the number of shipments for some users, but not all. Some users use only one or two release offices, so it wouldn't be a big help. I could also make the query select a certain day instead of a range. Like I said in a previous post, this makes the query run at 47 milliseconds. However, this might make it harder for users to access the information... and if they clicked 31 days on the calendar, that's really 47*31 milliseconds total. I guess I'd have to ask for usability studies or something to figure out what people really hope to gain from these listings in the first place and how they'd want to work with them. Maybe it's not a performance problem - maybe it's a usability problem. However, even if that were the case, I'd still want to know how to fix something like this for my own knowledge since I'm still learning. I also know others are using postgres quite successfully with tables containing millions of rows, in applications far more riskier than mine. I'm not sure why this query is any different. Is there a configuration setting I can use to make things speed up perhaps? Anyhow, thanks for taking the time helping me out John. I'm going to play with more sub-selects and see if I find a combination that works a bit better. I'll post my results in a bit. If we do figure this out, it might be worthwhile for me to make a case-study and make it available over www.postgres.org so other people can benefit from this experience too. Many thanks! Ken
pgsql-performance by date: