Re: Performance problem with semi-large tables - Mailing list pgsql-performance
From | PFC |
---|---|
Subject | Re: Performance problem with semi-large tables |
Date | |
Msg-id | opslexchx2th1vuj@musicbox Whole thread Raw |
In response to | Re: Performance problem with semi-large tables ("Ken Egervari" <ken@upfactor.com>) |
Responses |
Re: Performance problem with semi-large tables
|
List | pgsql-performance |
>> SELECT * FROM shipment_status WHERE release_code_id = constant ORDER BY >> release_code_id DESC, date DESC LIMIT 100; > > I have done this in other queries where sorting by both release code and > date were important. You are right, it is very fast and I do have this > index in play. However, most of the time I retreive shipment's when > their shipment_status all have the same release_code, which makes > sorting kind of moot :/ I guess that answers your comment below. Ah, well in this case, ORDER BY release_code_id DESC seems of course useless because you only have one order_code_id, but it is in fact necessary to make the planner realize it can use the index on (release_code_id,date) for the ordering. If you just ORDER BY date, the planner will not use your index. > Thanks again for your response. I'll try and clarify some metrics that > I took a few days to figure out what would be the best join order. > > By running some count queries on the production database, I noticed > there were only 8 rows in release_code. The filtered column is unique, Let's forget the shipments table for now. So you mean there is an unique, one-to-one relation between release_code_id and filtered_column ? The planner is not able to derermine this ahead of time ; and in your case, it's important that it be unique to be able to use the index to retrieve quickly the rows in (date DESC) order. So if you'll join only to ONE release_code_id, you can do this : (SELECT * FROM shipment_status WHERE release_code_id = (SELECT r.id FROM release_code WHERE r.filtered_column = '5' LIMIT 1) ORDER BY release_code_id DESC, date DESC LIMIT 100) Which is no longer a join and will get your shipment_status_id's very quickly. > so that means the filter ratio is 0.125. However, the data distribution > is not normal. When the filtered column is the constant '5', Postgres > will join to 54% of the shipment_status rows. Since shipment_status has > 32,000+ rows, this join is not a very good one to make. Sure ! > The shipment table has 17k rows, but also due to the distribution of > data, almost every shipment will join to a shipment_status with a > release_code of '5'. For your information, this column indicates that a > shipment has been "released", as most shipments will move to this state > eventually. The actual join ratio from shipment_status to shipment is > about 98.5% of the rows in the shipment table, which is still basically > 17k rows. > > I was simply curious how to make something like this faster. You see, > it's the table size and the bad filters are really destroying this query > example. I would never make a query to the database like this in > practice, but I have similar queries that I do make that aren't much > better (and can't be due to business requirements). > > For example, let's add another filter to get all the shipments with > release code '5' that are 7 days old or newer. > > ss.date >= current_date - 7 It's the order by + limit which makes the query behaves badly, and which forces use of kludges to use the index. If you add another condition like that, it should be a breeze. > By analyzing the production data, this where clause has a filter ratio > of 0.08, which is far better than the release_code filter both in ratio > and in the number of rows that it can avoid joining. However, if I had > this filter into the original query, Postgres will not act on it first - > and I think it really should before it even touches release_code. Well I think too. What with the subqueries I wrote with the LIMIT inside the subquery ? Any better ? Normally the planner is able to deconstruct subqueries and change the order as it sees fit, but if there are LIMIT's I don't know. > However, the planner (using EXPLAIN ANALYZE) will actually pick this > filter last and will join 17k rows prematurely to release_code. In this > example, I'd like force postgres to do the date filter first, join to > release_code next, then finally to shipment. You could use the JOIN keywords to specify the join order youself. > Another example is filtering by the driver_id, which is a foreign key > column on the shipment table itself to a driver table. This has a > filter ratio of 0.000625 when analyzing the production data. However, > PostgreSQL will not act on this filter first either. The sad part is > that since drivers are actually distributed more evenly in the database, > it would filter out the shipment table from 17k to about 10 shipments on > average. In most cases, it ends up being more than 10, but not more > than 60 or 70, which is very good since some drivers don't have any > shipments (I question why they are even in the database, but that's > another story). As you can see, joining to shipment_status at this > point (using the primary key index from shipment.current_status_id to > shipment_status.id) should be extremely efficient. Yet, Postgres's > planner/optimizer won't make the right call until might later in the > plan. And if you select on shipment_status where driver_id=something, does it use the index ? > Well, the filtered column is actually unique (but it's not the primary > key). Should I just make it the primary key? Can't postgres be equally It won't change anything, so probably not. What will make it faster will be changing : WHERE release_code_id IN (SELECT r.id into : WHERE release_code_id = (SELECT r.id > efficient when using other candidate keys as well? If not, then I will > definately change the design of my database. I mostly use synthetic > keys to make Hibernate configuration fairly straight-forward and to make > it easy so all of my entities extend from the same base class. > > You mean by avoiding the filter on number and avoiding the join? You > see, I never thought joining to release_code should be so bad since the > table only has 8 rows in it. It's not the join itself that's bad, it's the order by... Wel the planner insisting on joining the two big tables before limiting, also is worrying. > Anyway, I hope my comments provide you with better insight to the > problem I'm having. I really do appreciate your comments because I > think you are right on target with your direction, discussing things I > haven't really thought up on my own. I thank you. Thanks ;)
pgsql-performance by date: