Re: Performance problem with semi-large tables - Mailing list pgsql-performance

From | Ken Egervari
---|---
Subject | Re: Performance problem with semi-large tables
Date |
Msg-id | 000a01c5066a$0d1f43e0$cd422418@a96dfxb4kjzogw
In response to | Performance problem with semi-large tables ("Ken Egervari" <ken@upfactor.com>)
Responses | Re: Performance problem with semi-large tables
List | pgsql-performance
Thanks again for your response. I'll try to clarify some metrics that took me a few days to figure out while working out the best join order.

By running some count queries on the production database, I noticed there are only 8 rows in release_code. The filtered column is unique, so the filter ratio is 0.125. However, the data distribution is not uniform: when the filtered column is the constant '5', Postgres joins to 54% of the shipment_status rows. Since shipment_status has 32,000+ rows, this is not a good join to make.

The shipment table has 17k rows, but again due to the data distribution, almost every shipment joins to a shipment_status with a release_code of '5'. For context, this column indicates that a shipment has been "released"; most shipments move to this state eventually. The actual join ratio from shipment_status to shipment is about 98.5% of the rows in the shipment table, which is still basically 17k rows.

I was simply curious how to make something like this faster. The table size and the bad filters are really destroying this query example. I would never issue a query like this in practice, but I do make similar queries that aren't much better (and can't be, due to business requirements).

For example, let's add another filter to get all shipments with release code '5' that are 7 days old or newer:

ss.date >= current_date - 7

Analyzing the production data, this where clause has a filter ratio of 0.08, which is far better than the release_code filter both in ratio and in the number of rows it avoids joining. However, if I add this filter to the original query, Postgres will not act on it first - and I think it really should, before it even touches release_code. Instead, the planner (according to EXPLAIN ANALYZE) picks this filter last and joins 17k rows prematurely to release_code.
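For reference, here is a minimal sketch of what that query might look like. The original query is not shown in the thread, so the join conditions and column list are assumptions; only the table names, the release code constant, and the ss.date predicate come from the discussion above:

```sql
-- Hypothetical reconstruction of the problem query; adjust names to the real schema.
SELECT s.*
FROM shipment s
JOIN shipment_status ss ON ss.id = s.current_status_id
JOIN release_code r     ON r.id = ss.release_code_id
WHERE r.filtered_column = '5'           -- matches 54% of shipment_status rows
  AND ss.date >= current_date - 7;      -- selective filter (ratio 0.08) applied last by the planner
```

Running EXPLAIN ANALYZE on a query shaped like this is how the join order and row estimates described above can be observed.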
In this example, I'd like to force Postgres to apply the date filter first, join to release_code next, and finally to shipment.

Another example is filtering by driver_id, a foreign key column on the shipment table referencing a driver table. This has a filter ratio of 0.000625 in the production data, yet PostgreSQL will not act on this filter first either. The sad part is that since drivers are distributed fairly evenly in the database, this filter would cut the shipment table from 17k rows to about 10 shipments on average. In most cases it ends up being more than 10, but not more than 60 or 70, which is very good, since some drivers don't have any shipments (I question why they are even in the database, but that's another story). As you can see, joining to shipment_status at this point (using the primary key index from shipment.current_status_id to shipment_status.id) should be extremely efficient. Yet Postgres's planner/optimizer won't make the right call until much later in the plan.

> SELECT * FROM shipment_status WHERE release_code_id = constant ORDER BY
> release_code_id DESC, date DESC LIMIT 100;
>
> In this case, if you have an index on (release_code_id, date), the
> planner will use a limited index scan which will yield the rows in index
> order, which will be very fast.

I have done this in other queries where sorting by both release code and date was important. You are right, it is very fast, and I do have this index in play. However, most of the time I retrieve shipments whose shipment_status rows all have the same release_code, which makes the sorting kind of moot :/ I guess that answers your comment below.

> However, if you just have an index on date, this won't help you.
> In your case, moreover, you don't use release_code_id = constant, but it
> comes from a join. So there may be several different values for
> release_code_id ; thus the planner can't use the optimization, it has to
> find the rows with the release_code_id first.
> And it can't use the index
> on (release_code_id, date) to get the rows in sorted order precisely
> because there could be several different values for the release_code_id.
> And then it has to sort by date.

Well, the filtered column is actually unique (but it's not the primary key). Should I just make it the primary key? Can't Postgres be equally efficient when using other candidate keys? If not, then I will definitely change the design of my database. I mostly use synthetic keys to keep the Hibernate configuration fairly straightforward and so that all of my entities can extend from the same base class.

> I hope this makes it clearer. If you are absolutely sure there is only
> one row in release_code with r.filtered_column = '5', then this means
> release_code_id is a constant and your query could get a huge speedup by
> writing it differently.

You mean by avoiding the filter on the number and avoiding the join? You see, I never thought joining to release_code would be so bad, since the table only has 8 rows in it.

Anyway, I hope my comments give you better insight into the problem I'm having. I really appreciate your comments because I think you are right on target with your direction, discussing things I hadn't thought up on my own. Thank you.
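A sketch of what "writing it differently" might look like, assuming the single matching release_code row really does make release_code_id effectively constant. The join condition and column names beyond those mentioned in the thread are assumptions:

```sql
-- Hypothetical rewrite: resolve the constant release_code id once in a scalar
-- subquery instead of joining release_code against every shipment_status row.
SELECT s.*
FROM shipment s
JOIN shipment_status ss ON ss.id = s.current_status_id
WHERE ss.release_code_id =
      (SELECT id FROM release_code WHERE filtered_column = '5')
  AND ss.date >= current_date - 7;
```

The uncorrelated subquery is evaluated once, so release_code drops out of the join tree entirely, and the comparison against a single release_code_id value gives the existing (release_code_id, date) index a chance to be used.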