Performance problem with semi-large tables - Mailing list pgsql-performance

From Ken Egervari
Subject Performance problem with semi-large tables
Date
Msg-id 001601c50416$57090e10$cd422418@a96dfxb4kjzogw
Whole thread Raw
Responses Re: Performance problem with semi-large tables
Re: Performance problem with semi-large tables
List pgsql-performance
Hi everyone.
 
I'm new to this forum and was wondering if anyone would be kind enough to help me out with a pretty severe performance issue.  I believe the problem to be rather generic, so I'll put it in generic terms.  Since I'm at home and not a work (but this is really bugging me), I can't post any specifics.  However, I think my explaination will suffice.
 
I have a 2 tables that are are getting large and will only get larger with time (expoentially as more users sign on to the system).  Right the now, a table called 'shipment' contains about 16,000 rows and 'shipment_status' contains about 32,500 rows.  These aren't massive rows (I keep reading about tables with millions), but they will definately get into 6 digits by next year and query performance is quite poor.
 
Now, from what I can understand about tuning, you want to specify good filters, provide good indexes on the driving filter as well as any referencial keys that are used while joining.  This has helped me solve performance problems many times in the past (for example, changing a query speed from 2 seconds to 21 milliseconds). 
 
However, I am now tuning queries that operate on these two tables and the filters aren't very good (the best is a filter ratio of 0.125) and the number of rows returned is very large (not taking into consideration limits).
 
For example, consider something like this query that takes ~1 second to finish:
 
select s.*, ss.*
from shipment s, shipment_status ss, release_code r
where s.current_status_id = ss.id
   and ss.release_code_id = r.id
   and r.filtered_column = '5'
order by ss.date desc
limit 100;
 
Release code is just a very small table of 8 rows by looking at the production data, hence the 0.125 filter ratio.  However, the data distribution is not normal since the filtered column actually pulls out about 54% of the rows in shipment_status when it joins.  Postgres seems to be doing a sequencial scan to pull out all of these rows.  Next, it joins approx 17550 rows to shipment.  Since this query has a limit, it only returns the first 100, which seems like a waste.
 
Now, for this query, I know I can filter out the date instead to speed it up.  For example, I can probably search for all the shipments in the last 3 days instead of limiting it to 100.  But since this isn't a real production query, I only wanted to show it as an example since many times I cannot do a filter by the date (and the sort may be date or something else irrelavant).
 
I'm just stressed out how I can make queries like this more efficient since all I see is a bunch of hash joins and sequencial scans taking all kinds of time.
 
I guess here are my 2 questions:
 
1. Should I just change beg to change the requirements so that I can make more specific queries and more screens to access those?
2. Can you recommend ways so that postgres acts on big tables more efficiently?  I'm not really interested in this specific case (I just made it up).  I'm more interested in general solutions to this general problem of big table sizes with bad filters and where join orders don't seem to help much.
 
Thank you very much for your help.
 
Best Regards,
Ken Egervari

pgsql-performance by date:

Previous
From: Tom Lane
Date:
Subject: Re: Postgres server getting slow!!
Next
From: Christopher Weimann
Date:
Subject: Re: PostgreSQL clustering VS MySQL clustering