Re: Performance problem with semi-large tables - Mailing list pgsql-performance

From Ken Egervari
Subject Re: Performance problem with semi-large tables
Date
Msg-id 000a01c5066a$0d1f43e0$cd422418@a96dfxb4kjzogw
Whole thread Raw
In response to Performance problem with semi-large tables  ("Ken Egervari" <ken@upfactor.com>)
Responses Re: Performance problem with semi-large tables
List pgsql-performance
Thanks again for your response.  I'll try and clarify some metrics that I
took a few days to figure out what would be the best join order.

By running some count queries on the production database, I noticed there
were only 8 rows in release_code.  The filtered column is unique, so that
means the filter ratio is 0.125.  However, the data distribution is not
normal.  When the filtered column is the constant '5', Postgres will join to
54% of the shipment_status rows.  Since shipment_status has 32,000+ rows,
this join is not a very good one to make.

The shipment table has 17k rows, but also due to the distribution of data,
almost every shipment will join to a shipment_status with a release_code of
'5'.  For your information, this column indicates that a shipment has been
"released", as most shipments will move to this state eventually.  The
actual join ratio from shipment_status to shipment is about 98.5% of the
rows in the shipment table, which is still basically 17k rows.

I was simply curious how to make something like this faster.  You see, it's
the table size and the bad filters are really destroying this query example.
I would never make a query to the database like this in practice, but I have
similar queries that I do make that aren't much better (and can't be due to
business requirements).

For example, let's add another filter to get all the shipments with release
code '5' that are 7 days old or newer.

    ss.date >= current_date - 7

By analyzing the production data, this where clause has a filter ratio of
0.08, which is far better than the release_code filter both in ratio and in
the number of rows that it can avoid joining.  However, if I had this filter
into the original query, Postgres will not act on it first - and I think it
really should before it even touches release_code.  However, the planner
(using EXPLAIN ANALYZE) will actually pick this filter last and will join
17k rows prematurely to release_code.  In this example, I'd like force
postgres to do the date filter first, join to release_code next, then
finally to shipment.

Another example is filtering by the driver_id, which is a foreign key column
on the shipment table itself to a driver table.  This has a filter ratio of
0.000625 when analyzing the production data.  However, PostgreSQL will not
act on this filter first either.  The sad part is that since drivers are
actually distributed more evenly in the database, it would filter out the
shipment table from 17k to about 10 shipments on average.  In most cases, it
ends up being more than 10, but not more than 60 or 70, which is very good
since some drivers don't have any shipments (I question why they are even in
the database, but that's another story).  As you can see, joining to
shipment_status at this point (using the primary key index from
shipment.current_status_id to shipment_status.id) should be extremely
efficient.  Yet, Postgres's planner/optimizer won't make the right call
until might later in the plan.

> SELECT * FROM shipment_status WHERE release_code_id = constant ORDER BY
> release_code_id DESC, date DESC LIMIT 100;
>
> In this case, if you have an index on (release_code_id, date), the
> planner will use a limited index scan which will yield the rows in index
> order, which will be very fast.

I have done this in other queries where sorting by both release code and
date were important. You are right, it is very fast and I do have this index
in play.  However, most of the time I retreive shipment's when their
shipment_status all have the same release_code, which makes sorting kind of
moot :/  I guess that answers your comment below.

> However, if you just have an index on date, this won't help you.
> In your case, moreover, you don't use release_code_id = constant, but it
> comes from a join. So there may be several different values for
> release_code_id ; thus the planner can't use the optimization, it has to
> find the rows with the release_code_id first. And it can't use the index
> on (release_code_id, date) to get the rows in sorted order precisely
> because there could be several different values for the release_code_id.
> And then it has to sort by date.

Well, the filtered column is actually unique (but it's not the primary key).
Should I just make it the primary key?  Can't postgres be equally efficient
when using other candidate keys as well?  If not, then I will definately
change the design of my database.  I mostly use synthetic keys to make
Hibernate configuration fairly straight-forward and to make it easy so all
of my entities extend from the same base class.

> I hope this makes it clearer. If you are absolutely sure there is only
> one row in release_code with r.filtered_column = '5', then this means
> release_code_id is a constant and your query could get a huge speedup by
> writing it differently.

You mean by avoiding the filter on number and avoiding the join?  You see, I
never thought joining to release_code should be so bad since the table only
has 8 rows in it.

Anyway, I hope my comments provide you with better insight to the problem
I'm having.  I really do appreciate your comments because I think you are
right on target with your direction, discussing things I haven't really
thought up on my own.  I thank you.


pgsql-performance by date:

Previous
From: PFC
Date:
Subject: Re: Performance problem with semi-large tables
Next
From: "N S"
Date:
Subject: Re: Postgres server getting slow!!