Thread: Performance problem with semi-large tables

Performance problem with semi-large tables

From
"Ken Egervari"
Date:
Hi everyone.
 
I'm new to this forum and was wondering if anyone would be kind enough to help me out with a pretty severe performance issue.  I believe the problem to be rather generic, so I'll put it in generic terms.  Since I'm at home and not at work (but this is really bugging me), I can't post any specifics.  However, I think my explanation will suffice.
 
I have 2 tables that are getting large and will only get larger with time (exponentially, as more users sign on to the system).  Right now, a table called 'shipment' contains about 16,000 rows and 'shipment_status' contains about 32,500 rows.  These aren't massive tables (I keep reading about tables with millions of rows), but they will definitely get into 6 digits by next year, and query performance is already quite poor.
 
Now, from what I understand about tuning, you want to specify good filters and provide good indexes on the driving filter as well as on any referential keys used while joining.  This has helped me solve performance problems many times in the past (for example, cutting a query from 2 seconds down to 21 milliseconds).
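
For these two tables, that amounts to something along these lines (just an illustration; the index names are made up):

create index shipment_current_status_id_idx on shipment (current_status_id);
create index shipment_status_release_code_id_idx on shipment_status (release_code_id);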
 
However, I am now tuning queries that operate on these two tables, the filters aren't very good (the best has a filter ratio of 0.125), and the number of rows returned is very large (before taking limits into consideration).
 
For example, consider something like this query that takes ~1 second to finish:
 
select s.*, ss.*
from shipment s, shipment_status ss, release_code r
where s.current_status_id = ss.id
   and ss.release_code_id = r.id
   and r.filtered_column = '5'
order by ss.date desc
limit 100;
 
Release code is just a very small table of 8 rows by looking at the production data, hence the 0.125 filter ratio.  However, the data distribution is not uniform, since the filtered column actually pulls out about 54% of the rows in shipment_status when it joins.  Postgres seems to be doing a sequential scan to pull out all of these rows.  Next, it joins approximately 17,550 rows to shipment.  Since this query has a limit, it only returns the first 100, which seems like a waste.
 
Now, for this query, I know I can filter on the date instead to speed it up.  For example, I could probably search for all the shipments in the last 3 days instead of limiting to 100.  But since this isn't a real production query, I only wanted to show it as an example; many times I cannot filter by the date (and the sort may be on the date or on something else irrelevant to the filter).
 
I'm just stressed about how to make queries like this more efficient, since all I see is a bunch of hash joins and sequential scans taking all kinds of time.
 
I guess here are my 2 questions:
 
1. Should I just beg to change the requirements so that I can make more specific queries and more screens to access those?
2. Can you recommend ways to make postgres act on big tables more efficiently?  I'm not really interested in this specific case (I just made it up).  I'm more interested in general solutions to this general problem of big tables with bad filters, where join order doesn't seem to help much.
 
Thank you very much for your help.
 
Best Regards,
Ken Egervari

Re: Performance problem with semi-large tables

From
Josh Berkus
Date:
Ken,

Actually, your problem isn't that generic, and might be better solved by
dissecting an EXPLAIN ANALYZE.

> 1. Should I just beg to change the requirements so that I can make
> more specific queries and more screens to access those?

This is always good.

> 2. Can you
> recommend ways to make postgres act on big tables more efficiently?  I'm
> not really interested in this specific case (I just made it up).  I'm more
> interested in general solutions to this general problem of big tables
> with bad filters, where join order doesn't seem to help much.

Well, you appear to be using ORDER BY ... LIMIT.   Is there a corresponding
index on the order by criteria?
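
For instance (the index name is just an example), an index that matches the sort
may let the planner walk the index backwards instead of sorting the whole join
result:

CREATE INDEX shipment_status_date_idx ON shipment_status (date);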

--
Josh Berkus
Aglio Database Solutions
San Francisco

Re: Performance problem with semi-large tables

From
"David Parker"
Date:
You don't mention if you have run VACUUM or VACUUM ANALYZE lately. That's generally one of the first things that folks will suggest. If you have a lot of updates then VACUUM will clean up dead tuples; if you have a lot of inserts then VACUUM ANALYZE will update statistics so that the planner can make better decisions (as I understand it).
 
Another data point people will ask for in helping you will be EXPLAIN ANALYZE output from running the queries you think are slowing down.
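
Roughly, using the table names and the query from your mail:

VACUUM ANALYZE shipment;
VACUUM ANALYZE shipment_status;

EXPLAIN ANALYZE
select s.*, ss.*
from shipment s, shipment_status ss, release_code r
where s.current_status_id = ss.id
   and ss.release_code_id = r.id
   and r.filtered_column = '5'
order by ss.date desc
limit 100;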
 
- DAP

Re: Performance problem with semi-large tables

From
PFC
Date:

> select s.*, ss.*
> from shipment s, shipment_status ss, release_code r
> where s.current_status_id = ss.id
>    and ss.release_code_id = r.id
>    and r.filtered_column = '5'
> order by ss.date desc
> limit 100;

> Release code is just a very small table of 8 rows by looking at the
> production data, hence the 0.125 filter ratio.  However, the data
> distribution is not uniform, since the filtered column actually pulls out
> about 54% of the rows in shipment_status when it joins.  Postgres seems
> to be doing a sequential scan to pull out all of these rows.  Next, it
> joins approximately 17,550 rows to shipment.  Since this query has a limit, it
> only returns the first 100, which seems like a waste.

    Well, postgres does what you asked. It will be slow, because you have a
full table join. LIMIT does not change this because the rows have to be
sorted first.

    The date is in shipment_status so you should first get the
shipment_status.id that you need and later join to shipment. This will
avoid the big join :


SELECT s.*, ss.* FROM
    (SELECT * FROM shipment_status WHERE release_code_id IN
        (SELECT r.id FROM release_code r WHERE r.filtered_column = '5')
    ORDER BY date DESC LIMIT 100
    ) as ss, shipment s
WHERE s.current_status_id = ss.id
ORDER BY date DESC LIMIT 100

    Is this better ?

Re: Performance problem with semi-large tables

From
"Ken Egervari"
Date:
Yes, I'm very well aware of VACUUM and VACUUM ANALYZE.  I've even clustered the table on the date index and so on to ensure faster performance.

Re: Performance problem with semi-large tables

From
"Ken Egervari"
Date:
> Well, postgres does what you asked. It will be slow, because you have a
> full table join. LIMIT does not change this because the rows have to be
> sorted first.

I am aware that limit doesn't really affect the execution time all that
much.  It does speed up the ORM though and keeps the rows to a manageable list
so users don't have to look at thousands, which is good enough for me.  My
intention here was that the date would be a good filter.

> The date is in shipment_status so you should first get the
> shipment_status.id that you need and later join to shipment. This will
> avoid the big join :
>
>
> SELECT s.*, ss.* FROM
> (SELECT * FROM shipment_status WHERE release_code_id IN
> (SELECT r.id FROM release_code r WHERE r.filtered_column = '5')
> ORDER BY date DESC LIMIT 100
> ) as ss, shipment s
> WHERE s.current_status_id = ss.id
> ORDER BY date DESC LIMIT 100
>
> Is this better ?

This looks like it might be what I want.  It's not that I was not aware of
the correct join order.  I used Dan Tow's diagram method and learned that
filtering on date first is the best approach, then release code, then finally
shipment for this particular query.  I just didn't know how to tell
PostgreSQL how to do this.

So are you suggesting as a general rule then that sub-queries are the way to
force a specific join order in postgres?  If that is the case, I will do
this from now on.


Re: Performance problem with semi-large tables

From
PFC
Date:
> So are you suggesting as a general rule then that sub-queries are the
> way to force a specific join order in postgres?  If that is the case, I
> will do this from now on.

    I'll try to explain a bit better...
    Here's your original query :

> select s.*, ss.*
> from shipment s, shipment_status ss, release_code r
> where s.current_status_id = ss.id
>    and ss.release_code_id = r.id
>    and r.filtered_column = '5'
> order by ss.date desc
> limit 100;

    If you write something like :

SELECT * FROM shipment_status WHERE release_code_id = constant ORDER BY
release_code_id DESC, date DESC LIMIT 100;

    In this case, if you have an index on (release_code_id, date), the
planner will use a limited index scan which will yield the rows in index
order, which will be very fast.
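
    For example (the index name is mine) :

CREATE INDEX shipment_status_release_date_idx
    ON shipment_status (release_code_id, date);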

    However, if you just have an index on date, this won't help you.
    In your case, moreover, you don't use release_code_id = constant, but it
comes from a join. So there may be several different values for
release_code_id ; thus the planner can't use the optimization, it has to
find the rows with the release_code_id first. And it can't use the index
on (release_code_id, date) to get the rows in sorted order precisely
because there could be several different values for the release_code_id.
And then it has to sort by date.

    I hope this makes it clearer. If you are absolutely sure there is only
one row in release_code with r.filtered_column = '5', then this means
release_code_id is a constant and your query could get a huge speedup by
writing it differently.

Re: Performance problem with semi-large tables

From
"Ken Egervari"
Date:
Thanks again for your response.  I'll try to clarify with some metrics that
took me a few days to gather while figuring out the best join order.

By running some count queries on the production database, I noticed there
were only 8 rows in release_code.  The filtered column is unique, so that
means the filter ratio is 0.125.  However, the data distribution is not
uniform.  When the filtered column is the constant '5', Postgres will join to
54% of the shipment_status rows.  Since shipment_status has 32,000+ rows,
this join is not a very good one to make.

The shipment table has 17k rows, but also due to the distribution of data,
almost every shipment will join to a shipment_status with a release_code of
'5'.  For your information, this column indicates that a shipment has been
"released", as most shipments will move to this state eventually.  The
actual join ratio from shipment_status to shipment is about 98.5% of the
rows in the shipment table, which is still basically 17k rows.

I was simply curious how to make something like this faster.  You see, it's
the table size and the bad filters that are really destroying this query example.
I would never make a query to the database like this in practice, but I have
similar queries that I do make that aren't much better (and can't be due to
business requirements).

For example, let's add another filter to get all the shipments with release
code '5' that are 7 days old or newer.

    ss.date >= current_date - 7
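
So the example query would become something like this (still only a sketch):

select s.*, ss.*
from shipment s, shipment_status ss, release_code r
where s.current_status_id = ss.id
   and ss.release_code_id = r.id
   and r.filtered_column = '5'
   and ss.date >= current_date - 7
order by ss.date desc
limit 100;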

By analyzing the production data, this where clause has a filter ratio of
0.08, which is far better than the release_code filter both in ratio and in
the number of rows that it can avoid joining.  However, if I add this filter
to the original query, Postgres will not act on it first - and I think it
really should before it even touches release_code.  However, the planner
(using EXPLAIN ANALYZE) will actually pick this filter last and will join
17k rows prematurely to release_code.  In this example, I'd like to force
postgres to do the date filter first, join to release_code next, then
finally to shipment.

Another example is filtering by the driver_id, which is a foreign key column
on the shipment table itself to a driver table.  This has a filter ratio of
0.000625 when analyzing the production data.  However, PostgreSQL will not
act on this filter first either.  The sad part is that since drivers are
actually distributed more evenly in the database, it would filter out the
shipment table from 17k to about 10 shipments on average.  In most cases, it
ends up being more than 10, but not more than 60 or 70, which is very good
since some drivers don't have any shipments (I question why they are even in
the database, but that's another story).  As you can see, joining to
shipment_status at this point (using the primary key index from
shipment.current_status_id to shipment_status.id) should be extremely
efficient.  Yet, Postgres's planner/optimizer won't make the right call
until much later in the plan.
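
In other words, a query shaped like this (the driver id is a made-up value, and
it assumes an index on shipment.driver_id) should be able to start from the
driver filter and then follow the primary key into shipment_status:

select s.*, ss.*
from shipment s, shipment_status ss
where s.driver_id = 123
   and ss.id = s.current_status_id
order by ss.date desc
limit 100;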

> SELECT * FROM shipment_status WHERE release_code_id = constant ORDER BY
> release_code_id DESC, date DESC LIMIT 100;
>
> In this case, if you have an index on (release_code_id, date), the
> planner will use a limited index scan which will yield the rows in index
> order, which will be very fast.

I have done this in other queries where sorting by both release code and
date was important. You are right, it is very fast and I do have this index
in play.  However, most of the time I retrieve shipments when their
shipment_status all have the same release_code, which makes sorting kind of
moot :/  I guess that answers your comment below.

> However, if you just have an index on date, this won't help you.
> In your case, moreover, you don't use release_code_id = constant, but it
> comes from a join. So there may be several different values for
> release_code_id ; thus the planner can't use the optimization, it has to
> find the rows with the release_code_id first. And it can't use the index
> on (release_code_id, date) to get the rows in sorted order precisely
> because there could be several different values for the release_code_id.
> And then it has to sort by date.

Well, the filtered column is actually unique (but it's not the primary key).
Should I just make it the primary key?  Can't postgres be equally efficient
when using other candidate keys as well?  If not, then I will definitely
change the design of my database.  I mostly use synthetic keys to make
Hibernate configuration fairly straightforward and to make it easy for all
of my entities to extend from the same base class.

> I hope this makes it clearer. If you are absolutely sure there is only
> one row in release_code with r.filtered_column = '5', then this means
> release_code_id is a constant and your query could get a huge speedup by
> writing it differently.

You mean by avoiding the filter on number and avoiding the join?  You see, I
never thought joining to release_code should be so bad since the table only
has 8 rows in it.

Anyway, I hope my comments provide you with better insight into the problem
I'm having.  I really do appreciate your comments because I think you are
right on target with your direction, discussing things I haven't really
thought up on my own.  I thank you.


Re: Performance problem with semi-large tables

From
PFC
Date:
>> SELECT * FROM shipment_status WHERE release_code_id = constant ORDER BY
>> release_code_id DESC, date DESC LIMIT 100;
>
> I have done this in other queries where sorting by both release code and
> date was important. You are right, it is very fast and I do have this
> index in play.  However, most of the time I retrieve shipments when
> their shipment_status all have the same release_code, which makes
> sorting kind of moot :/  I guess that answers your comment below.

    Ah, well in this case, ORDER BY release_code_id DESC seems of course
useless because you only have one release_code_id, but it is in fact
necessary to make the planner realize it can use the index on
(release_code_id,date) for the ordering. If you just ORDER BY date, the
planner will not use your index.
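
    To spell it out (the constant is just a placeholder) :

-- can return the rows straight from the (release_code_id, date) index :
SELECT * FROM shipment_status
WHERE release_code_id = 42
ORDER BY release_code_id DESC, date DESC LIMIT 100;

-- whereas this form ends up sorting instead of using that index :
SELECT * FROM shipment_status
WHERE release_code_id = 42
ORDER BY date DESC LIMIT 100;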

> Thanks again for your response.  I'll try to clarify with some metrics
> that took me a few days to gather while figuring out the best join order.
>
> By running some count queries on the production database, I noticed
> there were only 8 rows in release_code.  The filtered column is unique,

    Let's forget the shipment table for now.

    So you mean there is a unique, one-to-one relation between
release_code_id and filtered_column ?
    The planner is not able to determine this ahead of time ; and in your
case, it's important that it be unique to be able to use the index to
quickly retrieve the rows in (date DESC) order.
    So if you'll join only to ONE release_code_id, you can do this :

(SELECT * FROM shipment_status WHERE release_code_id =
(SELECT r.id FROM release_code r WHERE r.filtered_column = '5' LIMIT 1)
ORDER BY release_code_id DESC, date DESC LIMIT 100)

    Which is no longer a join and will get your shipment_status_id's very
quickly.
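
    Folded back into your original query, it would look roughly like this
(still a sketch) :

SELECT s.*, ss.* FROM
    (SELECT * FROM shipment_status WHERE release_code_id =
        (SELECT r.id FROM release_code r WHERE r.filtered_column = '5' LIMIT 1)
    ORDER BY release_code_id DESC, date DESC LIMIT 100
    ) as ss, shipment s
WHERE s.current_status_id = ss.id
ORDER BY ss.date DESC LIMIT 100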

> so that means the filter ratio is 0.125.  However, the data distribution
> is not uniform.  When the filtered column is the constant '5', Postgres
> will join to 54% of the shipment_status rows.  Since shipment_status has
> 32,000+ rows, this join is not a very good one to make.

    Sure !

> The shipment table has 17k rows, but also due to the distribution of
> data, almost every shipment will join to a shipment_status with a
> release_code of '5'.  For your information, this column indicates that a
> shipment has been "released", as most shipments will move to this state
> eventually.  The actual join ratio from shipment_status to shipment is
> about 98.5% of the rows in the shipment table, which is still basically
> 17k rows.
>
> I was simply curious how to make something like this faster.  You see,
> it's the table size and the bad filters that are really destroying this query
> example. I would never make a query to the database like this in
> practice, but I have similar queries that I do make that aren't much
> better (and can't be due to business requirements).
>
> For example, let's add another filter to get all the shipments with
> release code '5' that are 7 days old or newer.
>
>     ss.date >= current_date - 7

    It's the order by + limit which makes the query behave badly, and which
forces the use of kludges to get the index used. If you add another condition like
that, it should be a breeze.

> By analyzing the production data, this where clause has a filter ratio
> of 0.08, which is far better than the release_code filter both in ratio
> and in the number of rows that it can avoid joining.  However, if I add
> this filter to the original query, Postgres will not act on it first -
> and I think it really should before it even touches release_code.

    Well, I think so too.
    What about the subqueries I wrote, with the LIMIT inside the subquery ? Any
better ?
    Normally the planner is able to deconstruct subqueries and change the
order as it sees fit, but with LIMITs in them I don't know.

> However, the planner (using EXPLAIN ANALYZE) will actually pick this
> filter last and will join 17k rows prematurely to release_code.  In this
> example, I'd like to force postgres to do the date filter first, join to
> release_code next, then finally to shipment.

    You could use the JOIN keywords to specify the join order yourself.
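
    Something like this, perhaps (untested ; setting join_collapse_limit to 1
makes the planner keep the JOIN order as written) :

SET join_collapse_limit = 1;

SELECT s.*, ss.*
FROM shipment_status ss
JOIN release_code r ON ss.release_code_id = r.id
JOIN shipment s ON s.current_status_id = ss.id
WHERE r.filtered_column = '5'
  AND ss.date >= current_date - 7
ORDER BY ss.date DESC
LIMIT 100;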

> Another example is filtering by the driver_id, which is a foreign key
> column on the shipment table itself to a driver table.  This has a
> filter ratio of 0.000625 when analyzing the production data.  However,
> PostgreSQL will not act on this filter first either.  The sad part is
> that since drivers are actually distributed more evenly in the database,
> it would filter out the shipment table from 17k to about 10 shipments on
> average.  In most cases, it ends up being more than 10, but not more
> than 60 or 70, which is very good since some drivers don't have any
> shipments (I question why they are even in the database, but that's
> another story).  As you can see, joining to shipment_status at this
> point (using the primary key index from shipment.current_status_id to
> shipment_status.id) should be extremely efficient.  Yet, Postgres's
> planner/optimizer won't make the right call until much later in the
> plan.

    And if you select on shipment where driver_id=something, does it
use the index ?

> Well, the filtered column is actually unique (but it's not the primary
> key). Should I just make it the primary key?  Can't postgres be equally

    It won't change anything, so probably not. What will make it faster will
be changing :
WHERE release_code_id IN (SELECT r.id
    into :
WHERE release_code_id = (SELECT r.id

> efficient when using other candidate keys as well?  If not, then I will
> definitely change the design of my database.  I mostly use synthetic
> keys to make Hibernate configuration fairly straightforward and to make
> it easy for all of my entities to extend from the same base class.
>

> You mean by avoiding the filter on number and avoiding the join?  You
> see, I never thought joining to release_code should be so bad since the
> table only has 8 rows in it.

    It's not the join itself that's bad, it's the order by...
    Well, the planner insisting on joining the two big tables before limiting
is also worrying.

> Anyway, I hope my comments provide you with better insight into the
> problem I'm having.  I really do appreciate your comments because I
> think you are right on target with your direction, discussing things I
> haven't really thought up on my own.  I thank you.

    Thanks ;)

Re: Performance problem with semi-large tables

From
Tom Lane
Date:
PFC <lists@boutiquenumerique.com> writes:
>> For example, let's add another filter to get all the shipments with
>> release code '5' that are 7 days old or newer.
>>
>> ss.date >= current_date - 7

>     It's the order by + limit which makes the query behave badly, and which
> forces the use of kludges to get the index used. If you add another condition like
> that, it should be a breeze.

Actually, that date condition has its own problem, namely that the
compared-to value isn't a constant.  The 8.0 planner is able to realize
that this is a pretty selective condition, but prior releases fall back
on a very pessimistic default estimate.  I'm sure that has something to
do with Ken not being able to get it to use an index on date.
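
A common workaround on pre-8.0 releases is to compute the cutoff in the
application and send it as a literal, for instance (the date value shown is
only an example):

select s.*, ss.*
from shipment s, shipment_status ss, release_code r
where s.current_status_id = ss.id
   and ss.release_code_id = r.id
   and r.filtered_column = '5'
   and ss.date >= '2005-01-20'
order by ss.date desc
limit 100;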

            regards, tom lane