Re: How to force Nested Loop plan? - Mailing list pgsql-performance

From Rob Nagler
Subject Re: How to force Nested Loop plan?
Date
Msg-id 16208.50934.351000.90704@gargle.gargle.HOWL
Whole thread Raw
In response to Re: How to force Nested Loop plan?  (Tom Lane <tgl@sss.pgh.pa.us>)
Responses Re: How to force Nested Loop plan?
Re: How to force Nested Loop plan?
List pgsql-performance
Tom Lane writes:
> The reason the planner does not much like this plan is that it's
> estimating that quite a lot of rows will have to be hit in min_date_time
> order before it finds enough rows with server_id = 21.  Thus the high
> cost estimate for the above step.

Thanks for the speedy and useful reply!  More questions follow. :)

Very interesting.  How does it know "quite a lot"?  Is there something
I can do to get the planner to analyze the data better?

> I suspect that the reason you like this plan is that there's actually
> substantial correlation between server_id and min_date_time, such that
> the required rows are found quickly.  Before trying to force the planner
> into what you consider an optimal plan, you had better ask yourself
> whether you can expect that correlation to hold up in the future.
> If not, your plan could become pessimal pretty quickly.

The correlation holds.  min_date_time increases over time as records
are inserted.  server_id is uniformly distributed over time.  There's
no randomness.  There is at least one 21 record for every value of
min_date_time.  21 is a special server_id containing aggregate
(denormalized) data for the other servers.  I thought about putting it
in a separate table, but this would complicate the code as the data is
identical to the non-aggregated case.

Do you have any suggestions for organizing the data/query now that you
know this?

> I'd suggest creating a double-column index:

Thanks.  I'll try this.

I'm a very big fan of declarative programming.  However, there's a
danger in declarative programming when the interperter isn't smart
enough.  When I add this index, I will slow down inserts (about
20K/day) and increase data size (this is the second largest table in
the database).  Moreover, if the planner is improved, I've should fix
my code, delete the index, etc.

Is there a way of giving the planner direct hints as in Oracle?  They
can be ignored when the optimizer is improved, just as "register" is
ignored by C compilers nowadays.

Adding the extra index and ORDER BY is also not easy in our case.  The
query is dynamically generated.  I can force the query ORDER BY to be
whatever I want, but I would lose the ability to do interesting
things, like the automatic generation of ORDER BY when someone clicks
on a column header in the application.  Indeed there are times when
people want to sort on other columns in the query. I reduced the
problem to the salient details for my post to this board.  What if the
ORDER BY was:

    ORDER BY aa_t.server_id DESC, cc_t.name ASC

Would the planner do the right thing?

> PS: does server_id really need to be NUMERIC?  Why not integer, or at
> worst bigint?

It is a NUMERIC(18).  It could be a bigint.  What would be the change
in performance of this query if we changed it to bigint?

BTW, my customer is probably going to be switching to Oracle.  This
particular query has been one of the reasons.  Maybe this change will
help us stay with Postgres.

Thanks,
Rob



pgsql-performance by date:

Previous
From: "Matt Clark"
Date:
Subject: Re: Hardware recommendations to scale to silly load
Next
From: Ron Johnson
Date:
Subject: Re: Selecting random rows efficiently