Re: Avoiding bad prepared-statement plans. - Mailing list pgsql-hackers

From: Mark Mielke
Subject: Re: Avoiding bad prepared-statement plans.
Msg-id: 4B8782AB.4050504@mark.mielke.cc
In response to: Re: Avoiding bad prepared-statement plans. (Tom Lane <tgl@sss.pgh.pa.us>)
Responses: Re: Avoiding bad prepared-statement plans. (Jeroen Vermeulen <jtv@xs4all.nl>)
List: pgsql-hackers
My preference is to deal with the specific value vs generic value issue.

This issue can affect performance even if PREPARE/EXECUTE is executed 
exactly once.

In the last case I saw, a certain query was executing once every second; 
planned with the specific value it would take < 1 ms, and planned 
generically it would take > 50 ms. That's 5% load on one CPU core to 
accomplish nothing. After analysis, it was clearly a "common value" vs 
"uncommon value" problem. The column in question stored an integer, but 
held only two distinct values across something like 100k rows, and the 
query was for a third value that did not exist. The difference was a 
sequential scan vs an index lookup.
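
To make the shape of the problem concrete, here is a minimal sketch of 
that kind of case. The table, column, and statement names are 
hypothetical, and exact plans will vary by version and configuration:

    -- ~100k rows, but only two distinct values in the column.
    CREATE TABLE events (status integer);
    INSERT INTO events
        SELECT CASE WHEN i % 2 = 0 THEN 1 ELSE 2 END
        FROM generate_series(1, 100000) AS i;
    CREATE INDEX events_status_idx ON events (status);
    ANALYZE events;

    -- Planned with the specific value 3 (absent from the table): the
    -- statistics say almost no rows match, so an index scan wins (< 1 ms).
    EXPLAIN SELECT * FROM events WHERE status = 3;

    -- Planned generically for "status = $1": the planner must allow for
    -- the parameter being one of the common values, so a sequential scan
    -- over 100k rows looks safer (> 50 ms).
    PREPARE q(integer) AS SELECT * FROM events WHERE status = $1;
    EXPLAIN EXECUTE q(3);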

I do not know whether the application was doing PREPARE/EXECUTE each 
time, or PREPARE once in advance and then EXECUTE each time after that, 
but I don't think it matters: both cases deserve attention, and the 
problem is the same in each. Even a single run of the generic plan costs 
50+ times the combined cost of planning and executing with the specific plan.

Re-planning a generic plan into another generic plan may generate zero 
benefit, at a measurable cost. More on this below.

All the points about using execution time (ms) as a trigger seem invalid 
to me. There are many reasons why execution time could increase, and many 
of them have nothing to do with plan efficiency. Again, re-planning 
because execution time is high, or because a ratio of execution times is 
high, does not indicate that re-planning will produce a better plan. The 
planning process neither measures nor predicts execution time.

My idea of an optimal system is as follows:

1) Prepare gathers and caches data about the tables involved in the 
query, including column statistics that are likely to be required during 
the planning process, but prepare does not run the planning process.

2) Execute runs the planning process re-using data cached by prepare, 
and then executes the plan.

3) Advanced: Execute may cache the selected plan for re-use, but only if 
it can identify a set of criteria that allow the cached plan to be tested 
and invalidated when the parameter's nature changes such that re-planning 
would likely choose a different plan. Execute may cache multiple plans 
against a single prepared statement, provided that each cached plan 
identifies its invalidation criteria (see the sketch after this list).

4) Even more advanced: Prepare may identify elements of the plan that 
will always be the same, no matter what parameters are specified, and 
cache these partial results for substitution into the planning phase when 
execute is run. (This effectively lifts planning work from execute to 
prepare, but only where it makes obvious [= cheap to detect] sense.)
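
As a rough illustration of point 3, continuing the hypothetical example 
above: the same prepared statement wants different plans for different 
parameter values, and each cached plan would carry the criteria under 
which it remains valid:

    -- Common value: roughly half the table matches; a sequential scan wins.
    EXPLAIN SELECT * FROM events WHERE status = 1;
    -- Absent value: almost nothing matches; an index scan wins.
    EXPLAIN SELECT * FROM events WHERE status = 3;
    -- Point 3 would let execute cache both plans against the one prepared
    -- statement, each tagged with a criterion such as "parameter is one of
    -- the most common values" vs "parameter is not a common value".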

This treats the whole of statement planning and execution as a pipeline, 
lengthening the pipeline and shifting some of its stages from prepare to 
execute. It has the benefit of a fast prepare/execute whether execute is 
invoked only once or many times. The effect is that all statements are 
specifically planned, but specific plans are re-used wherever possible.

To support the case of changing data, I think the analyze process should 
be able to force invalidation of cached plans, and either force the 
cached column statistics for prepared statements to be invalidated and 
re-queried on demand, or push new statistics directly into the prepared 
statements. It makes no sense (to me) to re-plan for the same parameters 
until an analyze has been done, which tells me that analyze is the event 
that should trigger re-planning.
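
As a sketch of that intended behaviour, again continuing the hypothetical 
example: until new statistics arrive, re-planning for the same parameter 
cannot choose differently, but after an analyze it can:

    -- The data changes: value 3 becomes common.
    INSERT INTO events SELECT 3 FROM generate_series(1, 50000);

    -- Before ANALYZE runs, the cached statistics still say 3 is absent, so
    -- re-planning q(3) would just reproduce the old index-scan plan.
    ANALYZE events;

    -- Under the proposal, ANALYZE invalidates the cached plans (and the
    -- statistics cached at prepare time), so the next execute re-plans with
    -- fresh statistics; a sequential scan likely wins now.
    EXECUTE q(3);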

I think anything less than the above will increase the performance of 
some queries while decreasing the performance of others. It might be 
possible to guess which queries are more valuable to people than others, 
and hard-code solutions for those specific queries, but hard-coded 
solutions will probably always be "lowest hanging fruit" fixes.

After writing this, I'm pretty sure that implementing the above in 
PostgreSQL would be difficult, and it could be a valid concern that the 
investment is not worth the benefit at this time. It's a tough problem.

My $0.01 CDN. :-)

Cheers,
mark


