Thread: Caching of Queries

Caching of Queries

From
Scott Kirkwood
Date:
I couldn't find anything in the docs or in the mailing list on this,
but it is something that Oracle appears to do as does MySQL.
The idea, I believe, is to do a quick (hash) string lookup of the
query and if it's exactly the same as another query that has been done
recently to re-use the old parse tree.
It should save the time of doing the parsing of the SQL and looking up
the object in the system tables.
It should probably go through the planner again because values passed
as parameters may have changed. Although, for extra points it could
look at the previous query plan as a hint.
On the surface it looks like an easy enhancement, but what do I know?
I suppose it would benefit mostly those programs that use a lot of
PQexecParams() with simple queries where a greater percentage of the
time is spent parsing the SQL rather than building the execute plan.
What do you think?

Re: Caching of Queries

From
Tom Lane
Date:
Scott Kirkwood <scottakirkwood@gmail.com> writes:
> What do you think?

I think this would allow the problems of cached plans to bite
applications that were previously not subject to them :-(.
An app that wants plan re-use can use PREPARE to identify the
queries that are going to be re-executed.

            regards, tom lane

Re: Caching of Queries

From
Neil Conway
Date:
On Thu, 2004-09-23 at 05:59, Tom Lane wrote:
> I think this would allow the problems of cached plans to bite
> applications that were previously not subject to them :-(.
> An app that wants plan re-use can use PREPARE to identify the
> queries that are going to be re-executed.

I agree; if you want to do some work in this area, making improvements
to PREPARE would IMHO be the best bet. For example, some people have
talked about having PREPARE store queries in shared memory. Another idea
would be to improve the quality of the plan we generate at PREPARE time:
for instance you could generate 'n' plans for various combinations of
input parameters, and then choose the best query plan at EXECUTE time.
It's a difficult problem to solve, however (consider multiple parameters
to PREPARE, for example).

-Neil



Re: Caching of Queries

From
"Gary Doades"
Date:
On 22 Sep 2004 at 15:59, Tom Lane wrote:

> Scott Kirkwood <scottakirkwood@gmail.com> writes:
> > What do you think?
>
> I think this would allow the problems of cached plans to bite
> applications that were previously not subject to them :-(.
> An app that wants plan re-use can use PREPARE to identify the
> queries that are going to be re-executed.
>
>             regards, tom lane
>

And then there are the people that would like to upgrade and get a
performance gain without having to change their programs. A simple
conf flag could turn query/plan caching off for all those that rely on each
statement being re-planned.

This is where SQLServer etc. tend to get big wins. I know from direct
comparisons that SQLServer often takes quite a bit longer to parse/plan
a select statement than Postgres, but wins out overall from its
query/plan caching.

Regards,
Gary.


Re: Caching of Queries

From
Gaetano Mendola
Date:
Neil Conway wrote:
> Another idea would be to improve the quality of the plan we generate at PREPARE time:
> for instance you could generate 'n' plans for various combinations of
> input parameters, and then choose the best query plan at EXECUTE time.
> It's a difficult problem to solve, however (consider multiple parameters
> to PREPARE, for example).

Do you mean store different plans for each different histogram segment ?

Regards
Gaetano Mendola


Re: Caching of Queries

From
Harald Fuchs
Date:
In article <b3dc511704092212502a2ddb09@mail.gmail.com>,
Scott Kirkwood <scottakirkwood@gmail.com> writes:

> I couldn't find anything in the docs or in the mailing list on this,
> but it is something that Oracle appears to do as does MySQL.
> The idea, I believe, is to do a quick (hash) string lookup of the
> query and if it's exactly the same as another query that has been done
> recently to re-use the old parse tree.

That's not was MySQL is doing.  MySQL caches not the query plan, but
the result set for the (hashed) query string.  If the same query comes
again, it is not executed at all (unless one of the tables involved
have been changed meanwhile).

Re: Caching of Queries

From
Mr Pink
Date:
Not knowing anything about the internals of pg, I don't know how this relates, but in theory,
query plan caching is not just about saving time re-planning queries, it's about scalability.
Optimizing queries requires shared locks on the database metadata, which, as I understand it
causes contention and serialization, which kills scalability.

I read this thread from last to first, and I'm not sure if I missed something, but if pg isnt
caching plans, then I would say plan caching should be a top priority for future enhancements. It
needn't be complex either: if the SQL string is the same, and none of the tables involved in the
query have changed (in structure), then re-use the cached plan. Basically, DDL and updated
statistics would have to invalidate plans for affected tables.

Preferably, it should work equally for prepared statements and those not pre-prepared. If you're
not using prepare (and bind variables) though, your plan caching down the drain anyway...

I don't think that re-optimizing based on values of bind variables is needed. It seems like it
could actually be counter-productive and difficult to asses it's impact.

That's the way I see it anyway.

:)

--- Scott Kirkwood <scottakirkwood@gmail.com> wrote:

> I couldn't find anything in the docs or in the mailing list on this,
> but it is something that Oracle appears to do as does MySQL.
> The idea, I believe, is to do a quick (hash) string lookup of the
> query and if it's exactly the same as another query that has been done
> recently to re-use the old parse tree.
> It should save the time of doing the parsing of the SQL and looking up
> the object in the system tables.
> It should probably go through the planner again because values passed
> as parameters may have changed. Although, for extra points it could
> look at the previous query plan as a hint.
> On the surface it looks like an easy enhancement, but what do I know?
> I suppose it would benefit mostly those programs that use a lot of
> PQexecParams() with simple queries where a greater percentage of the
> time is spent parsing the SQL rather than building the execute plan.
> What do you think?
>
> ---------------------------(end of broadcast)---------------------------
> TIP 7: don't forget to increase your free space map settings
>




_______________________________
Do you Yahoo!?
Declare Yourself - Register online to vote today!
http://vote.yahoo.com

Re: Caching of Queries

From
"Jason Coene"
Date:
I'm not an expert, but I've been hunting down a killer performance problem
for a while now.  It seems this may be the cause.

At peak load, our database slows to a trickle.  The CPU and disk utilization
are normal - 20-30% used CPU and disk performance good.

All of our "postgres" processes end up in the "semwai" state - seemingly
waiting on other queries to complete.  If the system isn't taxed in CPU or
disk, I have a good feeling that this may be the cause.  I didn't know that
planning queries could create such a gridlock, but based on Mr Pink's
explanation, it sounds like a very real possibility.

We're running on SELECT's, and the number of locks on our "high traffic"
tables grows to the hundreds.  If it's not the SELECT locking (and we don't
get that many INSERT/UPDATE on these tables), could the planner be doing it?

At peak load (~ 1000 queries/sec on highest traffic table, all very
similar), the serialized queries pile up and essentially create a DoS on our
service - requiring a restart of the PG daemon.  Upon stop & start, it's
back to normal.

I've looked at PREPARE, but apparently it only lasts per-session - that's
worthless in our case (web based service, one connection per data-requiring
connection).

Does this sound plausible?  Is there an alternative way to do this that I
don't know about?  Additionally, in our case, I personally don't see any
downside to caching and using the same query plan when the only thing
substituted are variables.  In fact, I'd imagine it would help performance
significantly in high-volume web applications.

Thanks,

Jason

> -----Original Message-----
> From: pgsql-performance-owner@postgresql.org [mailto:pgsql-performance-
> owner@postgresql.org] On Behalf Of Mr Pink
> Sent: Thursday, September 23, 2004 11:29 AM
> To: Scott Kirkwood; pgsql-performance@postgresql.org
> Subject: Re: [PERFORM] Caching of Queries
>
> Not knowing anything about the internals of pg, I don't know how this
> relates, but in theory,
> query plan caching is not just about saving time re-planning queries, it's
> about scalability.
> Optimizing queries requires shared locks on the database metadata, which,
> as I understand it
> causes contention and serialization, which kills scalability.
>
> I read this thread from last to first, and I'm not sure if I missed
> something, but if pg isnt
> caching plans, then I would say plan caching should be a top priority for
> future enhancements. It
> needn't be complex either: if the SQL string is the same, and none of the
> tables involved in the
> query have changed (in structure), then re-use the cached plan. Basically,
> DDL and updated
> statistics would have to invalidate plans for affected tables.
>
> Preferably, it should work equally for prepared statements and those not
> pre-prepared. If you're
> not using prepare (and bind variables) though, your plan caching down the
> drain anyway...
>
> I don't think that re-optimizing based on values of bind variables is
> needed. It seems like it
> could actually be counter-productive and difficult to asses it's impact.
>
> That's the way I see it anyway.
>
> :)
>


Re: Caching of Queries

From
Tom Lane
Date:
"Jason Coene" <jcoene@gotfrag.com> writes:
> All of our "postgres" processes end up in the "semwai" state - seemingly
> waiting on other queries to complete.  If the system isn't taxed in CPU or
> disk, I have a good feeling that this may be the cause.

Whatever that is, I'll bet lunch that it's got 0 to do with caching
query plans.  Can you get stack tracebacks from some of the stuck
processes?  What do they show in "ps"?

            regards, tom lane

Re: Caching of Queries

From
Gaetano Mendola
Date:
Jason Coene wrote:
> I'm not an expert, but I've been hunting down a killer performance problem
> for a while now.  It seems this may be the cause.
>
> At peak load, our database slows to a trickle.  The CPU and disk utilization
> are normal - 20-30% used CPU and disk performance good.

For a peak load 20-30% used CPU this mean you reached your IO bottleneck.

> All of our "postgres" processes end up in the "semwai" state - seemingly
> waiting on other queries to complete.  If the system isn't taxed in CPU or
> disk, I have a good feeling that this may be the cause.  I didn't know that
> planning queries could create such a gridlock, but based on Mr Pink's
> explanation, it sounds like a very real possibility.
>
> We're running on SELECT's, and the number of locks on our "high traffic"
> tables grows to the hundreds.  If it's not the SELECT locking (and we don't
> get that many INSERT/UPDATE on these tables), could the planner be doing it?
>
> At peak load (~ 1000 queries/sec on highest traffic table, all very
> similar), the serialized queries pile up and essentially create a DoS on our
> service - requiring a restart of the PG daemon.  Upon stop & start, it's
> back to normal.

Give us informations on this queries, a explain analyze could be a good start
point.

> I've looked at PREPARE, but apparently it only lasts per-session - that's
> worthless in our case (web based service, one connection per data-requiring
> connection).

Trust me the PREPARE is not doing miracle in shenarios like yours . If you use postgres
in a web service environment what you can use is a connection pool ( look for pgpoll IIRC ),
if you use a CMS then try to enable the cache in order to avoid to hit the DB for each
request.




Regards
Gaetano Mendola

Re: Caching of Queries

From
Date:
Scott:

We have seen similar issues when we have had massive load on our web
server. My determination was that simply the act of spawning and
stopping postgres sessions was very heavy on the box, and by
implementing connection pooling (sqlrelay), we got much higher
throughput, and better response on the server then we would get any
other way.

-----Original Message-----
From: pgsql-performance-owner@postgresql.org
[mailto:pgsql-performance-owner@postgresql.org] On Behalf Of Jason Coene
Sent: Thursday, September 23, 2004 10:53 AM
To: 'Mr Pink'; 'Scott Kirkwood'
Cc: pgsql-performance@postgresql.org
Subject: Re: [PERFORM] Caching of Queries

I'm not an expert, but I've been hunting down a killer performance
problem
for a while now.  It seems this may be the cause.

At peak load, our database slows to a trickle.  The CPU and disk
utilization
are normal - 20-30% used CPU and disk performance good.

All of our "postgres" processes end up in the "semwai" state - seemingly
waiting on other queries to complete.  If the system isn't taxed in CPU
or
disk, I have a good feeling that this may be the cause.  I didn't know
that
planning queries could create such a gridlock, but based on Mr Pink's
explanation, it sounds like a very real possibility.

We're running on SELECT's, and the number of locks on our "high traffic"
tables grows to the hundreds.  If it's not the SELECT locking (and we
don't
get that many INSERT/UPDATE on these tables), could the planner be doing
it?

At peak load (~ 1000 queries/sec on highest traffic table, all very
similar), the serialized queries pile up and essentially create a DoS on
our
service - requiring a restart of the PG daemon.  Upon stop & start, it's
back to normal.

I've looked at PREPARE, but apparently it only lasts per-session -
that's
worthless in our case (web based service, one connection per
data-requiring
connection).

Does this sound plausible?  Is there an alternative way to do this that
I
don't know about?  Additionally, in our case, I personally don't see any
downside to caching and using the same query plan when the only thing
substituted are variables.  In fact, I'd imagine it would help
performance
significantly in high-volume web applications.

Thanks,

Jason

> -----Original Message-----
> From: pgsql-performance-owner@postgresql.org
[mailto:pgsql-performance-
> owner@postgresql.org] On Behalf Of Mr Pink
> Sent: Thursday, September 23, 2004 11:29 AM
> To: Scott Kirkwood; pgsql-performance@postgresql.org
> Subject: Re: [PERFORM] Caching of Queries
>
> Not knowing anything about the internals of pg, I don't know how this
> relates, but in theory,
> query plan caching is not just about saving time re-planning queries,
it's
> about scalability.
> Optimizing queries requires shared locks on the database metadata,
which,
> as I understand it
> causes contention and serialization, which kills scalability.
>
> I read this thread from last to first, and I'm not sure if I missed
> something, but if pg isnt
> caching plans, then I would say plan caching should be a top priority
for
> future enhancements. It
> needn't be complex either: if the SQL string is the same, and none of
the
> tables involved in the
> query have changed (in structure), then re-use the cached plan.
Basically,
> DDL and updated
> statistics would have to invalidate plans for affected tables.
>
> Preferably, it should work equally for prepared statements and those
not
> pre-prepared. If you're
> not using prepare (and bind variables) though, your plan caching down
the
> drain anyway...
>
> I don't think that re-optimizing based on values of bind variables is
> needed. It seems like it
> could actually be counter-productive and difficult to asses it's
impact.
>
> That's the way I see it anyway.
>
> :)
>


---------------------------(end of broadcast)---------------------------
TIP 1: subscribe and unsubscribe commands go to majordomo@postgresql.org

Re: Caching of Queries

From
"Jason Coene"
Date:
Hi Tom,

Easily recreated with Apache benchmark, "ab -n 30000 -c 3000
http://webserver ".  This runs 1 query per page, everything else is cached
on webserver.

The lone query:

SELECT
    id,
    gameid,
    forumid,
    subject
  FROM threads
  WHERE nuked = 0
  ORDER BY nuked DESC,
    lastpost DESC LIMIT 8

Limit  (cost=0.00..1.99 rows=8 width=39) (actual time=27.865..28.027 rows=8
loops=1)
   ->  Index Scan Backward using threads_ix_nuked_lastpost on threads
(cost=0.0 0..16824.36 rows=67511 width=39) (actual time=27.856..27.989
rows=8 loops=1)
         Filter: (nuked = 0)
 Total runtime: 28.175 ms

I'm not sure how I go about getting the stack traceback you need.  Any info
on this?  Results of "ps" below.  System is dual xeon 2.6, 2gb ram, hardware
raid 10 running FreeBSD 5.2.1.

Jason

last pid: 96094;  load averages:  0.22,  0.35,  0.38
up 19+20:50:37  13:10:45
161 processes: 2 running, 151 sleeping, 8 lock
CPU states: 12.2% user,  0.0% nice, 16.9% system,  1.6% interrupt, 69.4%
idle
Mem: 120M Active, 1544M Inact, 194M Wired, 62M Cache, 112M Buf, 2996K Free
Swap: 4096M Total, 4096M Free

  PID USERNAME PRI NICE   SIZE    RES STATE  C   TIME   WCPU    CPU COMMAND
50557 pgsql     98    0 95276K  4860K select 0  24:00  0.59%  0.59% postgres
95969 pgsql      4    0 96048K 34272K sbwait 0   0:00  2.10%  0.29% postgres
95977 pgsql     -4    0 96048K 29620K semwai 2   0:00  1.40%  0.20% postgres
96017 pgsql      4    0 96048K 34280K sbwait 0   0:00  2.05%  0.20% postgres
95976 pgsql     -4    0 96048K 30564K semwai 3   0:00  1.05%  0.15% postgres
95970 pgsql     -4    0 96048K 24404K semwai 1   0:00  1.05%  0.15% postgres
95972 pgsql     -4    0 96048K 21060K semwai 1   0:00  1.05%  0.15% postgres
96053 pgsql     -4    0 96048K 24140K semwai 3   0:00  1.54%  0.15% postgres
96024 pgsql     -4    0 96048K 22192K semwai 3   0:00  1.54%  0.15% postgres
95985 pgsql     -4    0 96048K 15208K semwai 3   0:00  1.54%  0.15% postgres
96033 pgsql     98    0 95992K  7812K *Giant 2   0:00  1.54%  0.15% postgres
95973 pgsql     -4    0 96048K 30936K semwai 3   0:00  0.70%  0.10% postgres
95966 pgsql      4    0 96048K 34272K sbwait 0   0:00  0.70%  0.10% postgres
95983 pgsql      4    0 96048K 34272K sbwait 2   0:00  1.03%  0.10% postgres
95962 pgsql      4    0 96048K 34268K sbwait 2   0:00  0.70%  0.10% postgres
95968 pgsql     -4    0 96048K 26232K semwai 2   0:00  0.70%  0.10% postgres
95959 pgsql      4    0 96048K 34268K sbwait 2   0:00  0.70%  0.10% postgres

> -----Original Message-----
> From: Tom Lane [mailto:tgl@sss.pgh.pa.us]
> Sent: Thursday, September 23, 2004 1:06 PM
> To: Jason Coene
> Cc: 'Mr Pink'; 'Scott Kirkwood'; pgsql-performance@postgresql.org
> Subject: Re: [PERFORM] Caching of Queries
>
> "Jason Coene" <jcoene@gotfrag.com> writes:
> > All of our "postgres" processes end up in the "semwai" state - seemingly
> > waiting on other queries to complete.  If the system isn't taxed in CPU
> or
> > disk, I have a good feeling that this may be the cause.
>
> Whatever that is, I'll bet lunch that it's got 0 to do with caching
> query plans.  Can you get stack tracebacks from some of the stuck
> processes?  What do they show in "ps"?
>
>             regards, tom lane


Re: Caching of Queries

From
Greg Stark
Date:
"Jason Coene" <jcoene@gotfrag.com> writes:

> All of our "postgres" processes end up in the "semwai" state - seemingly
> waiting on other queries to complete.  If the system isn't taxed in CPU or
> disk, I have a good feeling that this may be the cause.

Well, it's possible contention of some sort is an issue but it's not clear
that it's planning related contention.

> We're running on SELECT's, and the number of locks on our "high traffic"
> tables grows to the hundreds.

Where are you seeing this? What information do you have about these locks?

> I've looked at PREPARE, but apparently it only lasts per-session - that's
> worthless in our case (web based service, one connection per data-requiring
> connection).

Well the connection time in postgres is pretty quick. But a lot of other
things, including prepared queries but also including other factors are a lot
more effective if you have long-lived sessions.

I would strongly recommend you consider some sort of persistent database
connection for your application. Most web based services run queries from a
single source base where all the queries are written in-house. In that
situation you can ensure that one request never leaves the session in an
unusual state (like setting guc variables strangely, or leaving a transaction
open, or whatever).

That saves you the reconnect time, which as I said is actually small, but
could still be contributing to your problem. I think it also makes the buffer
cache more effective as well. And It also means you can prepare all your
queries and reuse them on subsequent requests.

The nice thing about web based services is that while each page only executes
each query once, you tend to get the same pages over and over thousands of
times. So if they prepare their queries the first time around they can reuse
those prepared queries thousands of times.

Using a text cache of the query string on the server side is just a
work-around for failing to do that on the client side. It's much more
efficient and more flexible to do it on the client-side.

--
greg

Re: Caching of Queries

From
Tom Lane
Date:
"Jason Coene" <jcoene@gotfrag.com> writes:
> I'm not sure how I go about getting the stack traceback you need.  Any info
> on this?  Results of "ps" below.  System is dual xeon 2.6, 2gb ram, hardware
> raid 10 running FreeBSD 5.2.1.

Hmm.  Dual Xeon sets off alarm bells ...

I think you are probably looking at the same problem previously reported
by Josh Berkus among others.  Does the rate of context swaps shown by
vmstat go through the roof when this happens?  If you strace or ktrace
one of the backends, do you see lots of semop()s and little else?

Check the archives for this thread among others:
http://archives.postgresql.org/pgsql-performance/2004-04/msg00249.php
The test case you are talking about is a tight indexscan loop, which
is pretty much the same scenario as here:
http://archives.postgresql.org/pgsql-performance/2004-04/msg00280.php

The fundamental problem is heavy contention for access to a shared data
structure.  We're still looking for good solutions, but in the context
of this thread it's worth pointing out that a shared query-plan cache
would itself be subject to heavy contention, and arguably would make
this sort of problem worse not better.

            regards, tom lane

Re: Caching of Queries

From
"Matt Clark"
Date:
> I've looked at PREPARE, but apparently it only lasts
> per-session - that's worthless in our case (web based
> service, one connection per data-requiring connection).

That's a non-sequitur.  Most 'normal' high volume web apps have persistent
DB connections, one per http server process.  Are you really dropping DB
connections and reconnecting each time a new HTTP request comes in?

M


Re: Caching of Queries

From
Gaetano Mendola
Date:
Jason Coene wrote:
> Hi Tom,
>
> Easily recreated with Apache benchmark, "ab -n 30000 -c 3000
> http://webserver ".  This runs 1 query per page, everything else is cached
> on webserver.

That test require 30000 access with 3000 connections that is not a normal
load. Describe us your HW.

3000 connections means a very huge load, may you provide also the result of
"vmstat 5" my webserver trash already with -c 120 !

how many connection your postgres can manage ?

You have to consider to use a connection pool with that ammount of connections.


Regards
Gaetano Mendola

Re: Caching of Queries

From
Markus Schaber
Date:
Hi, Jason,

On Thu, 23 Sep 2004 12:53:25 -0400
"Jason Coene" <jcoene@gotfrag.com> wrote:

> I've looked at PREPARE, but apparently it only lasts per-session - that's
> worthless in our case (web based service, one connection per data-requiring
> connection).

This sounds like the loads of connection init and close may be the
reason for the slowdown. Can you use connection pooling in your service?

HTH,
Markus


--
markus schaber | dipl. informatiker
logi-track ag | rennweg 14-16 | ch 8001 zürich
phone +41-43-888 62 52 | fax +41-43-888 62 53
mailto:schabios@logi-track.com | www.logi-track.com

Re: Caching of Queries

From
Josh Berkus
Date:
Tom,

> I think you are probably looking at the same problem previously reported
> by Josh Berkus among others.  Does the rate of context swaps shown by
> vmstat go through the roof when this happens?  If you strace or ktrace
> one of the backends, do you see lots of semop()s and little else?

That would be interesting.   Previously we've only demonstrated the problem on
long-running queries, but I suppose it could also affect massive concurrent
query access.

--
Josh Berkus
Aglio Database Solutions
San Francisco

Re: Caching of Queries

From
Tom Lane
Date:
Josh Berkus <josh@agliodbs.com> writes:
>> I think you are probably looking at the same problem previously reported
>> by Josh Berkus among others.

> That would be interesting.  Previously we've only demonstrated the
> problem on long-running queries, but I suppose it could also affect
> massive concurrent query access.

Well, the test cases we used were designed to get the system into a
tight loop of grabbing and releasing shared buffers --- a long-running
index scan is certainly one of the best ways to do that, but there are
others.

I hadn't focused before on the point that Jason is launching a new
connection for every query.  In that scenario I think the bulk of the
cycles are going to go into loading the per-backend catalog caches with
the system catalog rows that are needed to parse and plan the query.
The catalog fetches to get those rows are effectively mini-queries
with preset indexscan plans, so it's not hard to believe that they'd be
hitting the BufMgrLock nearly as hard as a tight indexscan loop.  Once
all the pages needed are cached in shared buffers, there's no I/O delays
to break the loop, and so you could indeed get into the context swap
storm regime we saw before.

I concur with the thought that using persistent connections might go a
long way towards alleviating his problem.

            regards, tom lane

Re: Caching of Queries

From
"Jason Coene"
Date:
Hi All,

It does sound like we should be pooling connections somehow.  I'll be
looking at implementing that shortly.  I'd really like to understand what
the actual problem is, though.

Sorry, I meant 30,000 with 300 connections - not 3,000.  The 300 connections
/ second is realistic, if not underestimated.  As is the nature of our site
(realtime information about online gaming), there's a huge fan base and as a
big upset happens, we'll do 50,000 page views in a span of 3-5 minutes.

I get the same results with:

ab -n 10000 -c 150 http://www.gotfrag.com/portal/news/

I've attached results from the above test, showing open locks, top output,
and vmstat 5.

Tom, I've run the test described in:

http://archives.postgresql.org/pgsql-performance/2004-04/msg00280.php

Results attached in mptest.txt.  The box did experience the same problems as
we've seen before.  I ran it under a separate database (test), and it still
caused our other queries to slow significantly from our production database
(gf) - semwait again.

It does look like the "cs" column under CPU (which I'd assume is Context
Swap) does bump up significantly (10x or better) during both my ab test, and
the test you suggested in that archived message.

Reading the first thread you pointed out (2004-04/msg00249.php), Josh Berkus
was questioning the ServerWorks chipsets.  We're running on the Intel E7501
Chipset (MSI board).  Our CPU's are 2.66 GHz with 533MHz FSB, Hyperthreading
enabled.  Unfortunately, I don't have physical access to the machine to turn
HT off.


Thanks,

Jason



> -----Original Message-----
> From: Gaetano Mendola [mailto:mendola@bigfoot.com]
> Sent: Thursday, September 23, 2004 1:41 PM
> To: Jason Coene
> Subject: Re: Caching of Queries
>
> Jason Coene wrote:
> > Hi Tom,
> >
> > Easily recreated with Apache benchmark, "ab -n 30000 -c 3000
> > http://webserver ".  This runs 1 query per page, everything else is
> cached
> > on webserver.
>
> That test require 30000 access with 3000 connections that is not a normal
> load. Describe us your HW.
>
> 3000 connections means a very huge load, may you provide also the result
> of
> "vmstat 5" my webserver trash already with -c 120 !
>
> how many connection your postgres can manage ?
>
> You have to consider to use a connection pool with that ammount of
> connections.
>
>
> Regards
> Gaetano Mendola


Attachment

Re: Caching of Queries

From
"Jason Coene"
Date:
Update:

I just tried running the same test (ab with 150 concurrent connections)
while connecting to postgres through 35 persistent connections (PHP
library), and had roughly the same type of results.  This should eliminate
the "new connection" overhead.  I've attached top and vmstat.  I let it run
until it had completed 800 requests.  Unless I'm missing something, there's
more than the "new connection" IO load here.

Jason

> -----Original Message-----
> From: Jason Coene [mailto:jcoene@gotfrag.com]
> Sent: Thursday, September 23, 2004 3:08 PM
> To: pgsql-performance@postgresql.org
> Cc: mendola@bigfoot.com; tgl@sss.pgh.pa.us; josh@agliodbs.com
> Subject: RE: Caching of Queries
>
> Hi All,
>
> It does sound like we should be pooling connections somehow.  I'll be
> looking at implementing that shortly.  I'd really like to understand what
> the actual problem is, though.
>
> Sorry, I meant 30,000 with 300 connections - not 3,000.  The 300
> connections
> / second is realistic, if not underestimated.  As is the nature of our
> site
> (realtime information about online gaming), there's a huge fan base and as
> a
> big upset happens, we'll do 50,000 page views in a span of 3-5 minutes.
>
> I get the same results with:
>
> ab -n 10000 -c 150 http://www.gotfrag.com/portal/news/
>
> I've attached results from the above test, showing open locks, top output,
> and vmstat 5.
>
> Tom, I've run the test described in:
>
> http://archives.postgresql.org/pgsql-performance/2004-04/msg00280.php
>
> Results attached in mptest.txt.  The box did experience the same problems
> as
> we've seen before.  I ran it under a separate database (test), and it
> still
> caused our other queries to slow significantly from our production
> database
> (gf) - semwait again.
>
> It does look like the "cs" column under CPU (which I'd assume is Context
> Swap) does bump up significantly (10x or better) during both my ab test,
> and
> the test you suggested in that archived message.
>
> Reading the first thread you pointed out (2004-04/msg00249.php), Josh
> Berkus
> was questioning the ServerWorks chipsets.  We're running on the Intel
> E7501
> Chipset (MSI board).  Our CPU's are 2.66 GHz with 533MHz FSB,
> Hyperthreading
> enabled.  Unfortunately, I don't have physical access to the machine to
> turn
> HT off.
>
>
> Thanks,
>
> Jason
>
>
>
> > -----Original Message-----
> > From: Gaetano Mendola [mailto:mendola@bigfoot.com]
> > Sent: Thursday, September 23, 2004 1:41 PM
> > To: Jason Coene
> > Subject: Re: Caching of Queries
> >
> > Jason Coene wrote:
> > > Hi Tom,
> > >
> > > Easily recreated with Apache benchmark, "ab -n 30000 -c 3000
> > > http://webserver ".  This runs 1 query per page, everything else is
> > cached
> > > on webserver.
> >
> > That test require 30000 access with 3000 connections that is not a
> normal
> > load. Describe us your HW.
> >
> > 3000 connections means a very huge load, may you provide also the result
> > of
> > "vmstat 5" my webserver trash already with -c 120 !
> >
> > how many connection your postgres can manage ?
> >
> > You have to consider to use a connection pool with that ammount of
> > connections.
> >
> >
> > Regards
> > Gaetano Mendola


Attachment

Re: Caching of Queries

From
"Josh Berkus"
Date:
Jason,

> Sorry, I meant 30,000 with 300 connections - not 3,000.  The 300
> connections
> / second is realistic, if not underestimated.  As is the nature of
> our site
> (realtime information about online gaming), there's a huge fan base
> and as a
> big upset happens, we'll do 50,000 page views in a span of 3-5
> minutes.

First, your posts show no evidences of the CS storm bug.

Second, 300 *new* connections a second is a lot.   Each new connection
requires a significant amount of both database and OS overhead.   This
is why all the other web developers use a connection pool.

In fact, I wouldn't be surprised if your lockups are on the OS level,
even; I don't recall that you cited what OS you're using, but I can
imagine locking up Linux 2.4 trying to spawn 300 new processes a
second.

--Josh

Re: Caching of Queries

From
"Joshua D. Drake"
Date:

Sorry, I meant 30,000 with 300 connections - not 3,000.  The 300
connections
/ second is realistic, if not underestimated.  As is the nature of
our site
(realtime information about online gaming), there's a huge fan base
and as a
big upset happens, we'll do 50,000 page views in a span of 3-5
minutes.   
First, your posts show no evidences of the CS storm bug.

Second, 300 *new* connections a second is a lot.   Each new connection
requires a significant amount of both database and OS overhead.   This
is why all the other web developers use a connection pool.
 
I would second this. You need to be running a connection pool and probably multiple web servers in
front of that. You are talking about a huge amount of connections in that amount of time.

Josh Drake



In fact, I wouldn't be surprised if your lockups are on the OS level,
even; I don't recall that you cited what OS you're using, but I can
imagine locking up Linux 2.4 trying to spawn 300 new processes a
second.

--Josh

---------------------------(end of broadcast)---------------------------
TIP 8: explain analyze is your friend 


-- 
Command Prompt, Inc., home of Mammoth PostgreSQL - S/ODBC and S/JDBC
Postgresql support, programming shared hosting and dedicated hosting.
+1-503-667-4564 - jd@commandprompt.com - http://www.commandprompt.com
PostgreSQL Replicator -- production quality replication for PostgreSQL

Re: Caching of Queries (now with pgpool)

From
"Jason Coene"
Date:
Hi Josh,

I just tried using pgpool to pool the connections, and ran:

ab -n 1000 -c 50 http://wstg.int/portal/news/

I ran some previous queries to get pgpool to pre-establish all the
connections, and ab ran for a few minutes (with one query per page, eek!).
It was still exhibiting the same problems as before.  While so many new
connections at once can surely make the problem worse (and pgpool will
surely help there), shouldn't this prove that it's not the only issue?

We're running FreeBSD 5.2.1

I've attached open locks, running queries, query plans, top output and
vmstat 5 output for while ab was running, from start to finish.

Any ideas?

Jason


> -----Original Message-----
> From: pgsql-performance-owner@postgresql.org [mailto:pgsql-performance-
> owner@postgresql.org] On Behalf Of Josh Berkus
> Sent: Thursday, September 23, 2004 8:06 PM
> To: Jason Coene; pgsql-performance@postgresql.org
> Cc: mendola@bigfoot.com; tgl@sss.pgh.pa.us; josh@agliodbs.com
> Subject: Re: [PERFORM] Caching of Queries
>
> Jason,
>
> > Sorry, I meant 30,000 with 300 connections - not 3,000.  The 300
> > connections
> > / second is realistic, if not underestimated.  As is the nature of
> > our site
> > (realtime information about online gaming), there's a huge fan base
> > and as a
> > big upset happens, we'll do 50,000 page views in a span of 3-5
> > minutes.
>
> First, your posts show no evidences of the CS storm bug.
>
> Second, 300 *new* connections a second is a lot.   Each new connection
> requires a significant amount of both database and OS overhead.   This
> is why all the other web developers use a connection pool.
>
> In fact, I wouldn't be surprised if your lockups are on the OS level,
> even; I don't recall that you cited what OS you're using, but I can
> imagine locking up Linux 2.4 trying to spawn 300 new processes a
> second.
>
> --Josh
>
> ---------------------------(end of broadcast)---------------------------
> TIP 8: explain analyze is your friend

Attachment

Re: Caching of Queries (now with pgpool)

From
Michael Adler
Date:
On Thu, Sep 23, 2004 at 09:23:51PM -0400, Jason Coene wrote:
> I ran some previous queries to get pgpool to pre-establish all the
> connections, and ab ran for a few minutes (with one query per page, eek!).
> It was still exhibiting the same problems as before.  While so many new
> connections at once can surely make the problem worse (and pgpool will
> surely help there), shouldn't this prove that it's not the only issue?

> Any ideas?

Now that your connections are persistent, you may benefit from using
PREPAREd queries.

-Mike

Re: Caching of Queries

From
Gaetano Mendola
Date:
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Joshua D. Drake wrote:
|
|>>Sorry, I meant 30,000 with 300 connections - not 3,000.  The 300
|>>connections
|>>/ second is realistic, if not underestimated.  As is the nature of
|>>our site
|>>(realtime information about online gaming), there's a huge fan base
|>>and as a
|>>big upset happens, we'll do 50,000 page views in a span of 3-5
|>>minutes.
|>>
|>>
|>
|>First, your posts show no evidences of the CS storm bug.
|>
|>Second, 300 *new* connections a second is a lot.   Each new connection
|>requires a significant amount of both database and OS overhead.   This
|>is why all the other web developers use a connection pool.
|>
|>
|>
| I would second this. You need to be running a connection pool and
| probably multiple web servers in
| front of that. You are talking about a huge amount of connections in
| that amount of time.
|
| Josh Drake
|
|
|
|>In fact, I wouldn't be surprised if your lockups are on the OS level,
|>even; I don't recall that you cited what OS you're using, but I can
|>imagine locking up Linux 2.4 trying to spawn 300 new processes a
|>second.

Not to mention that a proxy squid mounted in reverse proxy mode will
help a lot.


Regards
Gaetano Mendola






-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.5 (MingW32)
Comment: Using GnuPG with Thunderbird - http://enigmail.mozdev.org

iD8DBQFBVaOg7UpzwH2SGd4RAnW4AJ9TYV0oSjYcv8Oxt4Ot/T/nJikoRgCg1Egx
r4KKm14ziu/KWFb3SnTK/U8=
=xgmw
-----END PGP SIGNATURE-----


Re: Caching of Queries (now with pgpool)

From
Gaetano Mendola
Date:
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Michael Adler wrote:
| On Thu, Sep 23, 2004 at 09:23:51PM -0400, Jason Coene wrote:
|
|>I ran some previous queries to get pgpool to pre-establish all the
|>connections, and ab ran for a few minutes (with one query per page, eek!).
|>It was still exhibiting the same problems as before.  While so many new
|>connections at once can surely make the problem worse (and pgpool will
|>surely help there), shouldn't this prove that it's not the only issue?
|
|
|>Any ideas?
|
|
| Now that your connections are persistent, you may benefit from using
| PREPAREd queries.
|
| -Mike

With his load will not change anything.


Regards
Gaetano Mendola



-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.2.5 (MingW32)
Comment: Using GnuPG with Thunderbird - http://enigmail.mozdev.org

iD8DBQFBVany7UpzwH2SGd4RAj9UAJ0SO3VE7zMbwrgdwPQc+HP5PHClMACgtTvn
KIp1TK2lVbmXZ+s62fpJ46U=
=sjT0
-----END PGP SIGNATURE-----


Re: Caching of Queries

From
"Aaron Werman"
Date:
There is a difference between MySQL and Oracle here.

Oracle, to reduce parse/planner costs, hashes statements to see if it can
match an existing optimizer plan. This is optional and there are a few
flavors that range from a characher to characyter match through parse tree
matches through replacing of literals in the statements with parameters.
This dramatically improves performance in almost all high transaction rate
systems.

MySQL stores a statement with its results. This is optional and when a
client allows this type of processing, the SQL is hashed and matched to the
statement - and the stored *result* is returned. The point is that a lot of
systems do lots of static queries, such as a pick list on a web page - but
if the data changes the prior result is returned. This (plus a stable jdbc
driver) was the reason MySQL did well in the eWeek database comparison.

/Aaron



----- Original Message -----
From: "Scott Kirkwood" <scottakirkwood@gmail.com>
To: <pgsql-performance@postgresql.org>
Sent: Wednesday, September 22, 2004 3:50 PM
Subject: [PERFORM] Caching of Queries


> I couldn't find anything in the docs or in the mailing list on this,
> but it is something that Oracle appears to do as does MySQL.
> The idea, I believe, is to do a quick (hash) string lookup of the
> query and if it's exactly the same as another query that has been done
> recently to re-use the old parse tree.
> It should save the time of doing the parsing of the SQL and looking up
> the object in the system tables.
> It should probably go through the planner again because values passed
> as parameters may have changed. Although, for extra points it could
> look at the previous query plan as a hint.
> On the surface it looks like an easy enhancement, but what do I know?
> I suppose it would benefit mostly those programs that use a lot of
> PQexecParams() with simple queries where a greater percentage of the
> time is spent parsing the SQL rather than building the execute plan.
> What do you think?
>
> ---------------------------(end of broadcast)---------------------------
> TIP 7: don't forget to increase your free space map settings
>

Re: Caching of Queries

From
Neil Conway
Date:
On Thu, 2004-09-23 at 07:43, Aaron Werman wrote:
> MySQL stores a statement with its results. This is optional and when a
> client allows this type of processing, the SQL is hashed and matched to the
> statement - and the stored *result* is returned. The point is that a lot of
> systems do lots of static queries, such as a pick list on a web page - but
> if the data changes the prior result is returned. This (plus a stable jdbc
> driver) was the reason MySQL did well in the eWeek database comparison.

I think the conclusion of past discussions about this feature is that
it's a bad idea. Last I checked, MySQL has to clear the *entire* query
cache when a single DML statement modifying the table in question is
issued. Not to mention that the feature is broken for non-deterministic
queries (like now(), ORDER BY random(), or nextval('some_seq'), and so
on). That makes the feature close to useless for a lot of situations,
albeit not every situation.

-Neil



Re: Caching of Queries

From
Tom Lane
Date:
Neil Conway <neilc@samurai.com> writes:
> I think the conclusion of past discussions about this feature is that
> it's a bad idea. Last I checked, MySQL has to clear the *entire* query
> cache when a single DML statement modifying the table in question is
> issued.

Do they actually make a rigorous guarantee that the cached result is
still accurate when/if it is returned to the client?  (That's an honest
question --- I don't know how MySQL implements this.)

IIRC, in our past threads on this topic, it was suggested that if you
can tolerate not-necessarily-up-to-date results, you should be doing
this sort of caching on the client side and not in the DB server at all.
I wouldn't try that in a true "client" scenario, but when the DB client
is application-server middleware, it would make some sense to cache in
the application server.

            regards, tom lane

Re: Caching of Queries

From
Greg Stark
Date:
Neil Conway <neilc@samurai.com> writes:

> I think the conclusion of past discussions about this feature is that
> it's a bad idea. Last I checked, MySQL has to clear the *entire* query
> cache when a single DML statement modifying the table in question is
> issued. Not to mention that the feature is broken for non-deterministic
> queries (like now(), ORDER BY random(), or nextval('some_seq'), and so
> on). That makes the feature close to useless for a lot of situations,
> albeit not every situation.

Well there's no reason to assume that just because other implementations are
weak that postgres would have to slavishly copy them.

I've often wondered whether it would make sense to cache the intermediate
results in queries. Any time there's a Materialize node, the database is
storing all those data somewhere; it could note the plan and parameters that
generated the data and reuse them if it sees the same plan and parameters --
including keeping track of whether the source tables have changed or whether
there were any non-immutable functions of course.

This could be quite helpful as people often do a series of queries on the same
basic data. Things like calculating the total number of records matching the
user's query then fetching only the records that fit on the current page. Or
fetching records for a report then having to calculate subtotals and totals
for that same report. Or even generating multiple reports breaking down the
same data along different axes.

--
greg

Re: Caching of Queries

From
Scott Kirkwood
Date:
On Mon, 27 Sep 2004 15:03:01 +1000, Neil Conway <neilc@samurai.com> wrote:
> I think the conclusion of past discussions about this feature is that
> it's a bad idea. Last I checked, MySQL has to clear the *entire* query
> cache when a single DML statement modifying the table in question is
> issued. Not to mention that the feature is broken for non-deterministic
> queries (like now(), ORDER BY random(), or nextval('some_seq'), and so
> on). That makes the feature close to useless for a lot of situations,
> albeit not every situation.

I think it's important to demark three levels of possible caching:
1) Caching of the parsed query tree
2) Caching of the query execute plan
3) Caching of the query results

I think caching the query results (3) is pretty dangerous and
difficult to do correctly.

Caching of the the execute plan (2) is not dangerous but may actually
execute more slowly by caching a bad plan (i.e. a plan not suited to
the current data)

Caching of the query tree (1) to me has very little downsides (except
extra coding). But may not have a lot of win either, depending how
much time/resources are required to parse the SQL and lookup the
objects in the system tables (something I've never gotten a
satisfactory answer about). Also, some of the query cache would have
to be cleared when DDL statements are performed.

-Scott

Re: Caching of Queries

From
Mitch Pirtle
Date:
On Mon, 27 Sep 2004 01:18:56 -0400, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> IIRC, in our past threads on this topic, it was suggested that if you
> can tolerate not-necessarily-up-to-date results, you should be doing
> this sort of caching on the client side and not in the DB server at all.
> I wouldn't try that in a true "client" scenario, but when the DB client
> is application-server middleware, it would make some sense to cache in
> the application server.

I'd also like to add that when one of the Mambo community members
started running benchmarks of popular Content Management Systems
(CMS), the ones that implemented page-level caching were significantly
more scalable as a result of the decreased load on the database (and
application server, as a result):

http://forum.mamboserver.com/showthread.php?t=11782

Caching at the database level provides the smallest possible
performance boost (at least regarding caching), as caching the query
on the webserver (via ADOdb's query cache) avoids the database server
altogether; and page-level caching gives you the biggest possible
benefit.

Yes, you have to be careful how you cache your data, but for many
applications it is easy to implement a trigger that clears the cache
when certain data is updated.

-- Mitch

Re: Caching of Queries

From
Bruce Momjian
Date:
Added to TODO:

* Consider automatic caching of queries at various levels:
        o Parsed query tree
        o Query execute plan
        o Query results


---------------------------------------------------------------------------

Scott Kirkwood wrote:
> On Mon, 27 Sep 2004 15:03:01 +1000, Neil Conway <neilc@samurai.com> wrote:
> > I think the conclusion of past discussions about this feature is that
> > it's a bad idea. Last I checked, MySQL has to clear the *entire* query
> > cache when a single DML statement modifying the table in question is
> > issued. Not to mention that the feature is broken for non-deterministic
> > queries (like now(), ORDER BY random(), or nextval('some_seq'), and so
> > on). That makes the feature close to useless for a lot of situations,
> > albeit not every situation.
>
> I think it's important to demark three levels of possible caching:
> 1) Caching of the parsed query tree
> 2) Caching of the query execute plan
> 3) Caching of the query results
>
> I think caching the query results (3) is pretty dangerous and
> difficult to do correctly.
>
> Caching of the the execute plan (2) is not dangerous but may actually
> execute more slowly by caching a bad plan (i.e. a plan not suited to
> the current data)
>
> Caching of the query tree (1) to me has very little downsides (except
> extra coding). But may not have a lot of win either, depending how
> much time/resources are required to parse the SQL and lookup the
> objects in the system tables (something I've never gotten a
> satisfactory answer about). Also, some of the query cache would have
> to be cleared when DDL statements are performed.
>
> -Scott
>
> ---------------------------(end of broadcast)---------------------------
> TIP 7: don't forget to increase your free space map settings
>

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073

Re: Caching of Queries

From
Harald Fuchs
Date:
In article <1096261381.25688.741.camel@localhost.localdomain>,
Neil Conway <neilc@samurai.com> writes:

> I think the conclusion of past discussions about this feature is that
> it's a bad idea. Last I checked, MySQL has to clear the *entire* query
> cache when a single DML statement modifying the table in question is
> issued.

Nope, it deletes only queries using that table.

> Not to mention that the feature is broken for non-deterministic
> queries (like now(), ORDER BY random(), or nextval('some_seq'), and so
> on).

Queries containing now(), rand(), or similar functions aren't cached by MySQL.

Re: Caching of Queries

From
"Aaron Werman"
Date:
From: "Scott Kirkwood" <scottakirkwood@gmail.com>

> On Mon, 27 Sep 2004 15:03:01 +1000, Neil Conway <neilc@samurai.com> wrote:
> > I think the conclusion of past discussions about this feature is that
> > it's a bad idea. Last I checked, MySQL has to clear the *entire* query
> > cache when a single DML statement modifying the table in question is
> > issued. Not to mention that the feature is broken for non-deterministic
> > queries (like now(), ORDER BY random(), or nextval('some_seq'), and so
> > on). That makes the feature close to useless for a lot of situations,
> > albeit not every situation.

Only the cache of changed tables are cleared. MySQL sanely doesn't cache
statements with unstable results. The vast majority of statements are
stable. The feature is likely to dramatically improve performance of most
applications; ones with lots of queries are obvious, but even data
warehouses have lots of (expensive) repetitious queries against static data.

>
> I think it's important to demark three levels of possible caching:
> 1) Caching of the parsed query tree
> 2) Caching of the query execute plan
> 3) Caching of the query results
>
> I think caching the query results (3) is pretty dangerous and
> difficult to do correctly.

I think it's very hard to cache results on the client side without guidance
because it is expensive to notify the client of change events. A changing
table couldn't be cached on client side without a synchronous check to the
db - defeating the purpose.

Guidance should work, though - I also think an optional client configuration
table which specified static tables would work and the cost of a sparse XOR
hash of statements to find match candidate statements would be negligible.
The list of tables would be a contract that they won't change. The fact is
that there often are a lot of completely static tables in high volume
transaction systems, and the gain of SQUID style proxying could be an
enormous performance gain (effort, network overhead, latency, DB server cont
ext switching, ...) especially in web farm and multi tiered applications
(and middleware doing caching invests so many cycles to do so).

Caching results on the server would also dramatically improve performance of
high transaction rate applications, but less than at the client. The
algorithm of only caching small result sets for tables that haven't changed
recently is trivial, and the cost of first pass filtering of candidate
statements to use a cache result through sparse XOR hashes is low. The
statement/results cache would need to be invalidated when any referenced
table is changed. This option seems like a big win.

>
> Caching of the the execute plan (2) is not dangerous but may actually
> execute more slowly by caching a bad plan (i.e. a plan not suited to
> the current data)

This concern could be resolved by aging plans out of cache.

This concern relates to an idiosyncrasy of pg, that vacuum has such a
profound effect. Anyone who has designed very high transaction rate systems
appreciates DB2 static binding, where a plan is determined and stored in the
database, and precompiled code uses those plans - and is both stable and
free of plan cost. The fact is that at a high transaction rate, we often see
the query parse and optimization as the most expensive activity. The planner
design has to be "dumbed down" to reduce overhead (and even forced to geqo
choice).

The common development philosophy in pg is expecting explicit prepares and
executes against bind variables (relatively rare, but useful in high volume
situations), and otherwise (commonly) using explicit literals in statements.
The problem here is the prepare/execute only works in monolithic
applications, and the chance of reuse of SQL statements with literals is
much lower.

(On a blue sky note, I would love to see a planner that dynamically changed
search depth of execution paths, so it could exhaustively build best plans
at low usage times and be less sophisticated when the load was higher... or
better yet, try alternatively for very high transaction frequency plans
until it found the best one in practice! The identified correct plan would
be used subsequently.)

>
> Caching of the query tree (1) to me has very little downsides (except
> extra coding). But may not have a lot of win either, depending how
> much time/resources are required to parse the SQL and lookup the
> objects in the system tables (something I've never gotten a
> satisfactory answer about). Also, some of the query cache would have
> to be cleared when DDL statements are performed.

Parse cache is obviously easy - just store the parse tree with a hash and
the SQL string. This would only help some very specific types of transaction
mixes. The issue is why go through all this trouble without caching the
plan? The same issues exist in both - the cost of matching, the need to
invalidate if objects definitions change, but the win would be so much less.

>
> -Scott
>
> ---------------------------(end of broadcast)---------------------------
> TIP 7: don't forget to increase your free space map settings
>




Re: Caching of Queries

From
"Matt Clark"
Date:
> I think it's very hard to cache results on the client side
> without guidance because it is expensive to notify the client
> of change events. A changing table couldn't be cached on
> client side without a synchronous check to the db - defeating
> the purpose.

This is very true.  Client side caching is an enormous win for apps, but it
requires quite a lot of logic, triggers to update last-modified fields on
relevant tables, etc etc.  Moving some of this logic to the DB would perhaps
not usually be quite as efficient as a bespoke client caching solution, but
it will above all be a lot easier for the application developer!

The other reason why it is god for the DB to support this feature is that in
typical web apps there are multiple web/app servers in a farm, but mostly
just one live DB instance, so effective client side caching requires a
distributed cache, or a SQL proxy, both of which are the kind of middleware
that tends to give cautious people cause to fret.

As a side effect, this would also satisfy the common gotcha of count(),
max() and other aggregates always needing a scan.  There are _so_ many
occasions where 'select count(*) from bar' really does not need to be that
accurate.

So yeah, here's another vote for this feature.  It doesn't even need to
happen automagically to be honest, so long as it's really simple for the
client to turn on (preferably per-statement or per-table).

Actually, that gives me an implementation idea.  How about cacheable views?
So you might do:

CREATE [ CACHEABLE ] VIEW view
    [ MAXSTALEDATA seconds ]
    [ MAXSTALEPLAN seconds ]
AS ...

That would be tidy I think...

M


Re: Caching of Queries

From
Rod Taylor
Date:
> So yeah, here's another vote for this feature.  It doesn't even need to
> happen automagically to be honest, so long as it's really simple for the
> client to turn on (preferably per-statement or per-table).

It might be easiest to shove the caching logic into pgpool instead.

Create an extension of EXPLAIN which returns data in an easy to
understand format for computers so that pgpool can retrieve information
such as a list of tables involved,

Extend LISTEN to be able to listen for a SELECT on a table --
replacement for dynamically adding triggers to send a notify on inserts,
updates, deletes.

Create some kind of generic LISTEN for structural changes. I know SLONY
could make use of triggers on ALTER TABLE, and friends as well.


When pg_pool is told to cache a query, it can get a table list and
monitor for changes. When it gets changes, simply dumps the cache.


Re: Caching of Queries

From
Jeff
Date:
[ discussion of server side result caching ]

and lets not forget PG's major fork it will throw into things:  MVCC
The results of query A may hold true for txn 1, but not txn 2 and so on
.
That would have to be taken into account as well and would greatly
complicate things.

It is always possible to do a "poor man"'s query cache with triggers..
which would just leave you with basically a materialized view.

--
Jeff Trout <jeff@jefftrout.com>
http://www.jefftrout.com/
http://www.stuarthamm.net/


Re: Caching of Queries

From
Mitch Pirtle
Date:
On Mon, 27 Sep 2004 18:20:48 +0100, Matt Clark <matt@ymogen.net> wrote:
> This is very true.  Client side caching is an enormous win for apps, but it
> requires quite a lot of logic, triggers to update last-modified fields on
> relevant tables, etc etc.  Moving some of this logic to the DB would perhaps
> not usually be quite as efficient as a bespoke client caching solution, but
> it will above all be a lot easier for the application developer!

In the world of PHP it is trivial thanks to PEAR's Cache_Lite.  The
project lead for Mambo implemented page-level caching in a day, and
had all the triggers for clearing the cache included in the content
management interface - not difficult at all.

Basically you set a default in seconds for the HTML results to be
cached, and then have triggers set that force the cache to regenerate
(whenever CRUD happens to the content, for example).

Can't speak for Perl/Python/Ruby/.Net/Java, but Cache_Lite sure made a
believer out of me!

-- Mitch

Re: Caching of Queries

From
"Jim C. Nasby"
Date:
On Thu, Sep 23, 2004 at 08:29:25AM -0700, Mr Pink wrote:
> Not knowing anything about the internals of pg, I don't know how this relates, but in theory,
> query plan caching is not just about saving time re-planning queries, it's about scalability.
> Optimizing queries requires shared locks on the database metadata, which, as I understand it
> causes contention and serialization, which kills scalability.

One of the guru's can correct me if I'm wrong here, but AFAIK metadata
lookups use essentially the same access methods as normal queries. This
means MVCC is used and no locking is required. Even if locks were
required, they would be shared read locks which wouldn't block each
other.
--
Jim C. Nasby, Database Consultant               decibel@decibel.org
Give your computer some brain candy! www.distributed.net Team #1828

Windows: "Where do you want to go today?"
Linux: "Where do you want to go tomorrow?"
FreeBSD: "Are you guys coming, or what?"

Re: Caching of Queries

From
Matt Clark
Date:
>Basically you set a default in seconds for the HTML results to be
>cached, and then have triggers set that force the cache to regenerate
>(whenever CRUD happens to the content, for example).
>
>Can't speak for Perl/Python/Ruby/.Net/Java, but Cache_Lite sure made a
>believer out of me!
>
>
>
Nice to have it in a library, but if you want to be that simplistic then
it's easy in any language.  What if a process on server B modifies a n
important value that server A has cached though?  Coherency (albeit that
the client may choose to not use it) is a must for a general solution.

Re: Caching of Queries

From
Matt Clark
Date:
>It might be easiest to shove the caching logic into pgpool instead.
>
>...
>
>When pg_pool is told to cache a query, it can get a table list and
>monitor for changes. When it gets changes, simply dumps the cache.
>
>
>
>
It's certainly the case that the typical web app (which, along with
warehouses, seems to be one half of the needy apps), could probably do
worse than use pooling as well.  I'm not well up enough on pooling to
know how bulletproof it is though, which is why I included it in my list
of things that make me go 'hmm....'.  It would be really nice not to
have to take both things together.

More to the point though, I think this is a feature that really really
should be in the DB, because then it's trivial for people to use.
Taking an existing production app and justifying a switch to an extra
layer of pooling software is relatively hard compared with grabbing data
from a view instead of a table (or setting a variable, or adding a tweak
to a query, or however else it might be implemented).

Eminiently doable in pgpool though, and just the right thing for anyone
already using it.

M

Re: Caching of Queries

From
Rod Taylor
Date:
> More to the point though, I think this is a feature that really really
> should be in the DB, because then it's trivial for people to use.

How does putting it into PGPool make it any less trivial for people to
use?


Re: Caching of Queries

From
Steve Atkins
Date:
On Mon, Sep 27, 2004 at 09:19:12PM +0100, Matt Clark wrote:

> >Basically you set a default in seconds for the HTML results to be
> >cached, and then have triggers set that force the cache to regenerate
> >(whenever CRUD happens to the content, for example).
> >
> >Can't speak for Perl/Python/Ruby/.Net/Java, but Cache_Lite sure made a
> >believer out of me!
> >
> >
> >
> Nice to have it in a library, but if you want to be that simplistic then
> it's easy in any language.  What if a process on server B modifies a n
> important value that server A has cached though?  Coherency (albeit that
> the client may choose to not use it) is a must for a general solution.

memcached is one solution designed for that situation. Easy to use
from most languages.  Works. Lets you use memory on systems where you
have it, rather than using up valuable database server RAM that's
better spent caching disk sectors.

Any competently written application where caching results would be a
suitable performance boost can already implement application or
middleware caching fairly easily, and increase performance much more
than putting result caching into the database would.

I don't see caching results in the database as much of a win for most
well written applications.  Toy benchmarks, sure, but for real apps it
seems it would add a lot of complexity, and violate the whole point of
using an ACID database.

(Caching parse trees or query plans, though? It'd be interesting to
 model what effect that'd have.)

Cheers,
  Steve


Re: Caching of Queries

From
Matt Clark
Date:

More to the point though, I think this is a feature that really really 
should be in the DB, because then it's trivial for people to use.     
How does putting it into PGPool make it any less trivial for people to
use?
The answers are at  http://www2b.biglobe.ne.jp/~caco/pgpool/index-e.html .  Specifically, it's a separate application that needs configuration, the homepage has no real discussion of the potential pitfalls of pooling and what this implementation does to get around them, you get the idea.  I'm sure it's great software, but it doesn't come as part of the DB server, so 95% of people who would benefit from query caching being implemented in it never will.  If it shipped with and was turned on by default in SUSE or RedHat that would be a different matter.  Which I realise makes me look like one of those people who doesn't appreciate code unless it's 'popular', but I hope I'm not *that* bad...

Oh OK, I'll say it, this is a perfect example of why My*** has so much more mindshare.  It's not better, but it sure makes the average Joe _feel_ better.  Sorry, I've got my corporate hat on today, I'm sure I'll feel a little less cynical tomorrow.

M

Re: Caching of Queries

From
Matt Clark
Date:
>Any competently written application where caching results would be a
>suitable performance boost can already implement application or
>middleware caching fairly easily, and increase performance much more
>than putting result caching into the database would.
>
>
>
I guess the performance increase is that you can spend $10,000 on a
developer, or $10,000 on hardware, and for the most part get a more
reliable result the second way.  MemcacheD is fine(ish), but it's not a
panacea, and it's more than easy to shoot yourself in the foot with it.
Caching is hard enough that lots of people do it badly - I'd rather use
an implementation from the PG team than almost anywhere else.

>I don't see caching results in the database as much of a win for most
>well written applications.  Toy benchmarks, sure, but for real apps it
>seems it would add a lot of complexity, and violate the whole point of
>using an ACID database.
>
>
>
Well the point surely is to _remove_ complexity from the application,
which is written by God Knows Who, and put it in the DB, which is
written by God And You.  And you can still have ACID (cached data is not
the same as stale data, although once you have the former, the latter
can begin to look tempting sometimes).

M

Re: Caching of Queries

From
"Iain"
Date:
Jim,

I can only tell you (roughly) how it works wth Oracle, and it's a very well
documented and laboured point over there - it's the cornerstone of Oracle's
scalability architecture, so if you don't believe me, or my explanation is
just plain lacking, then it wouldn't be a bad idea to check it out. The
"other Tom" aka Tomas Kyte runs the Ask Tom site which is a great source of
info on this. It's also very well explained in his book "Expert one on one
Oracle" I think it was called. I havn't seen any reason yet as to why the
same issues shouldn't, don't or  wouldn't apply to pg.

Your comment is both right and wrong. Yes, metadata lookups are essentially
the same as as access methods for normal queries. Any time you read data in
the DB you have to place a shared lock, often called a latch - it's a
lightweight type of lock. The trouble is that while a data page can have
multiple latches set at any time, only 1 process can be placing a  a latch
on a page at a time. This doesn't sound so serious so far, latches are
"lightweight" afterall, however... even in a database of a billion rows and
100+ tables, the database metadata is a very _small_ area. You must put
latches on the metadata tables to do optimization, so for example, if you
are optimizing a 10 table join, you must queue up 10 times to place your
latchs. You then do your optimization and queue up 10 more times to remove
your latches. In fact it is worse than this, because you won't queue up 10
times it's more likely to be a hundred times since it is far more complex
than 1 latch per table being optimized (you will be looking up statistics
and other things).

As I already said, even in a huge DB of a billion rows, these latches are
happening on a realatively small and concentrated data set - the metadata.
Even if there is no contention for the application data, the contention for
the metadata may be furious. Consider this scenario, you have a 1000 users
constantly submitting queries that must not only be soft parsed (SQL
statement syntax) but hard parsed (optimized) because you have no query
cache. Even if they are looking at completely different data, they'll all be
queuing up for latches on the same little patch of metadata. Doubling your
CPU speed or throwing in a fibre channel disk array will not help here, the
system smply won't scale.

Tom Lane noted that since the query cache would be in shared memory the
contention issue does not go away. This is true, but I don't think that it's
hard to see that the amount of contention is consderably less in any system
that is taking advantage of the caching facility - ie applications using
bind variables to reduce hard parsing. However, badly written applications
(from the point of view of query cache utilization) could very well
experience a degradation in performance. This could be handled with an
option to disable caching - or even better to disable caching of any sql not
using binds. I don't think even the mighty Oracle has that option.

As you may have guessed, my vote is for implementing a query cache that
includes plans.

I have no specific preference as to data caching. It doesn't seem to be so
important to me.

Regards
Iain



> On Thu, Sep 23, 2004 at 08:29:25AM -0700, Mr Pink wrote:
> > Not knowing anything about the internals of pg, I don't know how this
relates, but in theory,
> > query plan caching is not just about saving time re-planning queries,
it's about scalability.
> > Optimizing queries requires shared locks on the database metadata,
which, as I understand it
> > causes contention and serialization, which kills scalability.
>
> One of the guru's can correct me if I'm wrong here, but AFAIK metadata
> lookups use essentially the same access methods as normal queries. This
> means MVCC is used and no locking is required. Even if locks were
> required, they would be shared read locks which wouldn't block each
> other.
> --
> Jim C. Nasby, Database Consultant               decibel@decibel.org
> Give your computer some brain candy! www.distributed.net Team #1828
>
> Windows: "Where do you want to go today?"
> Linux: "Where do you want to go tomorrow?"
> FreeBSD: "Are you guys coming, or what?"
>
> ---------------------------(end of broadcast)---------------------------
> TIP 8: explain analyze is your friend


Re: Caching of Queries

From
Tom Lane
Date:
"Iain" <iain@mst.co.jp> writes:
> I can only tell you (roughly) how it works wth Oracle,

Which unfortunately has little to do with how it works with Postgres.
This "latches" stuff is irrelevant to us.

In practice, any repetitive planning in PG is going to be consulting
catalog rows that it draws from the backend's local catalog caches.
After the first read of a given catalog row, the backend won't need
to re-read it unless the associated table has a schema update.  (There
are some other cases, like a VACUUM FULL of the catalog the rows came
from, but in practice catalog cache entries don't change often in most
scenarios.)  We need place only one lock per table referenced in order
to interlock against schema updates; not one per catalog row used.

The upshot of all this is that any sort of shared plan cache is going to
create substantially more contention than exists now --- and that's not
even counting the costs of managing the cache, ie deciding when to throw
away entries.

A backend-local plan cache would avoid the contention issues, but would
of course not allow amortizing planning costs across multiple backends.

I'm personally dubious that sharing planning costs is a big deal.
Simple queries generally don't take that long to plan.  Complicated
queries do, but I think the reusability odds go down with increasing
query complexity.

            regards, tom lane

Re: Caching of Queries

From
"Iain"
Date:
Hi Tom,

> This "latches" stuff is irrelevant to us.

Well, that's good to know anyway, thanks for setting me straight. Maybe
Oracle could take a leaf out of PGs book instead of the other way around. I
recall that you mentioned the caching of the schema before, so even though I
assumed PG was latching the metadata, I had begun to wonder if it was
actually neccessary.

While it7s obviously not as critical as I thought, I think there may still
be some potential for query caching by pg. It would be nice to have the
option anyway, as different applications have different needs.

I think that re-use of SQL in applications (ie controlling the proliferation
of SQL statements that are minor variants of each other) is a good goal for
maintainability, even if it doesn't have a major impact on performance as it
seems you are suggesting in the case of pg. Even complex queries that must
be constructed dynamically typically only have a finite number of options
and can still use bind variables, so in a well tuned system, they should
still be viable candidates for caching (ie, if they aren't being bumped out
of the cache by thousands of little queries not using binds).

I'll just finish by saying that, developing applications in a way that would
take advantage of any query caching still seems like good practice to me,
even if the target DBMS has no query caching. For now, that's what I plan to
do with future PG/Oracle/Hypersonic (my 3 favourite DBMSs) application
development anyway.

Regards
Iain


----- Original Message -----
From: "Tom Lane" <tgl@sss.pgh.pa.us>
To: "Iain" <iain@mst.co.jp>
Cc: "Jim C. Nasby" <decibel@decibel.org>; <pgsql-performance@postgresql.org>
Sent: Tuesday, September 28, 2004 12:17 PM
Subject: Re: [PERFORM] Caching of Queries


> "Iain" <iain@mst.co.jp> writes:
> > I can only tell you (roughly) how it works wth Oracle,
>
> Which unfortunately has little to do with how it works with Postgres.
> This "latches" stuff is irrelevant to us.
>
> In practice, any repetitive planning in PG is going to be consulting
> catalog rows that it draws from the backend's local catalog caches.
> After the first read of a given catalog row, the backend won't need
> to re-read it unless the associated table has a schema update.  (There
> are some other cases, like a VACUUM FULL of the catalog the rows came
> from, but in practice catalog cache entries don't change often in most
> scenarios.)  We need place only one lock per table referenced in order
> to interlock against schema updates; not one per catalog row used.
>
> The upshot of all this is that any sort of shared plan cache is going to
> create substantially more contention than exists now --- and that's not
> even counting the costs of managing the cache, ie deciding when to throw
> away entries.
>
> A backend-local plan cache would avoid the contention issues, but would
> of course not allow amortizing planning costs across multiple backends.
>
> I'm personally dubious that sharing planning costs is a big deal.
> Simple queries generally don't take that long to plan.  Complicated
> queries do, but I think the reusability odds go down with increasing
> query complexity.
>
> regards, tom lane
>
> ---------------------------(end of broadcast)---------------------------
> TIP 6: Have you searched our list archives?
>
>                http://archives.postgresql.org


Re: Caching of Queries

From
"Aaron Werman"
Date:
----- Original Message -----
From: "Tom Lane" <tgl@sss.pgh.pa.us>
To: "Iain" <iain@mst.co.jp>
Cc: "Jim C. Nasby" <decibel@decibel.org>; <pgsql-performance@postgresql.org>
Sent: Monday, September 27, 2004 11:17 PM
Subject: Re: [PERFORM] Caching of Queries


> "Iain" <iain@mst.co.jp> writes:
> > I can only tell you (roughly) how it works wth Oracle,
>
> Which unfortunately has little to do with how it works with Postgres.
> This "latches" stuff is irrelevant to us.

Latches are the Oracle term for semaphores. Both Oracle and pg use
semaphores and spin locks to serialize activity in critical sections. I
believe that the point that blocking/queuing reduces scalability is valid.

>
> In practice, any repetitive planning in PG is going to be consulting
> catalog rows that it draws from the backend's local catalog caches.
> After the first read of a given catalog row, the backend won't need
> to re-read it unless the associated table has a schema update.  (There
> are some other cases, like a VACUUM FULL of the catalog the rows came
> from, but in practice catalog cache entries don't change often in most
> scenarios.)  We need place only one lock per table referenced in order
> to interlock against schema updates; not one per catalog row used.
>
> The upshot of all this is that any sort of shared plan cache is going to
> create substantially more contention than exists now --- and that's not
> even counting the costs of managing the cache, ie deciding when to throw
> away entries.

I imagine a design where a shared plan cache would consist of the plans,
indexed by a statement hash and again by dependant objects.  A statement to
be planned would be hashed and matched to the cache. DDL would need to
synchronously destroy all dependant plans. If each plan maintains a validity
flag, changing the cache wouldn't have to block so I don't see where there
would be contention.

>
> A backend-local plan cache would avoid the contention issues, but would
> of course not allow amortizing planning costs across multiple backends.
>
> I'm personally dubious that sharing planning costs is a big deal.
> Simple queries generally don't take that long to plan.  Complicated
> queries do, but I think the reusability odds go down with increasing
> query complexity.
>

I think both the parse and planning are major tasks if the transaction rate
is high. Simple queries can easily take much longer to plan than execute, so
this is a scalability concern. Caching complicated queries is valuable -
apps seem to have lots of similar queries because they are intimately
related to the data model.

> regards, tom lane
>
> ---------------------------(end of broadcast)---------------------------
> TIP 6: Have you searched our list archives?
>
>                http://archives.postgresql.org
>

Re: Caching of Queries

From
Tom Lane
Date:
"Aaron Werman" <awerman2@hotmail.com> writes:
> I imagine a design where a shared plan cache would consist of the plans,
> indexed by a statement hash and again by dependant objects.  A statement to
> be planned would be hashed and matched to the cache. DDL would need to
> synchronously destroy all dependant plans. If each plan maintains a validity
  ^^^^^^^^^^^^^
> flag, changing the cache wouldn't have to block so I don't see where there
                           ^^^^^^^^^^^^^^^^^^^^^^
> would be contention.

You have contention to access a shared data structure *at all* -- for
instance readers must lock out writers.  Or didn't you notice the self-
contradictions in what you just said?

Our current scalability problems dictate reducing such contention, not
adding whole new sources of it.

            regards, tom lane

Re: Caching of Queries

From
"Aaron Werman"
Date:
----- Original Message -----
From: "Tom Lane" <tgl@sss.pgh.pa.us>
To: "Aaron Werman" <awerman2@hotmail.com>
Cc: "Iain" <iain@mst.co.jp>; "Jim C. Nasby" <decibel@decibel.org>;
<pgsql-performance@postgresql.org>
Sent: Tuesday, September 28, 2004 9:58 AM
Subject: Re: [PERFORM] Caching of Queries


> "Aaron Werman" <awerman2@hotmail.com> writes:
> > I imagine a design where a shared plan cache would consist of the plans,
> > indexed by a statement hash and again by dependant objects.  A statement
to
> > be planned would be hashed and matched to the cache. DDL would need to
> > synchronously destroy all dependant plans. If each plan maintains a
validity
>   ^^^^^^^^^^^^^
> > flag, changing the cache wouldn't have to block so I don't see where
there
>                            ^^^^^^^^^^^^^^^^^^^^^^
> > would be contention.
>
> You have contention to access a shared data structure *at all* -- for
> instance readers must lock out writers.  Or didn't you notice the self-
> contradictions in what you just said?
>
> Our current scalability problems dictate reducing such contention, not
> adding whole new sources of it.

You're right - that seems unclear. What I meant is that there can be a
global hash table that is never locked, and the hashes point to chains of
plans that are only locally locked for maintenance, such as gc and chaining
hash collisions. If maintenance was relatively rare and only local, my
assumption is that it wouldn't have global impact.

The nice thing about plan caching is that it can be sloppy, unlike block
cache, because it is only an optimization tweak. So, for example, if the
plan has atomic refererence times or counts there is no need to block, since
overwriting is not so bad. If the multiprocessing planner chains the same
plan twice, the second one would ultimately age out....

/Aaron

>
> regards, tom lane
>

Re: Caching of Queries

From
"Matthew Nuzum"
Date:
I could spend a week or two tweaking the performance of my database servers
and probably make some sizeable improvements, but I'm not going to.

Why? Because PostgreSQL screams as it is.

I would make sure that if the consensus is to add some sort of caching that
it be done only if there is no hit to current performance and stability.
That being said, I think that server side caching has major buzz and there's
nothing wrong with adding features that sell.

I will disagree with 3 points made on the argument against caching.
Specifically, the benefit of doing caching on the db server is that the
benefits may be reaped by multiple clients where as caching on the client
side must be done by each client and may not be as effective.

So what if the caching has a slight chance of returning stale results?  Just
make sure people know about it in advance.  There are some things where
stale results are no big deal and if I can easily benefit from an aggressive
caching system, I will (and I do now with the adodb caching library, but
like I said, caching has to be done for each client).  In fact, I'm all for
using a low-tech cache expiration algorithm to keep complexity down.

Finally, if the caching is not likely to help (or may even hurt) simple
queries but is likely to help complex queries then fine, make sure people
know about it and let them decide if they can benefit.

Sorry if I'm beating a dead horse or playing the devil's advocate.  Just
felt compelled to chime in.

--
Matthew Nuzum      + "Man was born free, and everywhere
www.bearfruit.org  :  he is in chains," Rousseau
+~~~~~~~~~~~~~~~~~~+ "Then you will know the truth, and
the TRUTH will set you free," Jesus Christ (John 8:32 NIV)

-----Original Message-----
From: pgsql-performance-owner@postgresql.org
[mailto:pgsql-performance-owner@postgresql.org] On Behalf Of Tom Lane
Sent: Monday, September 27, 2004 1:19 AM
To: Neil Conway
Cc: Aaron Werman; Scott Kirkwood; pgsql-performance@postgresql.org
Subject: Re: [PERFORM] Caching of Queries

Neil Conway <neilc@samurai.com> writes:
> I think the conclusion of past discussions about this feature is that
> it's a bad idea. Last I checked, MySQL has to clear the *entire* query
> cache when a single DML statement modifying the table in question is
> issued.

Do they actually make a rigorous guarantee that the cached result is
still accurate when/if it is returned to the client?  (That's an honest
question --- I don't know how MySQL implements this.)

IIRC, in our past threads on this topic, it was suggested that if you
can tolerate not-necessarily-up-to-date results, you should be doing
this sort of caching on the client side and not in the DB server at all.
I wouldn't try that in a true "client" scenario, but when the DB client
is application-server middleware, it would make some sense to cache in
the application server.

            regards, tom lane

---------------------------(end of broadcast)---------------------------
TIP 2: you can get off all lists at once with the unregister command
    (send "unregister YourEmailAddressHere" to majordomo@postgresql.org)


Re: Caching of Queries

From
"Jim C. Nasby"
Date:
On Mon, Sep 27, 2004 at 09:30:31PM +0100, Matt Clark wrote:
> It's certainly the case that the typical web app (which, along with
> warehouses, seems to be one half of the needy apps), could probably do
> worse than use pooling as well.  I'm not well up enough on pooling to
> know how bulletproof it is though, which is why I included it in my list
> of things that make me go 'hmm....'.  It would be really nice not to
> have to take both things together.

If you're not using a connection pool of some kind then you might as
well forget query plan caching, because your connect overhead will swamp
the planning cost. This does not mean you have to use something like
pgpool (which makes some rather questionable claims IMO); any decent web
application language/environment will support connection pooling.
--
Jim C. Nasby, Database Consultant               decibel@decibel.org
Give your computer some brain candy! www.distributed.net Team #1828

Windows: "Where do you want to go today?"
Linux: "Where do you want to go tomorrow?"
FreeBSD: "Are you guys coming, or what?"

Re: Caching of Queries

From
Matt Clark
Date:
>If you're not using a connection pool of some kind then you might as
>well forget query plan caching, because your connect overhead will swamp
>the planning cost. This does not mean you have to use something like
>pgpool (which makes some rather questionable claims IMO); any decent web
>application language/environment will support connection pooling.
>
>
Hmm, a question of definition -  there's a difference between a pool and
a persistent connection.  Pretty much all web apps have one connection
per process, which is persistent (i.e. not dropped and remade for each
request), but not shared between processes, therefore not pooled.

Re: Caching of Queries

From
"Jim C. Nasby"
Date:
On Fri, Oct 01, 2004 at 06:43:42AM +0100, Matt Clark wrote:
>
> >If you're not using a connection pool of some kind then you might as
> >well forget query plan caching, because your connect overhead will swamp
> >the planning cost. This does not mean you have to use something like
> >pgpool (which makes some rather questionable claims IMO); any decent web
> >application language/environment will support connection pooling.
> >
> >
> Hmm, a question of definition -  there's a difference between a pool and
> a persistent connection.  Pretty much all web apps have one connection
> per process, which is persistent (i.e. not dropped and remade for each
> request), but not shared between processes, therefore not pooled.

OK, that'd work too... the point is if you're re-connecting all the time
it doesn't really matter what else you do for performance.
--
Jim C. Nasby, Database Consultant               decibel@decibel.org
Give your computer some brain candy! www.distributed.net Team #1828

Windows: "Where do you want to go today?"
Linux: "Where do you want to go tomorrow?"
FreeBSD: "Are you guys coming, or what?"

Re: Caching of Queries

From
"Matt Clark"
Date:
> OK, that'd work too... the point is if you're re-connecting
> all the time it doesn't really matter what else you do for
> performance.

Yeah, although there is the chap who was asking questions on the list
recently who had some very long-running code on his app servers, so was best
off closing the connection because he had far too many postmaster processes
just sitting there idle all the time!

But you're right, it's a killer usually.

M


Re: Caching of Queries

From
"Aaron Werman"
Date:
The context of the discussion was a hack to speed queries against static
tables, so MVCC is not relevent.  As soon as any work unit against a
referenced table commits, the cache is invalid, and in fact the table
shouldn't be a candidate for this caching for a while. In fact, this cache
would reduce some the MVCC 'select count(*) from us_states' type of horrors.

(The attraction of a server side cache is obviously that it could *with no
server or app changes* dramatically improve performance. A materialized view
is a specialized denormalization-ish mechanism to optimize a category of
queries and requires the DBA to sweat the details. It is very hard to cache
things stochastically without writing a server. Trigger managed extracts
won't help you execute 1,000 programs issuing the  query  "select sec_level
from sec where division=23" each second or a big table loaded monthly.)



----- Original Message -----
From: "Jeff" <threshar@torgo.978.org>
To: "Mitch Pirtle" <mitch.pirtle@gmail.com>
Cc: "Aaron Werman" <awerman@hotmail.com>; "Scott Kirkwood"
<scottakirkwood@gmail.com>; "Neil Conway" <neilc@samurai.com>;
<pgsql-performance@postgresql.org>; "Tom Lane" <tgl@sss.pgh.pa.us>
Sent: Monday, September 27, 2004 2:25 PM
Subject: Re: [PERFORM] Caching of Queries


> [ discussion of server side result caching ]
>
> and lets not forget PG's major fork it will throw into things:  MVCC
> The results of query A may hold true for txn 1, but not txn 2 and so on
> .
> That would have to be taken into account as well and would greatly
> complicate things.
>
> It is always possible to do a "poor man"'s query cache with triggers..
> which would just leave you with basically a materialized view.
>
> --
> Jeff Trout <jeff@jefftrout.com>
> http://www.jefftrout.com/
> http://www.stuarthamm.net/
>
>
> ---------------------------(end of broadcast)---------------------------
> TIP 8: explain analyze is your friend
>

Re: Caching of Queries

From
Josh Berkus
Date:
People:

Transparent "query caching" is the "industry standard" for how these things
are handled.   However, Postgres' lack of this feature has made me consider
other approaches, and I'm starting to wonder if the "standard" query caching
-- where a materialized query result, or some reduction thereof, is cached in
database memory -- isn't the best way to cache things.  I'm going to
abbreviate it "SQC" for the rest of this e-mail.

Obviously, the draw of SQC is its transparency to developers.   With it, the
Java/Perl/PHP programmers and the DBA don't have to communicate at all -- you
set it up, give it some RAM, and it "just works".   As someone who frequently
has to consult based on limited knowledge, I can understand the appeal.

However, one of the problems with SQC, aside from the ones already mentioned
of stale data and/or cache-clearing, is that (at least in applications like
MySQL's) it is indiscriminate and caches, at least breifly, unique queries as
readily as common ones.   Possibly Oracle's implementation is more
sophisticated; I've not had an opportunity.

The other half of that problem is that an entire query is cached, rather than
just the relevant data to uniquely identify the request to the application.
This is bad in two respects; one that the entire query needs to be parsed to
see if a new query is materially equivalent, and that two materially
different queries which could utilize overlapping ranges of the same
underlying result set must instead cache their results seperately, eating up
yet more memory.

To explain what I'm talking about, let me give you a counter-example of
another approach.

I have a data-warehousing application with a web front-end.    The data in the
application is quite extensive and complex, and only a summary is presented
to the public users -- but that summary is a query involving about 30 lines
and 16 joins.  This summary information is available in 3 slightly different
forms.  Further, the client has indicated that an up to 1/2 hour delay in
data "freshness" is acceptable.

The first step is forcing that "materialized" view of the data into memory.
Right now I'm working on a reliable way to do that without using Memcached,
which won't install on our Solaris servers.  Temporary tables have the
annoying property of being per-connection, which doesn't work in a pool of 60
connections.

The second step, which I completed first due to the lack of technical
obstacles, is to replace all queries against this data with calls to a
Set-Returning Function (SRF).   This allowed me to re-direct where the data
was coming from -- presumably the same thing could be done through RULES, but
it would have been considerably harder to implement.

The first thing the SRF does is check the criteria passed to it against a set
of cached (in a table) criteria with that user's permission level which is <
1/2 hour old.   If the same criteria are found, then the SRF is returned a
set of row identifiers for the materialized view (MV), and looks up the rows
in the MV and returns those to the web client.

If no identical set of criteria are found, then the query is run to get a set
of identifiers which are then cached, and the SRF returns the queried rows.

Once I surmount the problem of storing all the caching information in
protected memory, the advantages of this approach over SQC are several:

1) The materialized data is available in 3 different forms; a list, a detail
view, and a spreadsheet.  Each form as somewhat different columns and
different rules about ordering, which would likely confuse an SQC planner.
In this implementation, all 3 forms are able to share the same cache.

2) The application is comparing only sets of unambguous criteria rather than
long queries which would need to be compared in planner form in order to
determine query equivalence.

3) With the identifier sets, we are able to cache other information as well,
such as a count of rows, further limiting the number of queries we must run.

4) This approach is ideally suited to the pagination and re-sorting common to
a web result set.  As only the identifiers are cached, the results can be
re-sorted and broken in to pages after the cache read, a fast, all-in-memory
operation.

In conclusion, what I'm saying is that while forms of transparent query
caching (plan, materialized or whatever) may be desirable for other reasons,
it's quite possible to acheive a superior level of "query caching" through
tight integration with the front-end application.

If people are interested in this, I'd love to see some suggestions on ways to
force the materialized view into dedicated memory.

--
Josh Berkus
Aglio Database Solutions
San Francisco

Re: Caching of Queries

From
"Aaron Werman"
Date:
I'm not sure I understand your req fully. If the same request is repeatedly
done with same parameters, you could implement a proxy web server with a
croned script to purge stale pages. If there is substantially the same data
being summarized, doing your own summary tables works; if accessed enough,
they're in memory. I interleaved some notes into your posting.

----- Original Message -----

From: "Josh Berkus" <josh@agliodbs.com>

To: "Postgresql Performance" <pgsql-performance@postgresql.org>

Sent: Friday, October 01, 2004 1:10 PM

Subject: Re: [PERFORM] Caching of Queries




> People:
>
> Transparent "query caching" is the "industry standard" for how these
things
> are handled.   However, Postgres' lack of this feature has made me
consider
> other approaches, and I'm starting to wonder if the "standard" query
caching
> -- where a materialized query result, or some reduction thereof, is cached
in
> database memory -- isn't the best way to cache things.  I'm going to
> abbreviate it "SQC" for the rest of this e-mail.
>
> Obviously, the draw of SQC is its transparency to developers.   With it,
the
> Java/Perl/PHP programmers and the DBA don't have to communicate at all --
you
> set it up, give it some RAM, and it "just works".   As someone who
frequently
> has to consult based on limited knowledge, I can understand the appeal.

My sense is that pg is currently unique among popular dbmses in having the
majority of applications being homegrown (a chicken / egg / advocacy issue -
if I install a CMS, I'm not the DBA or the PHP programmer - and I don't want
to change the code; we'll see more about this when native WinPg happens).




>
> However, one of the problems with SQC, aside from the ones already
mentioned
> of stale data and/or cache-clearing, is that (at least in applications
like
> MySQL's) it is indiscriminate and caches, at least breifly, unique queries
as
> readily as common ones.   Possibly Oracle's implementation is more
> sophisticated; I've not had an opportunity.

I'm not sure I agree here. Stale data and caching choice are
optimizer/buffer manager choices and implementation can decide whether to
allow stale data. These are design choices involving development effort and
choices of where to spend server cycles and memory. All buffering choices
cache unique objects, I'm not sure why this is bad (but sensing you want
control of the choices). FWIW, this is my impression of other dbmses.

In MySQL, a global cache can be specified with size and globally, locally,
or through statement hints in queries to suggest caching results. I don't
believe that these could be used as common subexpressions (with an exception
of MERGE table component results). The optimizer knows nothing about the
cached results - SQL select statements are hashed, and can be replaced by
the the cached statement/results on a match.

In DB2 and Oracle result sets are not cached. They have rich sets of
materialized view features (that match your requirements). They allow a
materialized view to be synchronous with table updates or asynchronous.
Synchronous is often an unrealistic option, and asynchronous materialized
views are refreshed at a specified schedule. The optimizers allow "query
rewrite" (in Oracle it is a session option) so one can connect to the
database and specify that the optimizer is allowed to replace subexpressions
with data from (possibly stale) materialized views. SQL Server 2K has more
restrictive synchronous MVs, but I've never used them.

So, in your example use in Oracle, you would need to define appropriate MVs
with a ½ hour refresh frequency, and hope that the planner would use them in
your queries. The only change in the app is on connection you would allow
use of asynchronous stale data.

You're suggesting an alternative involving identifying common, but
expensive, subexpressions and generating MVs for them. This is a pretty
sophisticated undertaking, and probably requires some theory research to
determine if it's viable.


>
> The other half of that problem is that an entire query is cached, rather
than
> just the relevant data to uniquely identify the request to the
application.
> This is bad in two respects; one that the entire query needs to be parsed
to
> see if a new query is materially equivalent, and that two materially
> different queries which could utilize overlapping ranges of the same
> underlying result set must instead cache their results separately, eating
up
> yet more memory.

There are two separate issues. The cost of parse/optimization and the cost
of results retrieval. Other dbmses hash statement text. This is a good
thing, and probably 3 orders of magnitude faster than parse and
optimization. (Oracle also has options to replace literals with parameters
and match parse trees instead of text, expecting parse costs to be less than
planning costs.) MySQL on a match simply returns the result set. Oracle and
DB2 attempt to rewrite queries to use the DBA selected extracts. The MySQL
approach seems to be almost what you're describing: all it needs is the
statement hash, statement, and result set. The rest of your wish list,
identifying and caching data to satisfy multiple request is what query
rewrite does - as long as you've created the appropriate MV.




>
> To explain what I'm talking about, let me give you a counter-example of
> another approach.
>
> I have a data-warehousing application with a web front-end.    The data in
the
> application is quite extensive and complex, and only a summary is
presented
> to the public users -- but that summary is a query involving about 30
lines
> and 16 joins.  This summary information is available in 3 slightly
different
> forms.  Further, the client has indicated that an up to 1/2 hour delay in
> data "freshness" is acceptable.

This sounds like a requirement for a summary table - if the data can be
summarized appropriately, and a regular refresh process.




>
> The first step is forcing that "materialized" view of the data into
memory.
> Right now I'm working on a reliable way to do that without using
Memcached,
> which won't install on our Solaris servers.  Temporary tables have the
> annoying property of being per-connection, which doesn't work in a pool of
60
> connections.




I'm not clear on your desire to keep the data in memory. If it is because of
I/O cost of the summary table, database buffers should be caching it. If you
want to store calculated results, again - why not use a summary table? The
con of summary tables is the customization / denormalization of the data,
and the need to have programs use them instead of source data - you seem to
be willing to do each of these things.


>
> The second step, which I completed first due to the lack of technical
> obstacles, is to replace all queries against this data with calls to a
> Set-Returning Function (SRF).   This allowed me to re-direct where the
data
> was coming from -- presumably the same thing could be done through RULES,
but
> it would have been considerably harder to implement.
>
> The first thing the SRF does is check the criteria passed to it against a
set
> of cached (in a table) criteria with that user's permission level which is
<
> 1/2 hour old.   If the same criteria are found, then the SRF is returned a
> set of row identifiers for the materialized view (MV), and looks up the
rows
> in the MV and returns those to the web client.
>
> If no identical set of criteria are found, then the query is run to get a
set
> of identifiers which are then cached, and the SRF returns the queried
rows.
>
> Once I surmount the problem of storing all the caching information in
> protected memory, the advantages of this approach over SQC are several:

You are creating summary data on demand. I have had problems with this
approach, mostly because it tends to cost more than doing it in batch and
adds latency (unfortunately adding to peak load - so I tend to prefer
periodic extract/summarize programs). In either approach why don't you want
pg to cache the data? The result also feels more like persisted object data
than typical rdbms processing.

>
> 1) The materialized data is available in 3 different forms; a list, a
detail
> view, and a spreadsheet.  Each form as somewhat different columns and
> different rules about ordering, which would likely confuse an SQC planner.
> In this implementation, all 3 forms are able to share the same cache.




I'm not clear what the issue here is. Are you summarizing data differently
or using some business rules to identify orthogonal queries?


>
> 2) The application is comparing only sets of unambiguous criteria rather
than
> long queries which would need to be compared in planner form in order to
> determine query equivalence.

>
> 3) With the identifier sets, we are able to cache other information as
well,
> such as a count of rows, further limiting the number of queries we must
run.
>
> 4) This approach is ideally suited to the pagination and re-sorting common
to
> a web result set.  As only the identifiers are cached, the results can be
> re-sorted and broken in to pages after the cache read, a fast,
all-in-memory
> operation.
>
> In conclusion, what I'm saying is that while forms of transparent query
> caching (plan, materialized or whatever) may be desirable for other
reasons,
> it's quite possible to achieve a superior level of "query caching" through
> tight integration with the front-end application.

This looks like you're building an object store to support a custom app that
periodically or on demand pulls rdbms data mart data. The description of the
use seems either static, suggesting summary tables or dynamic, suggesting
that you're mimicking some function of a periodically extracted OLAP cube.




>
> If people are interested in this, I'd love to see some suggestions on ways
to
> force the materialized view into dedicated memory.




Can you identify your objections to summarizing the data and letting pg
buffer it?

/Aaron

Re: Caching of Queries

From
Josh Berkus
Date:
Aaron,

> I'm not sure I understand your req fully.

I'm not surprised.    I got wrapped up in an overly involved example and
completely left off the points I was illustrating.   So here's the points, in
brief:

1) Query caching is not  a single problem, but rather several different
problems requiring several different solutions.

2) Of these several different solutions, any particular query result caching
implementation (but particularly MySQL's) is rather limited in its
applicability, partly due to the tradeoffs required.    Per your explanation,
Oracle has improved this by offering a number of configurable options.

3) Certain other caching problems would be solved in part by the ability to
construct "in-memory" tables which would be non-durable and protected from
cache-flushing.  This is what I'm interested in chatting about.

BTW, I AM using a summary table.

--
--Josh

Josh Berkus
Aglio Database Solutions
San Francisco

Re: Caching of Queries

From
William Yu
Date:
Josh Berkus wrote:

> 1) Query caching is not  a single problem, but rather several different
> problems requiring several different solutions.
>
> 2) Of these several different solutions, any particular query result caching
> implementation (but particularly MySQL's) is rather limited in its
> applicability, partly due to the tradeoffs required.    Per your explanation,
> Oracle has improved this by offering a number of configurable options.
>
> 3) Certain other caching problems would be solved in part by the ability to
> construct "in-memory" tables which would be non-durable and protected from
> cache-flushing.  This is what I'm interested in chatting about.

Just my 2 cents on this whole issue. I would lean towards having result
caching in pgpool versus the main backend. I want every ounce of memory
on a database server devoted to the database. Caching results would
double the effect of cache flushing ... ie, now both the results and the
pages used to build the results are in memory pushing out other stuff to
disk that may be just as important.

If it was in pgpool or something similar, I could devote a separate
machine just for caching results leaving the db server untouched.

Re: Caching of Queries

From
Josh Berkus
Date:
William,

> Just my 2 cents on this whole issue. I would lean towards having result
> caching in pgpool versus the main backend. I want every ounce of memory
> on a database server devoted to the database. Caching results would
> double the effect of cache flushing ... ie, now both the results and the
> pages used to build the results are in memory pushing out other stuff to
> disk that may be just as important.
>
> If it was in pgpool or something similar, I could devote a separate
> machine just for caching results leaving the db server untouched.

Oddly, Joe Conway just mentioned the same idea to me.

--
--Josh

Josh Berkus
Aglio Database Solutions
San Francisco

Re: Caching of Queries

From
Pierre-Frédéric Caillaud
Date:
> pgpool (which makes some rather questionable claims IMO); any decent web
> application language/environment will support connection pooling.

    That's why it should not be tied to something specific as pgpool.

    If you want performance, which is the case here, usually you have a
webserver serving static files, and an application server serving dynamic
pages.
    This is not necessarily a huge application server, it can be as simple as
an Apache instance serving static files, with a special path mod_proxy'ed
to another instance of apache acting as an application server.
    IMHO this is a nice way to do it, because you have a light weight static
files server which can spawn many processes without using precious
resources like memory and postgres connections, and a specialized server
which has a lot less processes, each one having more size, a db
connection, etc. The connexions are permanent, of course, so there is no
connection overhead. The proxy has an extra advantage buffering the data
 from the "app server" and sending it back slowly to the client, so the app
server can then very quickly process the next request instead of hogging a
db connection while the html is slowly trickled back to the client.
    IMHO the standard PHP way of doing things (just one server) is wrong
because every server process, even if it's serving static files, hogs a
connection and thus needs an extra layer for pooling.
    Thus, I see query result caching as a way to pushing further
architectures which are already optimized for performance, not as a
band-aid for poor design solutions like the one-apache server with pooling.

    Now, a proposition :

    Here is where we are now, a typical slow query :
    PREPARE myquery(text,integer)
    EXECUTE myquery('john',2)

    My proposition :
    PREPARE myquery(text,integer)
    PLANNED USING ('john',2)
    CACHED IF $1 IS NOT NULL AND $2 IS NOT NULL
        DEPENDS ON $1, $2
        MAXIMUM CACHE TIME '5 minute'::interval
        MINIMUM CACHE TIME '1 minute'::interval
        MAXIMUM CACHE SIZE 2000000
    AS SELECT count(*) as number FROM mytable WHERE myname=$2 AND myfield>=$1;

    EXECUTE myquery('john',2)

    Explainations :
    -----------
    PLANNED USING ('john',2)
    Tells the planner to compute the stored query plan using the given
parameters. This is independent from caching but could be a nice feature
as it would avoid the possibility of storing a bad query plan.

    -----------
    CACHED IF $1 IS NOT NULL AND $2 IS NOT NULL
    Specifies that the result is to be cached. There is an optional condition
(here, IF ...) telling postgres of when and where it should cache, or not
cache. It could be useful to avoid wasting cache space.
    -----------
        DEPENDS ON $1, $2
    Defines the cache key. I don't know if this is useful, as the query
parameters make a pretty obvious cache key so why repeat them. It could be
used to add other data as a cache key, like :
        DEPENDS ON (SELECT somefunction($1))
    Also a syntax for specifying which tables should be watched for updates,
and which should be ignored, could be interesting.
    -----------
        MAXIMUM CACHE TIME '5 minute'::interval
    Pretty obvious.
    -----------
        MINIMUM CACHE TIME '1 minute'::interval
    This query is a count and I want a fast but imprecise count. Thus, I
specify a minimum cache time of 1 minute, meaning that the result will
stay in the cache even if the tables change. This is dangerous, so I'd
suggest the following :

        MINIMUM CACHE TIME CASE WHEN result.number>10 THEN '1 minute'::interval
ELSE '5 second'::interval

    Thus the cache time is an expression ; it is evaluated after performed
the query. There needs to be a way to access the 'count' result, which I
called 'result.number' because of the SELECT count() as number.
    The result could also be used in the CACHE IF.

    The idea here is that the count will vary over time, but we accept some
imprecision to gain speed. SWho cares if there are 225 or 227 messages in
a forum thread counter anyway ? However, if there are 2 messages, first
caching the query is less necessary because it's fast, and second a
variation in the count will be much easier to spot, thus we specify a
shorter cache duration for small counts and a longer duration for large
counts.

    For queries returning result sets, this is not usable of course, but a
special feature for speeding count() queries would be welcome !

    -----------
        MAXIMUM CACHE SIZE 2000000
    Pretty obvious. Size in bytes.

    For queries returning several rows, MIN/MAX on result rows could be
useful also :
        MAXIMUM RESULT ROWS nnn
    Or maybe :
        CACHE IF (select count(*) from result) > nnn



    Thinking about it, using prepared queries seems a bad idea ; maybe the
cache should act on the result of functions. This would force the
application programmers to put the queries they want to optimize in
functions, but as function code is shared between connections and prepared
statements are not, maybe it would be easier to implement, and would
shield against some subtle bugs, like PREPARing the different queries
under the same name...

    In that case the cache manager would also know if the function returns
SETOF or not, which would be interesting.

    What do you think of these optimizations ?

    Right now, a count() query cache could be implemented as a simple plsql
function with a table as the cache, by the way.






















Re: Caching of Queries

From
Pierre-Frédéric Caillaud
Date:
> 1) The materialized data is available in 3 different forms; a list, a
> detail
> view, and a spreadsheet.  Each form as somewhat different columns and
> different rules about ordering, which would likely confuse an SQC
> planner.
> In this implementation, all 3 forms are able to share the same cache.

    See my proposal to cache function results.
    You can create a cached function and :

    SELECT your rows FROM cached_function(parameters) WHERE ... ORDER BY...
GROUP BY...

    will only fetch the function result from the cache, and then the only
additional costs are the ORDER and GROUP BY... the query parsing is very
simple, it's just a select, and a "cached function scan"

    I think caching can be made much more powerful if it is made usable like
this. I mean, not only cache a query and its result, but being able to use
cached queries internally like this and manipulaing them, adds value to
the cached data and allows storing less data in the cache because
duplicates are avoided. Thus we could use cached results in CHECK()
conditions, inside plsql functions, anywhere...

Re: Caching of Queries

From
Pierre-Frédéric Caillaud
Date:
> If it was in pgpool or something similar, I could devote a separate
> machine just for caching results leaving the db server untouched.

    BUT you would be limited to caching complete queries. There is a more
efficient strategy...




Re: Caching of Queries

From
Tatsuo Ishii
Date:
> >>More to the point though, I think this is a feature that really really
> >>should be in the DB, because then it's trivial for people to use.
> >>
> >>
> >
> >How does putting it into PGPool make it any less trivial for people to
> >use?
> >
> The answers are at  http://www2b.biglobe.ne.jp/~caco/pgpool/index-e.html
> .  Specifically, it's a separate application that needs configuration,
> the homepage has no real discussion of the potential pitfalls of pooling
> and what this implementation does to get around them, you get the idea.

I don't know what you are exactly referring to in above URL when you
are talking about "potential pitfalls of pooling". Please explain
more.
--
Tatsuo Ishii

Re: Caching of Queries

From
"Jim C. Nasby"
Date:
On Fri, Oct 01, 2004 at 10:10:40AM -0700, Josh Berkus wrote:
> Transparent "query caching" is the "industry standard" for how these things
> are handled.   However, Postgres' lack of this feature has made me consider
> other approaches, and I'm starting to wonder if the "standard" query caching
> -- where a materialized query result, or some reduction thereof, is cached in
> database memory -- isn't the best way to cache things.  I'm going to
> abbreviate it "SQC" for the rest of this e-mail.

Not to quibble, but are you sure that's the standard? Oracle and DB2
don't do this, and I didn't think MSSQL did either. What they do do is
cache query *plans*. This is a *huge* deal in Oracle; search
http://asktom.oracle.com for 'soft parse'.

In any case, I think a means of marking some specific queries as being
cachable is an excellent idea; perfect for 'static data' scenarios. What
I don't know is how much will be saved.
--
Jim C. Nasby, Database Consultant               decibel@decibel.org
Give your computer some brain candy! www.distributed.net Team #1828

Windows: "Where do you want to go today?"
Linux: "Where do you want to go tomorrow?"
FreeBSD: "Are you guys coming, or what?"

Re: Caching of Queries

From
"Matt Clark"
Date:
> I don't know what you are exactly referring to in above URL
> when you are talking about "potential pitfalls of pooling".
> Please explain more.

Sorry, I wasn't implying that pgpool doesn't deal with the issues, just that
some people aren't necessarily aware of them up front.  For instance, pgpool
does an 'abort transaction' and a 'reset all' in lieu of a full reconnect
(of course, since a full reconnect is exactly what we are trying to avoid).
Is this is enough to guarantee that a given pooled connection behaves
exactly as a non-pooled connection would from a client perspective?  For
instance, temporary tables are usually dropped at the end of a session, so a
client (badly coded perhaps) that does not already use persistent
connections might be confused when the sequence 'connect, create temp table
foo ..., disconnect, connect, create temp table foo ...' results in the
error 'Relation 'foo' already exists'.



Re: Caching of Queries

From
Tatsuo Ishii
Date:
> > I don't know what you are exactly referring to in above URL
> > when you are talking about "potential pitfalls of pooling".
> > Please explain more.
>
> Sorry, I wasn't implying that pgpool doesn't deal with the issues, just that
> some people aren't necessarily aware of them up front.  For instance, pgpool
> does an 'abort transaction' and a 'reset all' in lieu of a full reconnect
> (of course, since a full reconnect is exactly what we are trying to avoid).
> Is this is enough to guarantee that a given pooled connection behaves
> exactly as a non-pooled connection would from a client perspective?  For
> instance, temporary tables are usually dropped at the end of a session, so a
> client (badly coded perhaps) that does not already use persistent
> connections might be confused when the sequence 'connect, create temp table
> foo ..., disconnect, connect, create temp table foo ...' results in the
> error 'Relation 'foo' already exists'.

First, it's not a particular problem with pgpool. As far as I know any
connection pool solution has exactly the same problem. Second, it's
easy to fix if PostgreSQL provides a functionarity such as:"drop all
temporary tables if any". I think we should implement it if we agree
that connection pooling should be implemented outside the PostgreSQL
engine itself. I think cores agree with this.
--
Tatsuo Ishii

Re: Caching of Queries

From
Tom Lane
Date:
Tatsuo Ishii <t-ishii@sra.co.jp> writes:
> First, it's not a particular problem with pgpool. As far as I know any
> connection pool solution has exactly the same problem. Second, it's
> easy to fix if PostgreSQL provides a functionarity such as:"drop all
> temporary tables if any".

I don't like that definition exactly --- it would mean that every time
we add more backend-local state, we expect client drivers to know to
issue the right incantation to reset that kind of state.

I'm thinking we need to invent a command like "RESET CONNECTION" that
resets GUC variables, drops temp tables, forgets active NOTIFYs, and
generally does whatever else needs to be done to make the session state
appear virgin.  When we add more such state, we can fix it inside the
backend without bothering clients.

I now realize that our "RESET ALL" command for GUC variables was not
fully thought out.  We could possibly redefine it as doing the above,
but that might break some applications ...

            regards, tom lane

Re: Caching of Queries

From
Tatsuo Ishii
Date:
> Tatsuo Ishii <t-ishii@sra.co.jp> writes:
> > First, it's not a particular problem with pgpool. As far as I know any
> > connection pool solution has exactly the same problem. Second, it's
> > easy to fix if PostgreSQL provides a functionarity such as:"drop all
> > temporary tables if any".
>
> I don't like that definition exactly --- it would mean that every time
> we add more backend-local state, we expect client drivers to know to
> issue the right incantation to reset that kind of state.
>
> I'm thinking we need to invent a command like "RESET CONNECTION" that
> resets GUC variables, drops temp tables, forgets active NOTIFYs, and
> generally does whatever else needs to be done to make the session state
> appear virgin.  When we add more such state, we can fix it inside the
> backend without bothering clients.

Great. It's much better than I propose.

> I now realize that our "RESET ALL" command for GUC variables was not
> fully thought out.  We could possibly redefine it as doing the above,
> but that might break some applications ...
>
>             regards, tom lane
>

Re: Caching of Queries

From
Bruce Momjian
Date:
Added to TODO:

* Add RESET CONNECTION command to reset all session state

  This would include resetting of all variables (RESET ALL), dropping of
  all temporary tables, removal of any NOTIFYs, etc.  This could be used
  for connection pooling.  We could also change RESET ALL to have this
  functionality.


---------------------------------------------------------------------------

Tatsuo Ishii wrote:
> > Tatsuo Ishii <t-ishii@sra.co.jp> writes:
> > > First, it's not a particular problem with pgpool. As far as I know any
> > > connection pool solution has exactly the same problem. Second, it's
> > > easy to fix if PostgreSQL provides a functionarity such as:"drop all
> > > temporary tables if any".
> >
> > I don't like that definition exactly --- it would mean that every time
> > we add more backend-local state, we expect client drivers to know to
> > issue the right incantation to reset that kind of state.
> >
> > I'm thinking we need to invent a command like "RESET CONNECTION" that
> > resets GUC variables, drops temp tables, forgets active NOTIFYs, and
> > generally does whatever else needs to be done to make the session state
> > appear virgin.  When we add more such state, we can fix it inside the
> > backend without bothering clients.
>
> Great. It's much better than I propose.
>
> > I now realize that our "RESET ALL" command for GUC variables was not
> > fully thought out.  We could possibly redefine it as doing the above,
> > but that might break some applications ...
> >
> >             regards, tom lane
> >
>
> ---------------------------(end of broadcast)---------------------------
> TIP 2: you can get off all lists at once with the unregister command
>     (send "unregister YourEmailAddressHere" to majordomo@postgresql.org)
>

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 359-1001
  +  If your life is a hard drive,     |  13 Roberts Road
  +  Christ can be your backup.        |  Newtown Square, Pennsylvania 19073

Re: Caching of Queries

From
Pierre-Frédéric Caillaud
Date:
>> I've looked at PREPARE, but apparently it only lasts per-session -
>> that's
>> worthless in our case (web based service, one connection per
>> data-requiring
>> connection).

    You don't use persistent connections ???????????
    Your problem might simply be the connection time overhead (also including
a few TCP roudtrips).