Thread: query cache

query cache

From
Billy Earney
Date:
Greetings!

I've done a brief search of the postgresql mail archives, and I've noticed a
few projects for adding query caches to postgresql (for example, Masanori
Yamazaki's query cache proposal for GSOC 2011), as well as the query cache
announced at http://www.postgresql.org/about/news/1296/ (pgc).  Both of these
seem to be external solutions that act more like a proxy between clients and
servers, instead of being part of the server processes.

I'm wondering if anyone would be interested in a query cache as a backend to
postgresql?  I've been playing around with the postgresql code, and if I'm
understanding the code, I believe this is possible.  I've been writing some
code, but don't have anything working yet (I'm receiving a hash table
corruption error), but I'm working through it.

Here's my basic idea:

1.  Intercept select queries in execMain.c at ExecuteQuery and see if the
source text of this query is in the "query hash".  (Later we could make this
more sophisticated by using the query plan or some type of AST instead of the
query text, since adding or removing a space would create a different query
hash key.)

2.  If the query is in the cache, return the cached results of this query.

3.  If the query is not cached, run the query like normal, grabbing the
tuples as they are sent to the "dest" and storing them in the cache.  (For
now, I'm ignoring storage constraints, etc., but these details will need to
be added before going to production.)

To invalidate cache entries, look at the transactions being committed (and
written to the WAL log, if my memory serves me) and send a message to the
qcache process to invalidate any query which depends on the modified relation
(i.e., table, etc.).

For the experts out there, does this seem reasonable, or am I misunderstanding
the source code?  Anyone aware of a project trying to accomplish this?

Thanks!

Billy Earney
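As a standalone illustration of that flow (not actual backend code: the
string-keyed table, run_query() and execute_with_cache() are invented names,
and a real implementation would hook the executor and store real tuples
rather than strings), a minimal C sketch:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define NBUCKETS 64

/* toy cache entry: raw query text -> flattened "result" */
typedef struct CacheEntry
{
    char *query_text;
    char *result;
    struct CacheEntry *next;
} CacheEntry;

static CacheEntry *query_cache[NBUCKETS];

static unsigned hash_text(const char *s)
{
    unsigned h = 5381;

    for (; *s; s++)
        h = h * 33 + (unsigned char) *s;
    return h % NBUCKETS;
}

static const char *cache_lookup(const char *query)
{
    for (CacheEntry *e = query_cache[hash_text(query)]; e; e = e->next)
        if (strcmp(e->query_text, query) == 0)
            return e->result;
    return NULL;
}

static void cache_store(const char *query, const char *result)
{
    unsigned b = hash_text(query);
    CacheEntry *e = malloc(sizeof(CacheEntry));

    e->query_text = strdup(query);
    e->result = strdup(result);
    e->next = query_cache[b];
    query_cache[b] = e;
}

/* stand-in for running the query and collecting the tuples sent to "dest" */
static const char *run_query(const char *query)
{
    (void) query;
    return "(rows produced by the executor)";
}

static const char *execute_with_cache(const char *query)
{
    const char *cached = cache_lookup(query);   /* step 1: look up source text */

    if (cached)
        return cached;                          /* step 2: cache hit */

    const char *result = run_query(query);      /* step 3: run normally ... */
    cache_store(query, result);                 /* ... and remember the rows */
    return result;
}

int main(void)
{
    printf("%s\n", execute_with_cache("SELECT * FROM t"));  /* miss, then stored */
    printf("%s\n", execute_with_cache("SELECT * FROM t"));  /* served from cache */
    return 0;
}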

Re: query cache

From
Tom Lane
Date:
Billy Earney <billy.earney@gmail.com> writes:
> I'm wondering if anyone would be interested in a query cache as a backend
> to postgresql?

I believe this has been suggested and rejected several times before.
Did you look through the pgsql-hackers archives?

> To invalidate cache entries, look at the transactions being committed (and
> written to WAL log, if my memory serves me) and send a message to the
> qcache process to invalidate any query which depends on the modfied
> relation (ie, table, etc)

The complication, opportunities for bugs, and general slowdown
associated with that would outweigh any possible gain, in the opinion
of most hackers who have thought about this.
        regards, tom lane


Re: query cache

From
Greg Stark
Date:
On Fri, Mar 23, 2012 at 3:49 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> The complication, opportunities for bugs, and general slowdown
> associated with that would outweigh any possible gain, in the opinion
> of most hackers who have thought about this.

I wouldn't be quite so pessimistic. I think the problem is that the
hard part in doing this for real is all the parts the proposal glosses
over. How much memory is it worth dedicating to the cache before the
cost of that memory costs more than it helps? How do you invalidate
cache entries efficiently enough that it doesn't become a bottleneck?

Also, you need to identify the specific advantages you hope a built-in
cache would have over one implemented in the ORM or database library.
If there aren't any advantages then those solutions are much simpler.
And they have other advantages as well -- one of the main reasons
people implement caches is so they can move the load away from the
bottleneck of the database to the more easily scaled out application.


-- 
greg


Re: query cache

From
Billy Earney
Date:
On Fri, Mar 23, 2012 at 11:29 AM, Greg Stark <stark@mit.edu> wrote:
> On Fri, Mar 23, 2012 at 3:49 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> The complication, opportunities for bugs, and general slowdown
>> associated with that would outweigh any possible gain, in the opinion
>> of most hackers who have thought about this.
>
> I wouldn't be quite so pessimistic. I think the problem is that the
> hard part in doing this for real is all the parts the proposal glosses
> over. How much memory is it worth dedicating to the cache before the
> cost of that memory costs more than it helps? How do you invalidate
> cache entries efficiently enough that it doesn't become a bottleneck?
>
> Also, you need to identify the specific advantages you hope a built-in
> cache would have over one implemented in the ORM or database library.
> If there aren't any advantages then those solutions are much simpler.
> And they have other advantages as well -- one of the main reasons
> people implement caches is so they can move the load away from the
> bottleneck of the database to the more easily scaled out application.

Thanks for the input.  I've had many of these thoughts myself, and I guess it
depends on the environment the database will be used in, memory settings, and
other variables, how valuable a query cache would be.  I'll definitely give
this more thought before sending an official proposal.

Billy

Re: query cache

From
Robert Haas
Date:
On Fri, Mar 23, 2012 at 12:29 PM, Greg Stark <stark@mit.edu> wrote:
> On Fri, Mar 23, 2012 at 3:49 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> The complication, opportunities for bugs, and general slowdown
>> associated with that would outweigh any possible gain, in the opinion
>> of most hackers who have thought about this.
>
> I wouldn't be quite so pessimistic. I think the problem is that the
> hard part in doing this for real is all the parts the proposal glosses
> over. How much memory is it worth dedicating to the cache before the
> cost of that memory costs more than it helps? How do you invalidate
> cache entries efficiently enough that it doesn't become a bottleneck?

I think the question of how you would invalidate things is a very good one.

The other thing that makes me skeptical of this proposal is that I am
not very sure that executing absolutely identical queries is a very
common use case for a relational database.  I suppose there might be a
few queries that run over and over again (e.g. whatever you need to
render your home page), but I think those will be the exception, and
not the rule.  It therefore seems likely that the overhead of such a
cache would in most cases be greater than the benefit of having it in
the first place.

What I think is more common is the repeated submission of queries that
are *nearly* identical, but with either different parameter bindings
or different constants.  It would be nice to have some kind of cache
that would allow us to avoid the overhead of parsing and planning
nearly identical statements over and over again, but the trick is that
you have to fingerprint the query to notice that's happening in the
first place, and the fingerprinting has to cost less than what the
cache saves you.  I don't know whether that's possible, but I suspect
it's far from easy.
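
As a very rough sketch of the fingerprinting idea (for the record, Peter
Geoghegan's pg_stat_statements work fingerprints the post-parse-analysis tree
rather than the raw text; a purely textual approach like the one below is
naive and only meant to show the shape of the problem):

#include <ctype.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* cheap string hash (djb2); a real implementation might use md5 or similar */
static unsigned long hash_str(const char *s)
{
    unsigned long h = 5381;

    for (; *s; s++)
        h = ((h << 5) + h) + (unsigned char) *s;
    return h;
}

/* collapse numeric and string literals to '?' so nearly identical queries
 * that differ only in constants produce the same fingerprint */
static char *normalize_query(const char *q)
{
    char *out = malloc(strlen(q) + 1);   /* never longer than the input */
    size_t o = 0;

    for (size_t i = 0; q[i]; )
    {
        if (isdigit((unsigned char) q[i]))
        {
            while (isdigit((unsigned char) q[i]) || q[i] == '.')
                i++;
            out[o++] = '?';
        }
        else if (q[i] == '\'')           /* (doesn't handle escaped quotes) */
        {
            i++;
            while (q[i] && q[i] != '\'')
                i++;
            if (q[i] == '\'')
                i++;
            out[o++] = '?';
        }
        else
            out[o++] = q[i++];
    }
    out[o] = '\0';
    return out;
}

int main(void)
{
    char *a = normalize_query("SELECT * FROM users WHERE id = 42");
    char *b = normalize_query("SELECT * FROM users WHERE id = 7");

    /* both normalize to the same text, so the fingerprints match */
    printf("%lu %lu\n", hash_str(a), hash_str(b));
    free(a);
    free(b);
    return 0;
}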

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: query cache

From
Greg Stark
Date:
On Fri, Mar 23, 2012 at 5:03 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> The other thing that makes me skeptical of this proposal is that I am
> not very sure that executing absolutely identical queries is a very
> common use case for a relational database.  I suppose there might be a
> few queries that run over and over again (e.g. whatever you need to
> render your home page), but I think those will be the exception, and
> not the rule.  It therefore seems likely that the overhead of such a
> cache would in most cases be greater than the benefit of having it in
> the first place.

Well, it's not entirely unlikely. If you step back, a web application
looks like a big loop with a switch statement to go to different
pages. It keeps executing the same loop over and over again, and there
are only a smallish number of web pages. Sure, the bind variables
change, but there will only be so many bind values, and 10% of those
will get 90% of the traffic too.

But the other thing that happens is that people run multiple queries
aggregating or selecting from the same subset of data. So you often
get things like

select count(*) from (<complex subquery>)
select * from (<complex subquery>) order by foo limit 10
select * from (<complex subquery>) order by bar limit 10

for the same <complex subquery>. That means if we could cache the rows
coming out of parts of the plan and remember those rows when we see a
plan with a common subtree in the plan then we could avoid a lot of
repetitive work.

This depends on being able to recognize when we can guarantee that
subtrees of plans produce the same rows even if the surrounding tree
changes. That will be true sometimes but not other times.

--
greg


Re: query cache

From
Merlin Moncure
Date:
On Fri, Mar 23, 2012 at 12:03 PM, Robert Haas <robertmhaas@gmail.com> wrote:
> On Fri, Mar 23, 2012 at 12:29 PM, Greg Stark <stark@mit.edu> wrote:
>> On Fri, Mar 23, 2012 at 3:49 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>>> The complication, opportunities for bugs, and general slowdown
>>> associated with that would outweigh any possible gain, in the opinion
>>> of most hackers who have thought about this.
>>
>> I wouldn't be quite so pessimistic. I think the problem is that the
>> hard part in doing this for real is all the parts the proposal glosses
>> over. How much memory is it worth dedicating to the cache before the
>> cost of that memory costs more than it helps? How do you invalidate
>> cache entries efficiently enough that it doesn't become a bottleneck?
>
> I think the question of how you would invalidate things is a very good one.
>
> The other thing that makes me skeptical of this proposal is that I am
> not very sure that executing absolutely identical queries is a very
> common use case for a relational database.  I suppose there might be a
> few queries that run over and over again (e.g. whatever you need to
> render your home page), but I think those will be the exception, and
> not the rule.  It therefore seems likely that the overhead of such a
> cache would in most cases be greater than the benefit of having it in
> the first place.
>
> What I think is more common is the repeated submission of queries that
> are *nearly* identical, but with either different parameter bindings
> or different constants.  It would be nice to have some kind of cache
> that would allow us to avoid the overhead of parsing and planning
> nearly identical statements over and over again, but the trick is that
> you have to fingerprint the query to notice that's happening in the
> first place, and the fingerprinting has to cost less than what the
> cache saves you.  I don't know whether that's possible, but I suspect
> it's far from easy.

Query cache basically addresses two use cases:
1) read only or mostly read only workloads
2) badly written application code (either by human or machine)

The problem is that #1 can be optimized by any number of simple
techniques, and #2 is not a good basis for complicated internal
features with nasty trade-offs.  mysql's query cache woes are well
known -- it's typical for administrators to turn the feature off.  The
feature is misnamed -- it's really a 'benchmark cheating feature', since a lot
of db benchmarks tend to focus on single-user loads and/or highly repetitive
queries, but it completely falls over in real-world production
workloads.  Also, it's really not that difficult to rig an ad-hoc
cache in the server or on the client side and you can then gear it
towards your particular use-case.

People that are asking for this probably really want materialized views instead.

merlin


Re: query cache

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> What I think is more common is the repeated submission of queries that
> are *nearly* identical, but with either different parameter bindings
> or different constants.  It would be nice to have some kind of cache
> that would allow us to avoid the overhead of parsing and planning
> nearly identical statements over and over again, but the trick is that
> you have to fingerprint the query to notice that's happening in the
> first place, and the fingerprinting has to cost less than what the
> cache saves you.  I don't know whether that's possible, but I suspect
> it's far from easy.

The traditional solution to this is to make the application do it, ie,
parameterized prepared statements.  Since that has direct benefits to
the application as well, in that it can use out-of-line values and
thereby avoid quoting and SQL-injection risks, it's not apparent that
it's a good idea to expend lots of sweat to reverse-engineer
parameterization from a collection of unparameterized queries.
        regards, tom lane
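
For reference, the client-side version of that looks roughly like this with
libpq (the conninfo, table, and statement name are placeholders):

#include <stdio.h>
#include <libpq-fe.h>

int main(void)
{
    PGconn *conn = PQconnectdb("dbname=test");      /* placeholder conninfo */

    if (PQstatus(conn) != CONNECTION_OK)
    {
        fprintf(stderr, "connection failed: %s", PQerrorMessage(conn));
        PQfinish(conn);
        return 1;
    }

    /* parse and plan once ... */
    PGresult *res = PQprepare(conn, "get_user",
                              "SELECT name FROM users WHERE id = $1",
                              1, NULL);
    PQclear(res);

    /* ... then execute repeatedly with out-of-line parameter values,
     * with no quoting and no SQL-injection exposure for the parameter */
    const char *ids[] = {"1", "2", "3"};

    for (int i = 0; i < 3; i++)
    {
        const char *values[1] = {ids[i]};

        res = PQexecPrepared(conn, "get_user", 1, values, NULL, NULL, 0);
        if (PQresultStatus(res) == PGRES_TUPLES_OK && PQntuples(res) > 0)
            printf("%s\n", PQgetvalue(res, 0, 0));
        PQclear(res);
    }

    PQfinish(conn);
    return 0;
}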


Re: query cache

From
Robert Haas
Date:
On Fri, Mar 23, 2012 at 1:51 PM, Greg Stark <stark@mit.edu> wrote:
> Well it's not entirely unlikely. If you step back a web application
> looks like a big loop with a switch statement to go to different
> pages. It keeps executing the same loop over and over again and there
> are only a smallish number of web pages. Sure the bind variables
> change but there will only be so many bind values and 10% of those
> will get 90% of the traffic too.

That may be true, but lots of web applications have millions of users.  The
fact that a few hundred thousand of those may account for most of the traffic
doesn't seem like it's going to help much unless there are not many users in
total; and in that case it's plenty fast enough without a cache anyway.

> But the other thing that happens is that people run multiple queries
> aggregating or selecting from the same subset of data. So you often
> get things like
>
> select count(*) from (<complex subquery>)
> select * from (<complex subquery>) order by foo limit 10
> select * from (<complex subquery>) order by bar limit 10
>
> for the same <complex subquery>. That means if we could cache the rows
> coming out of parts of the plan and remember those rows when we see a
> plan with a common subtree in the plan then we could avoid a lot of
> repetitive work.

Currently, we don't even recognize this situation within a plan; for
example, if you do project pp LEFT JOIN person sr ON pp.sales_rep_id =
sr.id LEFT JOIN person pm ON pp.project_manager_id = pm.id, the query
planner will happily seq-scan the person table twice to build two
copies of the same hash table.

-- 
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: query cache

From
Joshua Berkus
Date:
Billy,

> I've done a brief search of the postgresql mail archives, and I've
> noticed a few projects for adding query caches to postgresql, (for
> example, Masanori Yamazaki's query cache proposal for GSOC 2011),

... which was completed, btw.  Take a look at the current release of pgPool.

Are you proposing this for GSOC2012, or is this just a general idea?

> I'm wondering if anyone would be interested in a query cache as a
> backend to postgresql? I've been playing around with the postgresql
> code, and if I'm understanding the code, I believe this is possible.

Well, you'd have to start by demonstrating the benefit of it.  The advantage
of query caches in proxies and clients is well-known, because you can offload
some of the work of the database onto other servers, thus increasing capacity.
Adding a query cache to the database server would require the "query identity
recognition" of the cache to be far cheaper (as in 10X cheaper) than planning
and running the query, which seems unlikely at best.

There are a number of proven caching models which PostgreSQL does not yet
implement.  I'd think it would be more profitable to pursue one of those,
such as:

* parse caching in the client (JDBC has this, but libpq does not).
* shared cached plans between sessions (snapshot issues here could be nasty)
* fully automated materialized views

If you want to do something radical and new, then come up with a way for a
client to request and then reuse a complete query plan by passing it to the
server.  That would pave the way for client-side plan caching (and plan
manipulation) code written in a variety of languages, and thus further
innovation through creative algorithms and other ideas.

--Josh Berkus





Re: query cache

From
Tom Lane
Date:
Joshua Berkus <josh@agliodbs.com> writes:
> If you want to do something radical and new, then come up with a way
> for a client to request and then reuse a complete query plan by
> passing it to the server.

[ raised eyebrow ]  That seems like a complete nonstarter on two
different grounds: cache invalidation needs (client won't know if plan
is stale) and security issues (pass broken plan to server, crash
server).  Those problems could be avoided if the client simply has a
token for a plan that's kept on the server side ... but how is that
concept different from a prepared statement?
        regards, tom lane


Re: query cache

From
Billy Earney
Date:


On Sat, Mar 24, 2012 at 3:22 PM, Joshua Berkus <josh@agliodbs.com> wrote:
> Billy,
>
>> I've done a brief search of the postgresql mail archives, and I've
>> noticed a few projects for adding query caches to postgresql, (for
>> example, Masanori Yamazaki's query cache proposal for GSOC 2011),
>
> ... which was completed, btw.  Take a look at the current release of pgPool.
>
> Are you proposing this for GSOC2012, or is this just a general idea?

Just a general idea, but if someone wants to work on it for GSOC2012, I
wouldn't mind giving a helping hand.  I'm not a student, so GSOC probably
doesn't apply to me.

>> I'm wondering if anyone would be interested in a query cache as a
>> backend to postgresql? I've been playing around with the postgresql
>> code, and if I'm understanding the code, I believe this is possible.
>
> Well, you'd have to start by demonstrating the benefit of it.  The advantage
> of query caches in proxies and clients is well-known, because you can
> offload some of the work of the database onto other servers, thus increasing
> capacity.  Adding a query cache to the database server would require the
> "query identity recognition" of the cache to be far cheaper (as in 10X
> cheaper) than planning and running the query, which seems unlikely at best.

I figured I'd create the md5 digest of the sourceText of a query, and then
look that up in a hash.  I don't think that will be very expensive.  I'll
have another hash to keep track of which queries are dependent on which
relations, so that when a relation is changed somehow (and committed), the
query is then invalidated and removed from the query hash.


Billy
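
A standalone sketch of that invalidation bookkeeping (illustrative only:
plain string keys stand in for md5 digests and relation OIDs, and there is no
locking, shared memory, or eviction):

#include <stdio.h>
#include <string.h>

#define MAX_DEPS 32

/* maps a relation to the key of a cached query that reads from it */
typedef struct Dependency
{
    const char *relation;
    const char *query_key;    /* digest of the cached query's source text */
    int valid;
} Dependency;

static Dependency deps[MAX_DEPS];
static int ndeps;

static void remember_dependency(const char *relation, const char *query_key)
{
    if (ndeps >= MAX_DEPS)
        return;
    deps[ndeps].relation = relation;
    deps[ndeps].query_key = query_key;
    deps[ndeps].valid = 1;
    ndeps++;
}

/* called once a committed transaction is known to have modified "relation" */
static void invalidate_relation(const char *relation)
{
    for (int i = 0; i < ndeps; i++)
        if (deps[i].valid && strcmp(deps[i].relation, relation) == 0)
        {
            printf("dropping cached query %s\n", deps[i].query_key);
            deps[i].valid = 0;  /* a real cache would also free the stored rows */
        }
}

int main(void)
{
    remember_dependency("users", "digest-of-query-1");   /* hypothetical keys */
    remember_dependency("orders", "digest-of-query-2");
    invalidate_relation("users");                        /* drops only query 1 */
    return 0;
}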

Re: query cache

From
Tatsuo Ishii
Date:
>> Well, you'd have to start by demonstrating the benefit of it.  The
>> advantage of query caches in proxies and clients is well-known, because you
>> can offload some of the work of the database onto other servers, thus
>> increasing capacity.  Adding a query cache to the database server would
>> require the "query identity recognition" of the cache to be far cheaper (as
>> in 10X cheaper) than planning and running the query, which seems unlikely
>> at best.
>
> I figured I'd create the md5 digest of the sourceText of a query, and then
> look that up in a hash.  I don't think that will be very expensive.  I'll
> have another hash to keep track of which queries are dependent on which
> relations, so that when a relation is changed somehow (and committed), the
> query is then invalidated and removed from the query hash.

From the experience of implementing query cache in pgpool-II there are
some suggestions:

- A query result cache should not be created if the transaction including
  the SELECT is not committed.

- Since a transaction could have many SELECTs, you need to keep those query
  results somewhere in temporary storage.  You could either discard or
  register them to the query cache storage depending on the transaction's
  fate, either aborted or committed.

- If a SELECT has non-immutable functions, then the query result should not
  be cached.

- If a SELECT uses temporary tables, then the query result should not be
  cached.

- If a SELECT uses unlogged tables, then the query result should not be
  cached because their data could vanish after crash recovery.  Of course
  this only applies if you plan to use cache storage which does not survive
  a crash.
--
Tatsuo Ishii
SRA OSS, Inc. Japan
English: http://www.sraoss.co.jp/index_en.php
Japanese: http://www.sraoss.co.jp
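
Restating that checklist as a single predicate, as a sketch (the struct and
its fields are invented for illustration; in a real backend these facts would
come from the parse tree, the catalogs, and transaction state):

#include <stdbool.h>
#include <stdio.h>

typedef struct QueryProperties
{
    bool transaction_committed;    /* only cache once the surrounding txn commits */
    bool has_non_immutable_funcs;  /* e.g. now(), random() */
    bool uses_temp_tables;
    bool uses_unlogged_tables;
    bool cache_survives_crash;     /* true if the cache storage outlives a crash */
} QueryProperties;

static bool result_is_cacheable(const QueryProperties *q)
{
    if (!q->transaction_committed)
        return false;
    if (q->has_non_immutable_funcs)
        return false;
    if (q->uses_temp_tables)
        return false;
    /* unlogged tables can lose their contents after crash recovery, so their
     * results are only safe in cache storage that does not survive a crash */
    if (q->uses_unlogged_tables && q->cache_survives_crash)
        return false;
    return true;
}

int main(void)
{
    QueryProperties q = {true, false, false, true, true};

    printf("cacheable: %s\n", result_is_cacheable(&q) ? "yes" : "no");  /* no */
    return 0;
}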


Re: query cache

From
Billy Earney
Date:
Thanks..  I'll keep those issues in mind.

On Sat, Mar 24, 2012 at 6:18 PM, Tatsuo Ishii <ishii@postgresql.org> wrote:
>>> Well, you'd have to start by demonstrating the benefit of it.  The
>>> advantage of query caches in proxies and clients is well-known, because
>>> you can offload some of the work of the database onto other servers, thus
>>> increasing capacity.  Adding a query cache to the database server would
>>> require the "query identity recognition" of the cache to be far cheaper
>>> (as in 10X cheaper) than planning and running the query, which seems
>>> unlikely at best.
>>
>> I figured I'd create the md5 digest of the sourceText of a query, and then
>> look that up in a hash.  I don't think that will be very expensive.  I'll
>> have another hash to keep track of which queries are dependent on which
>> relations, so that when a relation is changed somehow (and committed), the
>> query is then invalidated and removed from the query hash.
>
> From the experience of implementing query cache in pgpool-II there are
> some suggestions:
>
> - A query result cache should not be created if the transaction including
>   the SELECT is not committed.
>
> - Since a transaction could have many SELECTs, you need to keep those query
>   results somewhere in temporary storage.  You could either discard or
>   register them to the query cache storage depending on the transaction's
>   fate, either aborted or committed.
>
> - If a SELECT has non-immutable functions, then the query result should not
>   be cached.
>
> - If a SELECT uses temporary tables, then the query result should not be
>   cached.
>
> - If a SELECT uses unlogged tables, then the query result should not be
>   cached because their data could vanish after crash recovery.  Of course
>   this only applies if you plan to use cache storage which does not survive
>   a crash.
> --
> Tatsuo Ishii
> SRA OSS, Inc. Japan
> English: http://www.sraoss.co.jp/index_en.php
> Japanese: http://www.sraoss.co.jp

Re: query cache

From
Robert Haas
Date:
On Fri, Mar 23, 2012 at 2:54 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
> Robert Haas <robertmhaas@gmail.com> writes:
>> What I think is more common is the repeated submission of queries that
>> are *nearly* identical, but with either different parameter bindings
>> or different constants.  It would be nice to have some kind of cache
>> that would allow us to avoid the overhead of parsing and planning
>> nearly identical statements over and over again, but the trick is that
>> you have to fingerprint the query to notice that's happening in the
>> first place, and the fingerprinting has to cost less than what the
>> cache saves you.  I don't know whether that's possible, but I suspect
>> it's far from easy.
>
> The traditional solution to this is to make the application do it, ie,
> parameterized prepared statements.  Since that has direct benefits to
> the application as well, in that it can use out-of-line values and
> thereby avoid quoting and SQL-injection risks, it's not apparent that
> it's a good idea to expend lots of sweat to reverse-engineer
> parameterization from a collection of unparameterized queries.

But there are several disadvantages to parameterized queries, too.
One, it requires more messages at the protocol level, which is not
free.  Two, whatever abstraction layer you're using to submit queries
to the database has to support it, and not all of them do.  And three,
you get worse query plans if lack of knowledge about the specific
query parameters proves to be essential.  I know you've done some work
on this last point for 9.2, but I'm fuzzy on the details and on how
much benefit we'll get out of it in real-world cases.

Interestingly, Peter Geoghegan's blog post on the pg_stat_statements
patch you just committed[1] claims that the overhead of fingerprinting
queries was only 1-2.5%, which is less than I would have thought, so
if we ever get to the point where we're fairly sure we've got problem
three licked, it might make sense to revisit this due to problems one
and two.  It's also probably worth keeping in mind the next time we
bump the protocol version: it would be nice to have a way of doing
prepare-bind-execute in a single protocol message, which I believe to
be not possible at present.

[1] http://pgeoghegan.blogspot.com/2012/03/much-improved-statement-statistics.html

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company


Re: query cache

From
Tom Lane
Date:
Robert Haas <robertmhaas@gmail.com> writes:
> Interestingly, Peter Geoghegan's blog post on the pg_stat_statements
> patch you just committed[1] claims that the overhead of fingerprinting
> queries was only 1-2.5%, which is less than I would have thought, so
> if we ever get to the point where we're fairly sure we've got problem
> three licked, it might make sense to revisit this due to problems one
> and two.

Maybe.

> It's also probably worth keeping in mind the next time we
> bump the protocol version: it would be nice to have a way of doing
> prepare-bind-execute in a single protocol message, which I believe to
> be not possible at present.

Huh?  That's the standard way of doing it, actually.  You send
Prepare/Bind/Execute/Sync in one packet, and wait for results.

This requires that you understand the query well enough to perform
Bind without seeing a Describe result, but that seems essential
to any one-round-trip case anyway.  It's not the protocol design
that is holding back anybody who wants to do that.
        regards, tom lane
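
For what it's worth, libpq already drives the extended protocol this way for
an unnamed statement: as far as I can tell, PQexecParams batches
Parse/Bind/Execute/Sync (plus a Describe) and then waits for the results, so
the whole exchange is one round trip.  A minimal sketch (the conninfo and
query are placeholders):

#include <stdio.h>
#include <libpq-fe.h>

int main(void)
{
    PGconn *conn = PQconnectdb("dbname=test");      /* placeholder conninfo */

    if (PQstatus(conn) != CONNECTION_OK)
    {
        PQfinish(conn);
        return 1;
    }

    const char *values[1] = {"42"};

    /* Parse, Bind, Execute and Sync go out together; no separate prepare step */
    PGresult *res = PQexecParams(conn,
                                 "SELECT name FROM users WHERE id = $1",
                                 1, NULL, values, NULL, NULL, 0);

    if (PQresultStatus(res) == PGRES_TUPLES_OK && PQntuples(res) > 0)
        printf("%s\n", PQgetvalue(res, 0, 0));

    PQclear(res);
    PQfinish(conn);
    return 0;
}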


Re: query cache

From
Robert Haas
Date:
On Thu, Mar 29, 2012 at 4:57 PM, Tom Lane <tgl@sss.pgh.pa.us> wrote:
>> It's also probably worth keeping in mind the next time we
>> bump the protocol version: it would be nice to have a way of doing
>> prepare-bind-execute in a single protocol message, which I believe to
>> be not possible at present.
>
> Huh?  That's the standard way of doing it, actually.  You send
> Prepare/Bind/Execute/Sync in one packet, and wait for results.

That precludes some optimizations, like doing the whole thing under
the same snapshot, since you don't know for sure when the Execute will
be sent.  And as a practical matter, it's slower.  So there's some
room for improvement there, any way you slice it.

--
Robert Haas
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company