Thread: Re: eWeek Poll: Which database is most critical to your
-----Original Message-----
From: Neil Conway [mailto:nconway@klamath.dyndns.org]
Sent: Tuesday, February 26, 2002 3:04 PM
To: pgsql-hackers@postgresql.org
Subject: Re: [HACKERS] eWeek Poll: Which database is most critical to your

On Tue, 2002-02-26 at 15:30, Zak Greant wrote:
> Good Day All,
>
> eWeek has posted a poll that asks which database server is most critical
> to your organization.

The article mentions a MySQL feature which apparently improved
performance considerably:

//
MySQL 4.0.1's new, extremely fast query cache is also quite notable, as
no other database we tested had this feature. If the text of an incoming
query has a byte-for-byte match with a cached query, MySQL can retrieve
the results directly from the cache without compiling the query, getting
locks or doing index accesses. This query caching will be effective only
for tables with few updates, because any table updates clear the cache to
guarantee that correct results are always returned.
//

My guess is that it would be relatively simple to implement. Any
comments on this?

If I implemented this, any chance this would make it into the tree? Of
course, it would be:

 - disabled by default
 - enabled on a table-by-table basis (maybe an ALTER TABLE command)

>>>>>>>>

I don't see how it will do any good. There is no "prepare" in
PostgreSQL, and therefore you will simply be re-executing the queries
every time anyway. Also, parameter markers only work in embedded SQL, and
that is a single-tasking system.

I think it would be a major piece of work to do anything useful along
those lines.

If you look at how DB/2 works, you will see that they store prepared
statements. Another alternative would be to keep some point in the
parser marked and somehow jump to that point, but you would have to
be able to save a parse tree somewhere and also recognize the query.

Here is where problems come in...

-- Someone wants blue and blue-green, etc. shirts that are backordered
SELECT shirt, color, backorder_qty FROM garments WHERE color like 'BLUE%'

Now, another query comes along:

-- Someone else wants reddish, etc. shirts that are backordered:
SELECT shirt, color, backorder_qty FROM garments WHERE color like 'RED%'

It's the same query with different data. Without parameter markers you
will never know it. And yet this is exactly the sort of caching that is
useful.
On Tue, 2002-02-26 at 18:20, Dann Corbit wrote:
> I don't see how it will do any good. There is no "prepare" in
> Postgresql and therefore you will simply be reexecuting the queries
> every time anyway. Also, parameter markers only work in embedded SQL
> and that is a single tasking system.

Perhaps I wasn't clear. The feature I'm proposing is this:

When processing SELECT queries, but before any real work has been done,
look up the query in a hash table. If it already exists, return the
cached result. If it doesn't exist, execute the query and cache the
result in the hash table. Optionally, we could not immediately cache the
query, but just increment a "frequency" counter stored in the hash table.
If the counter goes above a certain constant, we decide that the query is
worth caching, so we cache the full result in the hash table.

When processing INSERTs, UPDATEs and DELETEs, check whether the query
would affect any of the tables for which we are maintaining this cache.
If so, flush the cache. This ensures that we will never return invalid
results. We could perhaps be fancy and keep stats on which columns our
cached queries utilize and which columns the modifying query will
affect, but that is unlikely to be an overall win.

HOWEVER -- I don't see this feature as something that will appeal to,
say, 75% of PgSQL users. If the table in question is being modified on a
regular basis, or if a wide variety of queries are being issued, this
cache isn't a good idea. Nevertheless, I think there are certainly some
situations in which this cache is useful -- and furthermore, these kinds
of "mostly read-only" situations are often where MySQL is chosen over
PostgreSQL.

Anyway, just putting this on the table -- if the consensus is that this
isn't a very worthwhile feature, I won't bother with it.

Cheers,

Neil

--
Neil Conway <neilconway@rogers.com>
PGP Key ID: DB3C29FC
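In outline, the proposal amounts to something like the following (a
minimal single-process sketch in Python; all names and the threshold
constant are invented, and a real backend cache would have to live in
shared memory, as Tom Lane points out later in the thread):

    CACHE_THRESHOLD = 3     # hypothetical: cache after this many repeats

    result_cache = {}       # query text -> cached result rows
    seen_count = {}         # query text -> how often we've seen it
    tables_to_queries = {}  # table name -> set of cached query texts

    def run_select(query, tables, execute):
        """Serve a byte-for-byte identical query from cache, else execute."""
        if query in result_cache:
            return result_cache[query]
        seen_count[query] = seen_count.get(query, 0) + 1
        rows = execute(query)
        if seen_count[query] >= CACHE_THRESHOLD:
            result_cache[query] = rows
            for t in tables:
                tables_to_queries.setdefault(t, set()).add(query)
        return rows

    def run_modify(query, tables, execute):
        """On INSERT/UPDATE/DELETE, flush every cached query touching the tables."""
        for t in tables:
            for q in tables_to_queries.pop(t, ()):
                result_cache.pop(q, None)
                seen_count.pop(q, None)
        return execute(query)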
Dann Corbit wrote:
<snip>
> It's the same query with different data. Without parameter markers you
> will never know it. And yet this is exactly the sort of caching that is
> useful.

However, an exact match is still not a bad idea. You might have a
different execution plan depending on the statistics of the data in your
column.

If there were a way to store the execution plan and start executing from
there, it still might not be a bad idea.
Neil Conway wrote:
<snip>

Hi everyone,

This is getting me to think: we don't have a really established process
for PostgreSQL development which allows for "experimental features", aka
the Linux kernel "EXPERIMENTAL", and so forth. aka:

./configure --experimental-querycache --experimental-pertablecostings
--experimental-something?

This way, people can still use PostgreSQL as per normal, but it also
allows for development of code which might or might not actually see the
light of day, depending on whether it turns out to really be useful.

Things should only become "experimental features" after the correct
consideration, not by default. There are situations where it would be
beneficial (perhaps in Neil's example here).

If Neil were to split off a project and do it on Sourceforge, then only
some people would check it out when it's kind of ready, etc, etc, and it
wouldn't really be as available to a wider audience. But if it's part of
the present source tree, I think the experimental features would see a
wider audience and we'd get a better indication of what's good/bad/etc.

Don't know how to support those features, but pretty much developers and
hard-core users should be the only ones using them, and therefore they
should be able to figure most of the stuff out for themselves (i.e. we'd
only get the real questions).

Does this seem like a worthwhile viewpoint to consider? If so, perhaps we
can think about starting to implement it for 7.3? (Wish I could code
better.) :)

Regards and best wishes,

Justin Clift

--
"My grandfather once told me that there are two kinds of people: those
who work and those who take the credit. He told me to try to be in the
first group; there was less competition there."
- Indira Gandhi
Neil Conway <nconway@klamath.dyndns.org> writes:
> When processing INSERTs, UPDATEs and DELETEs, check whether the query
> would affect any of the tables for which we are maintaining this cache.
> If so, flush the cache. This ensures that we will never return invalid
> results.

Note that this would imply that the cache is *global* across all
backends; therefore it is a shared data structure and hence an access
bottleneck. (Not to mention a memory-management headache, since the
size of shared memory can't easily be varied on-the-fly.)

I cannot believe that caching results for literally-identical queries
is a win, except perhaps for the most specialized (read: brain-dead)
applications. Has anyone looked at the details of the test case that
MySQL uses to claim that this is a good idea? Has it got any similarity
to your own usage patterns?

We have talked about caching query plans for suitably-parameterized
queries, but even that seems unlikely to be a huge win; at least I'd
not think it useful to try to drive the cache totally automatically.
If an application could say "here's a query I expect to use a lot,
varying these specific parameters", then caching a plan for that would
make sense.

Now, there are notions of "prepared statements" in many access APIs
that fit this description, and in fact the underlying capability exists
in the backend --- we've just not gotten around to building the
interfaces to tie it all together. *That* would be worth working on.

			regards, tom lane
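The division of labor Tom describes -- the application declares the
parameters, the backend parses and plans once -- can be mocked up like
this (pure illustration; every name here is invented, since no such
interface existed at the time):

    class PreparedQuery:
        def __init__(self, sql, plan_query):
            self.sql = sql
            self.plan = plan_query(sql)    # parse/plan exactly once

        def execute(self, *params):
            # Only execution happens per call; parsing/planning is skipped.
            return self.plan(*params)

    def plan_query(sql):
        print("planning:", sql)            # happens once, not per call
        return lambda *params: ("rows for", sql, params)

    q = PreparedQuery(
        "SELECT shirt, color, backorder_qty FROM garments WHERE color LIKE $1",
        plan_query,
    )
    for pattern in ("BLUE%", "RED%"):
        q.execute(pattern)                 # no re-planning on these calls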
On Wed, 2002-02-27 at 00:39, Tom Lane wrote:
> I cannot believe that caching results for literally-identical queries
> is a win, except perhaps for the most specialized (read brain dead)
> applications.

According to MySQL: "The query cache is extremely useful in an
environment where (some) tables don't change very often and you have a
lot of identical queries. This is a typical situation for many web
servers that use a lot of dynamic content."

Would people agree with the MySQL guys on this? In particular, that this
is a "typical situation" for many webapps?

> Has anyone looked at the details of the test case that
> MySQL uses to claim that this is a good idea?

I emailed the author of the eWeek benchmarks asking for more
information. My guess is that the benchmark results benefit from the
query cache because it executes exactly the same query over and over
again (e.g. selecting the same product from the database every time, not
simulating typical user behavior by selecting random products). If that
is the case, the results are clearly irrelevant.

> Has it got any similarity to your own usage patterns?

I would be very interested to hear about this as well.

> Now, there are notions of "prepared statements" in many access APIs
> that fit this description, and in fact the underlying capability exists
> in the backend --- we've just not gotten around to building the
> interfaces to tie it all together. *That* would be worth working on.

Okay, I'll take a look at this...

Cheers,

Neil

--
Neil Conway <neilconway@rogers.com>
PGP Key ID: DB3C29FC
> According to MySQL: "The query cache is extremely useful in an
> environment where (some) tables don't change very often and you have a
> lot of identical queries. This is a typical situation for many web
> servers that use a lot of dynamic content."
>
> Would people agree with the MySQL guys on this? In particular, that
> this is a "typical situation" for many webapps?

Hmmm. We have a lot of repeated _parameterised_ queries, but the
recurrence of identical queries is quite small. It'd be an interesting
thing to try and measure.

> > Now, there are notions of "prepared statements" in many access APIs
> > that fit this description, and in fact the underlying capability
> > exists in the backend --- we've just not gotten around to building
> > the interfaces to tie it all together. *That* would be worth working
> > on.
>
> Okay, I'll take a look at this...

This is the more general solution, compared to MySQL's query cache --
and it can speed up parameterised queries as well as identical
queries...

Chris
On Tue, 26 Feb 2002 15:20:17 PST, "Dann Corbit" wrote:
> -----Original Message-----
> From: Neil Conway [mailto:nconway@klamath.dyndns.org]
<snip>
> Here is where problems come in...
> -- Someone wants blue and blue-green, etc shirts that are backordered
> SELECT shirt, color, backorder_qty FROM garments WHERE color like
> 'BLUE%'
>
> Now, another query comes along:
>
> -- Someone else wants reddish, etc shirts that are backordered:
> SELECT shirt, color, backorder_qty FROM garments WHERE color like 'RED%'
>
> It's the same query with different data. Without parameter markers you
> will never know it. And yet this is exactly the sort of caching that is
> useful.

While the cache that MySQL implemented may be of limited value for
dynamic queries, it would be very useful for many, many queries. For
example, in many applications (especially the ones that I write), there
are literally hundreds of selects against static "lookup" tables that
maintain acceptable values for new data inputs (i.e., validated using
referential integrity). Each and every one of those selects would
receive an improvement.

Also, it is relatively easy to "parameterize" the select. After all, it
is a structured query. Just by parsing it, you can identify the
"parameters". As a matter of fact, I'm pretty certain that this is
already done by the optimizer when breaking the SQL down into the code
to perform the query. The trick is to cache the "parameterized" version
of the query. This is likely at or near the end of the optimizer
processing.

Also, you would want to capture the query plan in the cache. The query
plan is not going to be interested at all in the literal value of the
parameters and therefore will be the same for any query of the same
form. For example, from above:

SELECT shirt, color, backorder_qty FROM garments WHERE color like 'BLUE%'

should become something on the order of:

SELECT shirt, color, backorder_qty FROM garments WHERE color like '{param0}%'

The next query:

SELECT shirt, color, backorder_qty FROM garments WHERE color like 'RED%'

should also be "parameterized" on the order of:

SELECT shirt, color, backorder_qty FROM garments WHERE color like '{param0}%'

A lookup into the query hash would match, and therefore _at least_ the
same query plan can be used.

The commercial database "Cache" by Intersystems uses an approach similar
to this. The performance of that database is phenomenal. (Of course,
there is more going on in the internals of that system than just the
query cache.)
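Mechanically, the text-level parameterization described here could be
sketched as follows (illustration only; a real implementation would work
from the parse tree, would handle numeric literals and escaping, and --
note -- this naive version swallows the '%' into the parameter, which is
exactly the subtlety Tom Lane raises next):

    import re

    # Replace single-quoted string literals with numbered placeholders.
    LITERAL = re.compile(r"'((?:[^']|'')*)'")

    def parameterize(sql):
        """Return (template, literals) so queries of the same form share a key."""
        literals = []
        def repl(m):
            literals.append(m.group(1))
            return "'{param%d}'" % (len(literals) - 1)
        return LITERAL.sub(repl, sql), literals

    template, params = parameterize(
        "SELECT shirt, color, backorder_qty FROM garments WHERE color like 'BLUE%'"
    )
    # template: "... WHERE color like '{param0}'"
    # params:   ["BLUE%"]  -- the wildcard ends up inside the parameter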
F Harvell <fharvell@fts.net> writes:
> The query plan is not going to be interested at all in
> the literal value of the parameters and therefore will be the same for
> any query of the same form.

Unfortunately, this is completely false.

> For example, from above:
>
> SELECT shirt, color, backorder_qty FROM garments WHERE color like
> 'BLUE%'
>
> should become something on the order of:
>
> SELECT shirt, color, backorder_qty FROM garments WHERE color like
> '{param0}%'

You managed to pick an example that's perfectly suited to demolish your
assertion. The query with "color like 'BLUE%'" can be optimized into an
indexscan (using index quals of the form "color >= 'BLUE' and color <
'BLUF'"), at least in C locale. The parameterized query cannot be
optimized at all, because the planner cannot know whether the
substituted parameter string will provide a left-anchored pattern.
What if param0 contains '_FOO' at runtime? An indexscan will be
useless in that case.

In general, Postgres' query plans *do* depend on the values of
constants, and it's not always possible to produce an equally good plan
that doesn't assume anything about constants. This is why I think it's
a lousy idea for the system to try to automatically abstract a
parameterized query plan from the actual queries it sees. On the other
hand, an application programmer will have a very good idea of which
parts of a repeated query are really constant and which are parameters.
So what we really need is preparable parameterized queries, wherein the
application tells us what to parameterize, rather than having to guess
about it.

			regards, tom lane
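For reference, the LIKE-to-range rewrite Tom describes can be sketched
like this (simplified; the real planner logic is far more careful about
locales and multi-byte encodings):

    def like_prefix_bounds(pattern):
        """Turn a left-anchored LIKE pattern into index-range bounds.

        'BLUE%' -> ('BLUE', 'BLUF'): usable as  color >= lo AND color < hi.
        Returns None when the pattern is not left-anchored (e.g. '_FOO'),
        which is why the planner must see the constant to decide.
        """
        prefix = ""
        for ch in pattern:
            if ch in "%_":         # first wildcard ends the fixed prefix
                break
            prefix += ch
        if not prefix:             # no fixed prefix: an indexscan is useless
            return None
        # Increment the last character for an exclusive upper bound
        # (assumes single-byte, C-locale-style ordering).
        return prefix, prefix[:-1] + chr(ord(prefix[-1]) + 1)

    print(like_prefix_bounds("BLUE%"))  # ('BLUE', 'BLUF')
    print(like_prefix_bounds("_FOO"))   # None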
> In general, Postgres' query plans *do* depend on the values of
> constants, and it's not always possible to produce an equally good plan
> that doesn't assume anything about constants. This is why I think it's
> a lousy idea for the system to try to automatically abstract a
> parameterized query plan from the actual queries it sees. On the other
> hand, an application programmer will have a very good idea of which
> parts of a repeated query are really constant and which are parameters.
> So what we really need is preparable parameterized queries, wherein the
> application tells us what to parameterize, rather than having to guess
> about it.

I think we could store the constants that went with the saved plan and
re-use the plan if the new constants were _similar_ to the old ones.
(Of course, figuring out _similar_ is the trick here.)

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026
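One speculative reading of "similar" (nothing like this exists in the
tree; the heuristic here is invented purely to make the idea concrete):
reuse a plan only when the new constant preserves the structural property
the old plan relied on, e.g. left-anchoredness of a LIKE pattern:

    def similar_like_constants(old, new):
        """Two LIKE patterns are 'similar' enough to share a plan if both
        are left-anchored, or both are not (invented heuristic)."""
        def anchored(p):
            return bool(p) and p[0] not in "%_"
        return anchored(old) == anchored(new)

    similar_like_constants("BLUE%", "RED%")   # True: both can use an indexscan
    similar_like_constants("BLUE%", "_FOO%")  # False: replan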
On Wed, Feb 27, 2002 at 12:39:16AM -0500, Tom Lane wrote:
> I cannot believe that caching results for literally-identical queries
> is a win, except perhaps for the most specialized (read brain dead)

I don't think they are brain dead. Well, that is, at first I thought so
too, but thinking about it some more, it made sense. After all, MySQL is
used mostly for web pages, and even your dynamic content doesn't change
that often. But in between there are thousands of concurrent accesses
that all execute the very same statement.

This feature makes no sense IMO for the "normal" use we both probably
had in mind when first reading, but for this web usage I see a benefit,
if it's implementable.

Michael
--
Michael Meskes
Michael@Fam-Meskes.De
Go SF 49ers! Go Rhein Fire!
Use Debian GNU/Linux! Use PostgreSQL!
On Wed, 27 Feb 2002 16:24:45 EST, Tom Lane wrote:
> F Harvell <fharvell@fts.net> writes:
> > The query plan is not going to be interested at all in
> > the literal value of the parameters and therefore will be the same
> > for any query of the same form.
>
> Unfortunately, this is completely false.
<snip>
> The parameterized query cannot be optimized at all, because the planner
> cannot know whether the substituted parameter string will provide a
> left-anchored pattern. What if param0 contains '_FOO' at runtime? An
> indexscan will be useless in that case.

Thanks for the feedback. In the example that was used, it is important
to note that {param0} was the string literal "BLUE" and not the %
"operator". This IMHO ties the query to a left-anchored pattern. I
certainly do not think that the "parameter" can be anything but a
literal. Functions and operators would very likely affect any query
plan.

Is it true that the optimizer manipulates the literal? It would seem
that that would require a huge amount of processing (due to character
sets, etc.). It would appear more viable to use a simpler optimization
that does not manipulate the literal, such as an index qual of the form
"color{0,4} == 'BLUE'", than to generate a range comparison. Of course,
this is a very simple query and I am likely missing a critical concept.

Thanks,
F Harvell
On Wed, 2002-02-27 at 18:39, Tom Lane wrote:
> I cannot believe that caching results for literally-identical queries
> is a win, except perhaps for the most specialized (read brain dead)
> applications.

We have an application which we developed, to which we specifically
added query caching of the type being discussed. This was added within
the application, rather than within the database, and provided huge
gains for the client.

One difficulty is in knowing when the results of the exact same query
will result in a different result set (i.e. invalidating your cached
result sets), but as far as I can see this is just as knowable at the
database level as it was for us in the application code.

If this were available I wouldn't want it applied to _all_ queries,
however -- I'd want to be able to enable it against particular tables
and/or queries (po: a 'WITH RESULTCACHE' option on the end of a SELECT)
where I knew there was going to be a dramatic win.

The reason there was such a large win for our application was that when
people go to a job website on Monday morning to look for a better place
to work, there is a strong tendency for them to be searching for much
the same sort of thing (as each other) -- especially if the site is
getting in the millions of page views.

The application I talk about here was written in Progress (WebSpeed),
and we have found it much harder to do this sort of thing more recently
in Oracle or PostgreSQL.

This would be nice.

Regards,
Andrew.
--
--------------------------------------------------------------------
Andrew @ Catalyst .Net.NZ Ltd, PO Box 11-053, Manners St, Wellington
WEB: http://catalyst.net.nz/        PHYS: Level 2, 150-154 Willis St
DDI: +64(4)916-7201  MOB: +64(21)635-694  OFFICE: +64(4)499-2267
Are you enrolled at http://schoolreunions.co.nz/ yet?
If you were to expire the cache when the table was updated it wouldn't
be so bad, and you certainly don't want to cache everything under the
assumption something may come along -- as it probably won't.

But if you're interested in speeding this up, I can't see any harm in
adding a stat into statistics that stores query frequency. If the
frequency goes above a certain notch (without the results changing --
static table data), then cache results.

However, I think 99% of these cases could be easily rectified with a PHP
interface which caches the results into memory at the user's request, as
I really don't see hundreds of machines making the same queries
frequently -- rather 1 or 2 poorly written ones :)

Then again, I regularly hit one-row database tables which act as markers
for data changes, to see if the program should reload the entire data
set. (Parallel machines for interface purposes normally cache table data
in RAM, but test for changes with every request -- the goal was that
they shouldn't have to communicate with each other.) But a 1-row table
is very quick to select from. Can this make it faster? If so, it'll be
useful.

Anyway, write it up. If you can speed up some queries without making a
hit against others, it'll be a large advantage.

--
Rod Taylor

This message represents the official view of the voices in my head

----- Original Message -----
From: "Neil Conway" <nconway@klamath.dyndns.org>
To: "Dann Corbit" <DCorbit@connx.com>
Cc: <pgsql-hackers@postgresql.org>
Sent: Tuesday, February 26, 2002 6:42 PM
Subject: Re: [HACKERS] eWeek Poll: Which database is most critical to your

> Perhaps I wasn't clear. The feature I'm proposing is this:
<snip>
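Rod's one-row marker-table trick might look roughly like this on the
client side (a sketch; the data_version table and the driver callables
are invented for illustration):

    # Client-side cache keyed on a one-row "marker" table: writers bump a
    # counter on every change, readers reload only when it moves.
    cached_rows = None
    cached_version = None

    def get_lookup_data(query_one, query_all):
        """query_one/query_all stand in for a real driver's fetch calls."""
        global cached_rows, cached_version
        version = query_one("SELECT version FROM data_version")  # cheap 1-row hit
        if cached_rows is None or version != cached_version:
            cached_rows = query_all("SELECT * FROM lookup_table")  # full reload
            cached_version = version
        return cached_rows

    # Writers bump the marker in the same transaction as the data change:
    #   UPDATE data_version SET version = version + 1;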
On Wed, 2002-02-27 at 10:39, Tom Lane wrote:
> Neil Conway <nconway@klamath.dyndns.org> writes:
> > When processing INSERTs, UPDATEs and DELETEs, check whether the query
> > would affect any of the tables for which we are maintaining this
> > cache. If so, flush the cache. This ensures that we will never
> > return invalid results.
>
> Note that this would imply that the cache is *global* across all
> backends; therefore it is a shared data structure and hence an access
> bottleneck. (Not to mention a memory-management headache, since the
> size of shared memory can't easily be varied on-the-fly.)

I think that it would be enough if the hashes and bookkeeping info
(tmin, tmax, filename) were in a global table. We could also purge all
data more than a few minutes old. We also need an inverse lookup from
changed table to cached query, for cache invalidation on
insert/update/delete.

The result could even be saved in temp files and still be mostly faster
than doing the full parse/plan/execute, both for complex queries
returning a few rows (saves planning time) and for queries returning
many rows (saves execute time). The format used for saving should be
the exact wire protocol, so that efficient system calls could be used
where available (Linux 2.4+ has a system call that will transfer a
whole file to a socket in one call, bypassing all copying and caching).

The lookup part will be pretty trivial -- look up using the hash, check
tmin/tmax, and if ok, push the cached result out to the client. This
will make us as fast as or faster than MySQL for the
trivial_benchmark/busy_website case.

The cache creation/maintenance part will be much trickier. When creating
the cache:

* the tables affected can be determined only from fully built plans,
  because of possible rule expansions.
* if there is a trigger on select for this query, it can't be cached.
* put some temporary insert/update/delete triggers on all real tables
  used in the query that will invalidate the cache -- as an alternative
  we could always run the invalidate-query-cache code for the affected
  table on insert/update/delete, if exact caching is enabled.
* invalidate the cache on schema changes.
* run a periodic check and invalidate old cache entries.

Some of the above could also be needed for caching query plans.

> I cannot believe that caching results for literally-identical queries
> is a win, except perhaps for the most specialized (read brain dead)
> applications.

Actually, a web app that looks up the contents of a 5-row combobox is
not really brain-dead. Doing all the caching (and cache invalidation) on
the client side is hard and fragile -- what happens when someone adds a
trigger in the backend?

> Has anyone looked at the details of the test case that
> MySQL uses to claim that this is a good idea? Has it got any similarity
> to your own usage patterns?

Yes -- for content management / web apps. No -- for bean-counting apps.

> We have talked about caching query plans for suitably-parameterized
> queries, but even that seems unlikely to be a huge win; at least I'd
> not think it useful to try to drive the cache totally automatically.
> If an application could say "here's a query I expect to use a lot,
> varying these specific parameters" then caching a plan for that would
> make sense.
>
> Now, there are notions of "prepared statements" in many access APIs
> that fit this description, and in fact the underlying capability exists
> in the backend --- we've just not gotten around to building the
> interfaces to tie it all together. *That* would be worth working on.

Sure. It would

a) make many queries faster
b) make client libs (ODBC/JDBC/ECPG) faster and simpler by not forcing
   them to fake it.

But there is also a big class of applications that would benefit much
more from caching exact queries. And it will make us as fast as MySQL
for 100000 consecutive calls of SELECT MAX(N) FROM T ;)

---------------
Hannu
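The Linux system call Hannu alludes to is sendfile(2). Pushing a cached,
wire-format result file straight to the client socket could look like
this (a sketch; the cache-file layout and names are assumed, not real):

    import os

    def send_cached_result(sock, cache_path):
        """Copy a cached wire-format result to the client socket with
        sendfile(2), bypassing userspace copies entirely."""
        fd = os.open(cache_path, os.O_RDONLY)
        try:
            size = os.fstat(fd).st_size
            offset = 0
            while offset < size:
                # os.sendfile returns the number of bytes actually sent
                offset += os.sendfile(sock.fileno(), fd, offset, size - offset)
        finally:
            os.close(fd)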