Thread: Again, sorry, caching.

Again, sorry, caching.

From
mlw
Date:
I traded a couple of emails with a guy using one of my open source projects. To
make a long story short, he is moving to the new version of MySQL for his
website because of the new caching feature. He is convinced that it will speed
up his web site, and he is probably right.

On a web site, a few specific queries get executed, unchanged, repeatedly.
Think about an ecommerce site: most of the time it is just a handful of basic
queries. These basic queries are usually against pretty large product tables. A
caching mechanism would make these queries pretty lightweight.

The arguments against caching:

"It is an application issue"
This is completely wrong. Caching cannot be done against a database without
knowledge of the database, i.e. knowing when the data changes.

"If it is mostly static data, why not just make it a static page?"
Because a static page is a maintenance nightmare. One uses a database in a web
site to allow content to be changed and upgraded dynamically and with a minimum
of work.

"It isn't very useful"
I disagree completely. A cache of most frequently used queries, or specific
ones, could make for REALLY good performance in some very specific, but very
common, applications. Any system that has a hierarchical "drill down" interface
to a data set, ecommerce, libraries, document management systems, etc. will
greatly benefit from a query cache.

I was thinking that it could be implemented as a keyword or comment in a query.
Such as:

select * from table where column = 'foo' cacheable
or
select * from table where column = 'bar' /* cacheable */

Either way, it would speed up a lot of common application types. It would even
be very cool if you could just cache the results of subqueries, such as:

select * from (select * from table where col1 = 'foo' cacheable) as subset
where subset.col2 = 'bar' ;

Which would mean that the subquery gets cached, but the greater select need not
be. The cache could be like a global temp table. Perhaps the user could even
name the cache entry:

select * from table where column = 'foo' cache on foo

Where one could also do:

select * from cache_foo

Using a keyword is probably the better idea: it can be picked up by the parser
and instruct PostgreSQL to use the cache; otherwise there will be no additional
overhead.

Having caching within PostgreSQL will be good for data integrity. Application
caches can't tell when an update/delete/insert happens; they often have to use
a time-out mechanism.

OK, let me have it, tell me how terrible an idea this is. Tell me how wrong I
am.


Re: Again, sorry, caching.

From
Greg Copeland
Date:
I previously replied to you, vaguely describing a way you could implement
this by using a combination of client-side caching and database tables
and triggers to allow you to determine if your cache is still valid.
Someone came right behind me, Tom maybe??, and indicated that the
proper/ideal way to do this would be to use postgres' asynchronous
database notification mechanisms (LISTEN/NOTIFY, I believe, were the
semantics) to alert your application that your cache has become
invalid.  Basically, a couple of triggers and the use of the listen/notify
model, and you should be all set.
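
In case it helps, here is roughly what that wiring looks like (a minimal
sketch in present-day PostgreSQL syntax; the products table and channel
name are made up for illustration):

-- hypothetical table whose result sets the client wants to cache
create table products (id integer primary key, name text, price numeric);

-- fire a notification whenever the table changes
create function products_changed() returns trigger as $$
begin
    notify products_dirty;   -- wake any listening cache managers
    return null;             -- AFTER trigger: return value is ignored
end;
$$ language plpgsql;

create trigger products_cache_inval
    after insert or update or delete on products
    for each statement execute procedure products_changed();

-- in the client, once per connection:
listen products_dirty;
-- on each notification, drop any locally cached result sets that touch
-- "products" and re-run those queries on next use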

Done properly, a client-side cache which is asynchronously notified by
the database when its contents become invalid should be faster than
relying on MySQL's database caching scheme.  Basically, a strong client-side
cache is going to prevent your database from even having to return
a cached result set, while a database-side cache is always going to
return a result set.  Of course, one of the extra cool things you can do
is to cache a gzip'd copy of the data contents, which would further act
as an optimization, preventing the client or web server (in case they are
different) from having to recompress every result set.

In the long run, again, if properly done, you should be able to beat
MySQL's implementation without too much extra effort.  Why?  Because a
client-side cache can be much smarter in the way that it uses its
cached contents, much in the same way an application is able to better
cache its data than the file system is able to do.  This is why a
client-side cache should be preferred over a database result set
cache.

Greg

References:
http://www.postgresql.org/idocs/index.php?sql-notify.html
http://www.postgresql.org/idocs/index.php?sql-listen.html




Re: Again, sorry, caching.

From
mlw
Date:
Triggers and asynchronous notification are not substitutes for real, hard,
ACID-compliant caching. The way you suggest implies only one access model.
Take the notion of a library: it has both web and application access. These
should both be able to use the cache.

Also, your suggestion does not address the sub-select case, which I think is
a much bigger performance win, and more efficient than MySQL's cache.

This whole discussion could be moot, and this could be developed as an
extension, if there were a function API that could return sets of whole rows.





Re: Again, sorry, caching.

From
Greg Copeland
Date:
On Sat, 2002-03-16 at 08:01, mlw wrote:
[snip]

> "If it is mostly static data, why not just make it a static page?"
> Because a static page is a maintenance nightmare. One uses a database in a web
> site to allow content to be changed and upgraded dynamically and with a minimum
> of work.
>


Oh yeah, I forgot to reply to that part.  I think you are forgetting
that you can use a database to generate a static page.  That is, only
regenerate the static page when the data within the database changes.
Again, this is another example of efficient application caching.  If you
have an application which listens for your cache invalidation event, you
can then recreate your static page.  Again, database result set caching
is not required.  And again, this should be significantly faster than
MySQL's result set caching.  It's also worth noting that you could then
gzip your static page (keeping both static pages -- compressed and
uncompressed), resulting in yet another optimization for most web servers
and browsers.

Greg


Re: Again, sorry, caching.

From
Greg Copeland
Date:
On Sat, 2002-03-16 at 08:36, mlw wrote:
> Triggers and asynchronous notification are not substitutes for real, hard,
> ACID-compliant caching. The way you suggest implies only one access model.
> Take the notion of a library: it has both web and application access. These
> should both be able to use the cache.
>

Well, obviously, you'd need to re-implement the client side cache in
each implementation of the client.  That is a down side and I certainly
won't argue that.  As for the "no substitute" comment, I guess I'll
plead ignorance, because I'm not sure what I'm missing here.  What am I
missing that would not be properly covered by that model?

> Also, your suggestion does not address the sub-select case, which I think is
> a much bigger performance win, and more efficient than MySQL's cache.

I'm really not sure what you mean by that.  Doesn't address it but is
more efficient?  Maybe it's because I've not had my morning coffee
yet... ;)

>
> This whole discussion could be moot, and this could be developed as an
> extension, if there were a function API that could return sets of whole rows.
>

Maybe...but you did ask for feedback.  :)

Greg




Re: Again, sorry, caching.

From
Stephan Szabo
Date:
> I was thinking that it could be implemented as a keyword or comment in a query.
> Such as:
>
> select * from table where column = 'foo' cacheable
> or
> select * from table where column = 'bar' /* cacheable */


> Having caching within PostgreSQL will be good for data integrity. Application
> caches can't tell when an update/delete/insert happens; they often have to use
> a time-out mechanism.
>
> OK, let me have it, tell me how terrible an idea this is. Tell me how wrong I
> am.

I don't think it's a bad idea, but a cache that takes a query string (or
subquery string) and looks for a match based on that is flawed without
special consideration of non-cacheable functions and constructs
(CURRENT_USER, things that depend on timezone, things that depend on
datestyle). We'd also need to work out an appropriate mechanism to deal
with cache invalidation and adding things to the cache.
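
For example (illustrative queries only), all of these have results that the
query text alone does not determine, so a string-keyed cache would have to
treat them specially:

select current_user;           -- differs per connected role
select now();                  -- differs per transaction
select '12/03/2002'::date;     -- result depends on the DateStyle setting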



Re: Again, sorry, caching.

From
mlw
Date:
Andrew Sullivan wrote:
> 
> On Sat, Mar 16, 2002 at 09:01:28AM -0500, mlw wrote:
> 
> > "If it is mostly static data, why not just make it a static page?"
> > Because a static page is a maintenance nightmare. One uses a
> > database in a web site to allow content to be changed and upgraded
> > dynamically and with a minimum of work.
> 
> This seems wrong to me.  Why not build an extra bit of functionality
> so that when the admin makes a static-data change, the new static
> data gets pushed into the static files?
> 
> I was originally intrigued by the suggestion you made, but the more I
> thought about it (and read the arguments of others) the more
> convinced I became that the MySQL approach is a mistake.  It's
> probably worth it for their users, who seem not to care that much
> about ACID anyway.  But I think for a system that really wants to
> play in the big leagues, the cache is a big feature that requires a
> lot of development, but which is not adequately useful for most
> cases.  If we had infinite developer resources, it might be worth it.
> In the actual case, I think it's too low a priority.

Again, I can't speak to priority, but I can name a few common applications
where caching would be a great benefit. The more I think about it, the more I
like the idea of a 'cacheable' keyword in the select statement.

My big problem with putting the cache outside of the database is that it is now
incumbent on the applications programmer to write a cache. A database should
manage the data, the application should handle how the data is presented.
Forcing the application to implement a cache feels wrong.


Re: Again, sorry, caching.

From
mlw
Date:
Greg Copeland wrote:
> 
> On Sat, 2002-03-16 at 08:36, mlw wrote:
> > Triggers and asynchronous notification are not substitutes for real, hard,
> > ACID-compliant caching. The way you suggest implies only one access model.
> > Take the notion of a library: it has both web and application access. These
> > should both be able to use the cache.
> >
> 
> Well, obviously, you'd need to re-implement the client side cache in
> each implementation of the client.  That is a down side and I certainly
> won't argue that.  As for the "no substitute" comment, I guess I'll
> plead ignorance, because I'm not sure what I'm missing here.  What am I
> missing that would not be properly covered by that model?

It would not be guaranteed to be up to date with the state of the database. By
implementing the cache within the database, PostgreSQL could maintain the
consistency.

> 
> > Also, your suggestion does not address the sub-select case, which I think is
> > a much bigger performance win, and more efficient than MySQL's cache.
> 
> I'm really not sure what you mean by that.  Doesn't address it but is
> more efficient?  Maybe it's because I've not had my morning coffee
> yet... ;)

If an internal caching system could be implemented within PostgreSQL (and
trust me, I understand what a hairball it would be with multiversion
concurrency), it could handle complex queries such as:

select * from (select * from mytable where foo = 'bar' cacheable) as subset
where subset.col = 'value'

The 'cacheable' keyword applied to the query would mean that PostgreSQL could
keep that result set handy for later use. If that subselect always forces a
table scan of mytable, no one can deny that this subquery caching could be a
huge win.

As a side note, I REALLY like the idea of a keyword for caching as opposed to
automated caching. It would allow the DBA or developer more control over
PostgreSQL's behavior, and potentially make the feature easier to implement.

> 
> >
> > This whole discussion could be moot, and this could be developed as an
> > extension, if there were a function API that could return sets of whole rows.
> >
> 
Currently a function can only return one value, or a 'setof' a single type,
implemented as one function call for each entry in the set. If there were a
function interface which could return a row, and multiple rows similar to the
'setof' return, that would be very cool. That way caching could be implemented
as:

select * from pgcache('select * from mytable where foo = ''bar''') as subset
where subset.col = 'value';


Re: Again, sorry, caching.

From
mlw
Date:
I think the notion that this data should be managed outside of the database is
bogus. Query caching can improve performance in some specific, but popular,
scenarios. Saying it does not belong within the database and is the job of the
application is like saying file caching is not the job of the file system but
of the application.

This is functionality many users want, and it can be justified by some very
specific, but very common, scenarios. It is not for me to say whether it is
worth the work, or whether it should be done. But from the perspective of the
user, having this capability within the database is an important feature, and
that is the argument I want to make.



Re: Again, sorry, caching.

From
Karel Zak
Date:
On Sat, Mar 16, 2002 at 09:01:28AM -0500, mlw wrote:

> "If it is mostly static data, why not just make it a static page?"
> Because a static page is a maintenance nightmare. One uses a database in a web
> site to allow content to be changed and upgraded dynamically and with a minimum
> of work.
It's an ugly argument for a DB cache. Why not generate the web page after a data change and use it as a static page from then on?

> I was thinking that it could be implemented as a keyword or comment in a query.
> Such as:
> 
> select * from table where column = 'foo' cacheable
You can insert the "mostly static data" into a temp table and use this temp
table in subsequent queries. After an update/delete/insert, your application
can rebuild the temp table (or a trigger can?).
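
Concretely, something like this (a sketch; the books table and its columns
are hypothetical):

select * into temp table scifi_books
    from books where genre = 'scifi';

-- subsequent queries hit the small temp table instead of the big one:
select * from scifi_books where author_id = 42;

-- after books changes, the application (or a trigger) rebuilds it:
drop table scifi_books;
select * into temp table scifi_books
    from books where genre = 'scifi';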
 
       Karel

-- 
Karel Zak <zakkr@zf.jcu.cz>
http://home.zf.jcu.cz/~zakkr/
C, PostgreSQL, PHP, WWW, http://docs.linux.cz, http://mape.jcu.cz


Re: Again, sorry, caching.

From
mlw
Date:
Karel Zak wrote:
> 
> On Sat, Mar 16, 2002 at 09:01:28AM -0500, mlw wrote:
> 
> > "If it is mostly static data, why not just make it a static page?"
> > Because a static page is a maintenance nightmare. One uses a database in a web
> > site to allow content to be changed and upgraded dynamically and with a minimum
> > of work.
> 
>  It's an ugly argument for a DB cache. Why not generate the web page after a
>  data change and use it as a static page from then on?
> 
> > I was thinking that it could be implemented as a keyword or comment in a query.
> > Such as:
> >
> > select * from table where column = 'foo' cacheable
> 
>  You can insert the "mostly static data" into a temp table and use this temp
>  table in subsequent queries. After an update/delete/insert, your application
>  can rebuild the temp table (or a trigger can?).

Yes, I could, as could most of the guys reading these messages. I am thinking
about a feature in PostgreSQL that would make that easier for the average DBA
or web producer.

Let's face it: MySQL wins a lot of people because they put in features that
people want. All the ways people have suggested to "compete" with MySQL's
caching have been ugly kludges.

I understand that there is an amount of work involved in doing caching, and
the value of caching is debatable to some; however, it is demonstrable that
caching can improve a very common, albeit specific, set of deployments. Also,
managing data is the job of the database, not the application. It does belong
in PostgreSQL; if someone is forced to write a caching scheme around
PostgreSQL, it is because PostgreSQL lacks that feature.


Re: Again, sorry, caching.

From
Jean-Michel POURE
Date:
On Monday 18 March 2002 13:23, mlw wrote:
> Let's face it: MySQL wins a lot of people because they put in features that
> people want.

MySQL is very interested in benchmarks.
It does not really care for data consistency.

Cheers, Jean-Michel POURE


Re: Again, sorry, caching.

From
Greg Copeland
Date:
Yes.  EVERY person I've ever known who runs MySQL does so for two
very simple reasons.  First, they believe it to be wicked fast.  Second,
they don't understand what ACID is, what a transaction is, or why
running a single session against a database to perform a benchmark is a
completely bogus concept.  In case it's not obvious, these are usually
people who are trying to take a step up from Access.  While I do
believe MySQL, from a performance perspective, is a step up from Access,
I always tell my clients... if you wouldn't use an Access database for
this project, you shouldn't use MySQL either.

To me, this means we need better advertising, PR, and education rather
than a result set cache.  :P

Speaking of which, I'm wondering if there are any design patterns we can
look at which would address client side caching...well, at least make it
easier to implement as well as implement it in a consistent manner.

Greg




Re: Again, sorry, caching.

From
mlw
Date:
Jean-Michel POURE wrote:
> 
> On Monday 18 March 2002 13:23, mlw wrote:
> > Let's face it: MySQL wins a lot of people because they put in features that
> > people want.
> 
> MySQL is very interested in benchmarks.
> It does not really care for data consistency.

In no way am I suggesting we avoid ACID compliance. In no way am I suggesting
that PostgreSQL change. All I am suggesting is that tables which change
infrequently can and should be cached.

select * from table where foo = 'bar'

Need not be executed twice if the table has not changed. 

select * from table1, (select * from table2 where foo='bar' cacheable) as
subset where subset.col1 = table1.col1;

In the above query, if table2 changes 4 times a day and is queried a couple of
times a minute or second, caching the subset could save a huge amount of disk
I/O.

This sort of query could improve many catalog based implementations, from
music, to movies, to books. A library could implement a SQL query for book
lookups like this:

select * from authors, (select * from books where genre = 'scifi' cacheable) as
subset where authors.id = subset.authorid and authors.id in (....)

Yes, it is arguable that index scans may work better and, obviously, summary
tables may help, etc., but imagine a more complex join which produces fewer
records but is executed frequently. Caching could help the performance of
PostgreSQL in some very real applications.

MySQL's quest for benchmarking numbers, I agree, is shameful because they
create numbers which are not really applicable in the real world. This time,
however, I think they may be on to something.

(1) PostgreSQL uses a "cacheable" or "iscacheable" keyword.
(2) If the query uses functions which are not marked as "iscacheable," then it
is not cached.
(3) If any table contained within the cacheable portion of the query is
modified, the cache is marked as dirty.
(4) No provisions are made to recreate the cache after an insert/update/delete.
(5) The first query marked as "iscacheable" that encounters a "dirty" flag in a
table does an exhaustive search on the cache and removes all entries that are
affected.


As far as I can see, if the above parameters are used to define caching, it
could improve performance on sites where a high number of transactions are
made and where there is also a large amount of static data, i.e. an ecommerce
site, library, etc. If the "iscacheable" keyword is not used, PostgreSQL will
not incur any performance degradation. However, if the "iscacheable" keyword
is used, the performance loss could very well be made up for by the benefits
of caching.
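
To make the bookkeeping concrete, here is a purely illustrative sketch of the
state such a scheme implies (no such catalog exists in PostgreSQL today):

-- one entry per cached query; rule (3) flips is_dirty when a source
-- table is written, and rule (5) lets the next "iscacheable" query
-- purge dirty entries before using the cache
create table query_cache (
    query_text    text primary key,  -- the normalized query string
    source_tables name[] not null,   -- tables the result set depends on
    is_dirty      boolean not null default false,
    result_set    bytea              -- the materialized rows, serialized
);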


Re: Again, sorry, caching.

From
mlw
Date:
"Mattew T. O'Connor" wrote:
> 
> > My big problem with putting the cache outside of the database is that it is
> > now incumbent on the applications programmer to write a cache. A database
> > should manage the data, the application should handle how the data is
> > presented. Forcing the application to implement a cache feels wrong.
> 
> I believe someone suggested a possible solution that was in the pg client
> using NOTICE and triggers.  The argument given against it was that
> it would not be ACID compliant.  I say, who cares.  I would think that the
> "select cachable" would only be allowed for simple selects; it would not be
> used for select for update or anything else.  Anytime you are given the
> result of a simple select, you are not guaranteed that the data won't change
> underneath you.

Not true. If you begin a transaction, you can be isolated from changes made to
the database.

> The primary use that you have suggested is for web sites,
> and they certainly won't mind if the cache is 0.3 seconds out of date.

Again, if they don't care about accuracy, then they will use MySQL. PostgreSQL
is a far better system. Making PostgreSQL less accurate, less "correct" takes
away, IMHO, the very reasons to use it.


Re: Again, sorry, caching.

From
Greg Copeland
Date:
On Mon, 2002-03-18 at 08:15, mlw wrote:
> "Mattew T. O'Connor" wrote:
> >
[snip]

>
> > The primary use that you have suggested is for web sites,
> > and they certainly won't mind if the cache is 0.3 seconds out of date.
>
> Again, if they don't care about accuracy, then they will use MySQL. PostgreSQL
> is a far better system. Making PostgreSQL less accurate, less "correct" takes
> away, IMHO, the very reasons to use it.
>

If you are running a web site and you need real-time data within 0.3s,
you've implemented on the wrong platform.  It's as simple as that.  In
the web world, there are few applications where a 0.3s window is
notable.  After all, that 0.3s window can be anywhere within the
system, including the web server, network, any front-end caches, DNS
resolution, etc.

I tend to agree with Matthew.  Granted, there are some application
domains where this can be critical... generally speaking, web serving
isn't one of them.

That's why all of the solutions I offered were pointedly addressing a
web server scenario and not a generalized database cache.  I completely
agree with you on that.  In a generalized situation, the database should
be managing and caching the data (which it already does).

Greg


Re: Again, sorry, caching.

From
mlw
Date:
Greg Copeland wrote:
> 
> On Mon, 2002-03-18 at 08:15, mlw wrote:
> > "Mattew T. O'Connor" wrote:
> > >
> [snip]
> 
> >
> > > The primary use that you have suggested is for web sites,
> > > and they certainly won't mind if the cache is 0.3 seconds out of date.
> >
> > Again, if they don't care about accuracy, then they will use MySQL. PostgreSQL
> > is a far better system. Making PostgreSQL less accurate, less "correct" takes
> > away, IMHO, the very reasons to use it.
> >
> 
> If you are running a web site and you need real-time data within 0.3s,
> you've implemented on the wrong platform.  It's as simple as that.  In
> the web world, there are few applications where a 0.3s window is
> notable.  After all, that 0.3s window can be anywhere within the
> system, including the web server, network, any front-end caches, DNS
> resolution, etc.

This is totally wrong! An out of date cache can cause errors by returning
results that are no longer valid, thus causing lookup issues. That is what ACID
compliance is all about.

> 
> I tend to agree with Matthew.  Granted, there are some application
> domains where this can be critical...generally speaking, web serving
> isn't one of them.
> 
> That's why all of the solutions I offered were pointedly addressing a
> web server scenario and not a generalized database cache.  I completely
> agree with you on that.  In a generalized situation, the database should
> be managing and caching the data (which it already does).

But it does not cache a query. An expensive query which does an index range
scan and filters by a where clause could invalidate a good number of buffers
in the buffer cache. If this query, or a number of queries like it, is
frequently repeated, verbatim, against a seldom-changed table, why not cache
them within PostgreSQL? It would improve overall performance by preserving
more blocks in the buffer cache and eliminating a number of queries being
executed.

I don't see how the applicability of caching can be argued. I can understand
objections from a time/work point of view, but to debate whether it is a
useful feature seems ludicrous.


Re: Again, sorry, caching.

From
Greg Copeland
Date:
On Mon, 2002-03-18 at 10:08, mlw wrote:
> Greg Copeland wrote:
> >
> > On Mon, 2002-03-18 at 08:15, mlw wrote:
> > > "Mattew T. O'Connor" wrote:
> > > >
[snip]

> >
> > If you are running a web site and you need real-time data within 0.3s,
> > you've implemented on the wrong platform.  It's as simple as that.  In
> > the web world, there are few applications where a 0.3s window is
> > notable.  After all, that 0.3s window can be anywhere within the
> > system, including the web server, network, any front-end caches, DNS
> > resolution, etc.
>
> This is totally wrong! An out of date cache can cause errors by returning
> results that are no longer valid, thus causing lookup issues. That is what ACID
> compliance is all about.

I understand what ACID is about.  Question: was the result set valid
when it was cached?  Yes.  So will it be valid when it's returned as a
cached result set?  Yes.  Might it be an out-of-date view?  Sure... with
a horribly small window for becoming "out of date".  Will it cause lookup
problems?  It might.  No more than what you are proposing.  In the mean
time, the FE-cached result set, performance wise, is beating the pants
off of the database-cached solution, on both a specific work load and
over all system performance.

I should point out that once the FE cache has been notified that its
contents are invalid, the FE would no longer return the invalidated result
set.  I consider that to be a given; however, from some of your comments
I get the impression that you think the invalid result set would
continue to be served.  Another way of thinking about it is... the
notification is really no different, from a validity perspective, than a
returned result set.  That is... if the notification had been the result
set returned from the database, it would be accurate (which in this case
means the FE cache is now dirty and treated as such)... and if the query
is re-run because the cache is now invalid, the result set is once again
accurate and reflective of the database.

Example...

Database cache:
  1. Query result set
  2. Result set returned (cached on database)
  3. Local change to database (result set cache invalid)
  4. New query based on out-of-date result set

Application cache:
  1. Query result set (cached)
  2. Result set returned
  3. Local change to database (app cache invalid and signaled)
  4. New query based on out-of-date result set

Both have that problem, since transactional boundaries are hard to keep
across HTTP requests.  This, again, is why a FE cache is perfectly
acceptable for *most* web application needs.  Also notice that your
margin for error is more or less the same.

[snip]

> I don't see how the applicability of caching can be argued. I can understand
> objections from a time/work point of view, but to debate whether it is a
> useful feature seems ludicrous.

I don't think I'm arguing about whether it's applicable or useful.  Rather,
I'm saying that faster results can be yielded by implementing it in the
client with far less effort than it would take to implement it in the BE.
I am arguing that its impact on overall system performance (though I
really didn't do more than just touch on this topic) is
questionable... granted, it may greatly enhance specific work loads... at
the expense of others.  Which shouldn't be too surprising, as trade-offs
of some type are pretty common.

At this point in time, I think we've both pretty well beaten this topic
up.  Obviously there are two primary ways of viewing the situation.  I
don't think anyone is saying it's a bad idea... I think everyone is
saying that it's easier to address elsewhere and that, overall, the net
returns may come at the expense of some other work loads.  So, unless
there are new pearls to be shared and gleaned, I think the topic's been
fairly well addressed.  Does more need to be said?

Greg


Re: Again, sorry, caching.

From
Neil Conway
Date:
On Sat, 2002-03-16 at 09:01, mlw wrote:
> On a web site, a few specific queries get executed, unchanged, repeatedly.
> Think about an ecommerce site: most of the time it is just a handful of basic
> queries. These basic queries are usually against pretty large product tables. A
> caching mechanism would make these queries pretty lightweight.
> 
> The arguments against caching:
> 
> "It is an application issue"
> This is completely wrong. Caching cannot be done against a database without
> knowledge of the database, i.e. knowing when the data changes.

But can't this be achieved by using a LISTEN/NOTIFY model, with
user-created rules to NOTIFY the appropriate listener when a table
changes? With a good notification scheme like this, you don't need to
continually poll the DB for changes. You don't need to teach your cache
a lot of things about the database, since most of that knowledge is
encapsulated inside the rules and supporting tables.
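
The rule side of that might look something like this (a sketch; it assumes a
hypothetical "products" table, and the channel name is made up):

create rule products_ins as on insert to products do also notify products_dirty;
create rule products_upd as on update to products do also notify products_dirty;
create rule products_del as on delete to products do also notify products_dirty;

-- the caching layer subscribes once per connection and invalidates its
-- entries whenever a notification arrives:
listen products_dirty;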

My impression (I could be wrong) is that LISTEN/NOTIFY doesn't get the
press that it deserves. If this model isn't widely used because of some 
deficiencies in the LISTEN/NOTIFY implementation, IMHO our time would be
better spent fixing those problems than implementing the proposed
caching scheme.

If we're looking to provide a "quick and easy" caching scheme for users
attracted to MySQL's query cache, why not provide this functionality
through another application? I'm thinking about a generic "caching
layer" that would sit in between Postgres and the database client. It
could speak the FE/BE protocol as necessary; it would use LISTEN/NOTIFY
to allow it to efficiently be aware of database changes; it would create
the necessary rules for the user, providing a simple interface for
enabling query caching for a table or a set of tables.

What does everyone think?

> OK, let me have it, tell me how terrible an idea this is. Tell me how wrong I
> am.

I think your goals are laudable (and I also appreciate the effort that
you and everyone else puts into Postgres); I just think we could get
most of the benefits without needing to implement potentially complex
changes to Postgres internals.

Cheers,

Neil

-- 
Neil Conway <neilconway@rogers.com>
PGP Key ID: DB3C29FC



Re: Again, sorry, caching.

From
Greg Copeland
Date:
On Mon, 2002-03-18 at 20:35, Neil Conway wrote:
[snip]

> My impression (I could be wrong) is that LISTEN/NOTIFY doesn't get the
> press that it deserves. If this model isn't widely used because of some
> deficiencies in the LISTEN/NOTIFY implementation, IMHO our time would be
> better spent fixing those problems than implementing the proposed
> caching scheme.
>
> If we're looking to provide a "quick and easy" caching scheme for users
> attracted to MySQL's query cache, why not provide this functionality
> through another application? I'm thinking about a generic "caching
> layer" that would sit in between Postgres and the database client. It
> could speak the FE/BE protocol as necessary; it would use LISTEN/NOTIFY
> to allow it to efficiently be aware of database changes; it would create
> the necessary rules for the user, providing a simple interface for
> enabling query caching for a table or a set of tables.
>
> What does everyone think?
>

Yes... I was thinking that a generic library interface with a nice design
pattern might meet this need rather well.  Done properly, I think we can
make it so that all that would be needed, more or less, is application
hooks which accept the result set to be cached and a mechanism to signal
invalidation of the current cache... obviously that's not an exhaustive
list... :)

I haven't spent much time on this, but I'm fairly sure some library
routines can be put together which would greatly reduce the effort
required for application coders to support FE data caches and still be
portable, even for the Win32 port.

Greg


Re: Again, sorry, caching.

From
Jeff Davis
Date:
> Yes...I was thinking that a generic library interface with a nice design
> pattern might meet this need rather well.  Done properly, I think we can
> make it so that all that would be needed, more or less, is application
> hooks which accept the result set to be cached and a mechanism to signal
> invalidation of the current cache....obviously that's not an exhaustive
> list... :)

A library implies that the application is running long enough to actually
hear the notification. Web apps start up, read from the database, and before
any cache is needed they're done, and the next one starts up, reading again
from the database. Only currently open connections receive the notification.

I think that you do need an entire layer... but that's not a bad thing 
necessarily. Have a daemon that stays connected for a long time and when a 
notification arrives, rewrite the cache (or mark it dirty). Other 
applications can read data from static files or shared memory, or even 
another communication socket with your daemon.

There may be some way around running a daemon, so if you have a better 
solution please let me know.

I think I am in favor of client caching in general, but "mlw" (sorry, I can't
find your real name in the emails at hand) makes some good points. The most
important one is that we don't want to change application architectures on
everyone. It's easy if you just have to add "iscachable" to a query; it's
hard if you have to start writing against a different set of routines (to
grab from your client cache rather than a database).

However, I am perfectly happy writing a client-side cache or using temp
tables to store a result set. I also don't care that much if someone chooses
PostgreSQL for their website (unless I'm responsible for its success in some
way :) ). That's me personally; if you want to attract more users from MySQL,
"iscachable" is very likely an attractive feature.

Regards,
Jeff



Re: Again, sorry, caching, (Tom What do you think: function

From
mlw
Date:
Jeff Davis wrote:
> 
> > Yes...I was thinking that a generic library interface with a nice design
> > pattern might meet this need rather well.  Done properly, I think we can
> > make it where all that, more or less, would be needed is application
> > hooks which accept the result set to be cached and a mechanism to signal
> > invalidation of the current cache....obviously that's not an exhaustive
> > list... :)
> 
> A library implies that the application is running long enough to actually
> hear the notification. Web apps start up, read from the database, and before
> any cache is needed they're done, and the next one starts up, reading again
> from the database. Only currently open connections receive the notification.
> 
> I think that you do need an entire layer... but that's not a bad thing
> necessarily. Have a daemon that stays connected for a long time and when a
> notification arrives, rewrite the cache (or mark it dirty). Other
> applications can read data from static files or shared memory, or even
> another communication socket with your daemon.
> 
> There may be some way around running a daemon, so if you have a better
> solution please let me know.
> 
> I think I am in favor of client caching in general, but "mlw" (sorry, I can't
> find your real name in the emails at hand) makes some good points. The most
> important one is that we don't want to change application architectures on
> everyone. It's easy if you just have to add "iscachable" to a query; it's
> hard if you have to start writing against a different set of routines (to
> grab from your client cache rather than a database).
> 
> However, I am perfectly happy writing a client-side cache or using temp
> tables to store a result set. I also don't care that much if someone chooses
> PostgreSQL for their website (unless I'm responsible for its success in some
> way :) ). That's me personally; if you want to attract more users from MySQL,
> "iscachable" is very likely an attractive feature.

I was thinking about this. There seems to be a consensus that caching means no
ACID compliance, and everyone seems to think it needs to be limited. We could
implement a non-ACID cache as a contrib function with some work on the
function manager.

Right now, the function manager can only return one value, or one set of values
for a column. It should be possible, though it would require a lot of research,
to enable the function manager to return a set of rows. If we could get that
working, it could be fairly trivial to implement a cache as a contrib project.
It would work something like this:

select querycache('select * from mytable where foo = ''bar''');

This does two things that I would like to see: the ability to cache subselects
independent of the full query, and the ability to control which queries get
cached.

If we can get this row functionality into the function manager for 7.3, we
could then implement MANY MANY enterprise-level functionalities (remote
queries, query caching, external tables, etc.) as contrib projects rather than
full-blown modifications to PostgreSQL.


Re: Again, sorry, caching.

From
Greg Copeland
Date:
On a side note, is it possible that we could add the "iscachable" keyword
such that, for now, it would give cache bias?  That is, allow for a
mechanism to indicate that the pages required for this query will be
frequently needed.  I've not looked at the buffer cache implementation.
Is it possible to somehow weight the corresponding pages in the cache so
that it works something like this?  Each query that does not use the
biased pages ages them.  Once the age threshold has been hit, allow the
pages to be flushed per the normal page replacement strategy.  Every time
the biased pages get a hit, renew the bias on the pages.

I'm not sure this holds water, but I'm thinking it would at least help
ensure that the pages in question are quickly available without having
to constantly re-read them from disk.

What do ya think?  Does the cache already work like this?  Is it doable?


On Tue, 2002-03-19 at 07:17, Jeff Davis wrote:
> > Yes...I was thinking that a generic library interface with a nice design
> > pattern might meet this need rather well.  Done properly, I think we can
> make it so that all that would be needed, more or less, is application
> > hooks which accept the result set to be cached and a mechanism to signal
> > invalidation of the current cache....obviously that's not an exhaustive
> > list... :)
>
> A library implies that the application is running long enough to actually
> hear the notification. Web apps start up, read from the database, and before
> any cache is needed they're done, and the next one starts up, reading again
> from the database. Only currently open connections receive the notification.

I think you misunderstood me.  My intention was the creation of a
pattern library whereby the creation of your "layer", without regard for
the implementation requirements, can be more easily achieved.  In
other words, every time you need to implement this "layer" for various
applications which address various problem domains, the library would
serve as the heart of it, reducing the amount of common code that would
otherwise have to be put in place in one form or another.  Thus my
reference to a design pattern.

It should also be noted that some fast and slick web application
architectures support some form of shared-context persistence, which
would allow caching to be implemented even in web application space.

>
> I think that you do need an entire layer... but that's not a bad thing
> necessarily. Have a daemon that stays connected for a long time and when a
> notification arrives, rewrite the cache (or mark it dirty). Other
> applications can read data from static files or shared memory, or even
> another communication socket with your daemon.

Exactly...all of which, I'm thinking, can be encompassed within a design
pattern, greatly reducing the effort required for a new "layer"
application requirement.

>
> There may be some way around running a daemon, so if you have a better
> solution please let me know.

I hadn't spent enough time thinking about it.  My initial thought was to
provide the support functionality and let the coder determine his own
route to achieve his goal.  Does he need a new daemon, or can it be built
into his application?  This is why a library is appealing.

>
> I think I am in favor of client caching in general, but "mlw" (sorry, I can't
> find your real name in the emails at hand) makes some good points. The most
> important one is that we don't want to change application architectures on
> everyone. It's easy if you just have to add "iscachable" to a query; it's
> hard if you have to start writing against a different set of routines (to
> grab from your client cache rather than a database).

Yes.  I completely agree with that.

>
> However, I am perfectly happy writing a client-side cache or using temp
> tables to store a result set. I also don't care that much if someone chooses
> PostgreSQL for their website (unless I'm responsible for its success in some
> way :) ). That's me personally; if you want to attract more users from MySQL,
> "iscachable" is very likely an attractive feature.

And/or provide language bindings to a caching library which greatly
helps facilitate this.  Granted, the "iscachable" concept is certainly a
powerful one.

Greg


Re: Again, sorry, caching, (Tom What do you think: function

From
Greg Copeland
Date:
On Tue, 2002-03-19 at 07:46, mlw wrote:
[snip]

> Right now, the function manager can only return one value, or one set of values
> for a column. It should be possible, though it would require a lot of research,
> to enable the function manager to return a set of rows. If we could get that
> working, it could be fairly trivial to implement a cache as a contrib project.
> It would work something like this:
>
> select querycache('select * from mytable where foo = ''bar''');

Interesting concept...but how would you know when the cache has become
dirty?  That would give you a set of rows...but I don't understand what
would let you know your result set is invalid?

Perhaps: select querycache('foobar_event', 'select * from mytable where
foo = ''bar''') would automatically create a LISTEN for you??

>
> This does two things that I would like to see: the ability to cache subselects
> independent of the full query, and the ability to control which queries get
> cached.
>
> If we can get this row functionality into the function manager for 7.3, we
> could then implement MANY MANY enterprise-level functionalities (remote
> queries, query caching, external tables, etc.) as contrib projects rather than
> full-blown modifications to PostgreSQL.

Correct me if I'm wrong, but this concept would also be applicable to
some clustering/distributed query needs too (is that what you meant by
remote queries)?

Greg


Re: Again, sorry, caching, (Tom What do you think: function

From
Greg Copeland
Date:
On Tue, 2002-03-19 at 07:46, mlw wrote:
> I was thinking about this. There seems to be a consensus that caching means no
> ACID compliance, and everyone seems to think it needs to be limited. We could
> implement a non-ACID cache as a contrib function with some work on the
> function manager.

Until now, I hadn't really thought about it... I just took it for
granted since it was asserted... however, what isn't ACID about the
approach that I offered?

A - Not affected... it's read only and provided directly from the
database, thus it's still a function of the database.  Any change
resulting from atomic operations is notified to the cache, whereby it is
repopulated.
C - Not affected... the database is still responsible for keeping
consistency.  The cache is still read only.  State is ensured, as
invalidation is notified by the database, and the data set should be
returned consistent by the database or the database is broken.
I - Again, the database is still performing this task and notifies the
cache when updates need to take place.  Again, isolation isn't an issue
because the cache is still read-only.
D - Durability isn't a question either as, again, the database is still
doing this.  In the event of cache failure, it would be repopulated
from the database... so it would be as durable as the database.

Please help me understand.

Thanks,
Greg


Re: Again, sorry, caching.

From
Doug McNaught
Date:
Jeff Davis <list-pgsql-hackers@dynworks.com> writes:

> A library implies that the application is running long enough to actually
> hear the notification. Web apps start up, read from the database, and before
> any cache is needed they're done, and the next one starts up, reading again
> from the database. Only currently open connections receive the notification.

If your web app works this way then you already don't care about
performance.  People doing scalable web apps these days use connection
pooling and session data kept in memory, so you already have a
persistent layer running (whether it's your JVM, an Apache process for
mod_perl or PHP, or whatever).  Really big apps definitely have a
long-running daemon process that handles caching, session management
(so you can have multiple webservers), etc. etc...

-Doug
-- 
Doug McNaught       Wireboard Industries      http://www.wireboard.com/
     Custom software development, systems and network consulting.
     Java PostgreSQL Enhydra Python Zope Perl Apache Linux BSD...
 


Re: Again, sorry, caching, (Tom What do you think: function

From
Gavin Sherry
Date:
On 19 Mar 2002, Greg Copeland wrote:

> On Tue, 2002-03-19 at 07:46, mlw wrote:
> [snip]
> 
> > Right now, the function manager can only return one value, or one set of values
> > for a column. It should be possible, though it would require a lot of research,
> > to enable the function manager to return a set of rows. If we could get that
> > working, it could be fairly trivial to implement a cache as a contrib project.
> > It would work something like this:
> >
> > select querycache('select * from mytable where foo = ''bar''');
> 
> Interesting concept...but how would you know when the cache has become
> dirty?  That would give you a set of rows...but I don't understand what
> would let you know your result set is invalid?
> 
> Perhaps: select querycache('foobar_event', 'select * from mytable where
> foo = ''bar''') would automatically create a LISTEN for you??


Personally, I think this method of providing query caching is very
messy. Why not just implement this alongside the system relation
cache? This may be slightly more time consuming but it will perform
better and will be able to take advantage of Postgres's current MVCC.

There would be three points at which the cache would be interacted with:

1) add a new result set

ExecRetrieve() would need to be modified to handle a
prepare-for-cache-update kind of feature. This would involve adding the
tuple table slot data into a linked list.

At the end of processing/transaction, and if the query was successful, the
prepare-for-cache-update list could be processed by AtCommit_Cache()
(called from CommitTransaction()) and the shared cache updated.

2) attempt to get result set from cache

Before planning in postgres.c, test whether the query will produce an
already cached result set. If so, send the data directly from the cache.

3) modification of underlying heap

Like (1), produce a list inside the executor (ExecAppend(), ExecDelete(),
ExecReplace() -> RelationInvalidateHeapTuple() ->
PrepareForTupleInvalidation()) which gets processed by
AtEOXactInvalidationMessages(). This results in the affected entries being
purged.

---

I'm not sure that cached results are a direction postgres needs to move
in. But if it does, I think this is a better way to do it (given that I
may have overlooked something) than modifying the function manager (argh!).

Gavin



Re: Again, sorry, caching.

From
Michael Alan Dorman
Date:
Neil Conway <nconway@klamath.dyndns.org> writes:
> My impression (I could be wrong) is that LISTEN/NOTIFY doesn't get
> the press that it deserves. If this model isn't widely used because
> of some deficiencies in the LISTEN/NOTIFY implementation, IMHO our
> time would be better spent fixing those problems than implementing
> the proposed caching scheme.

I would have to say I think a large part of the problem is lack of
press---I've been hanging around pgsql-hackers for two or three years
now, and until all this discussion, had never heard of LISTEN/NOTIFY.

That doesn't mean I didn't miss prior mentions, but it certainly
doesn't seem to come up often or get a lot of discussion when it does.

Now that I know about it, well, it looks like it would be trivial to
use it to implement cache invalidation logic in my HTML::Mason-based
application---I need only have a long-lived process running on the web
server(s) that uses the perl Pg interface, sits listening, and, when it
sees notifies on given conditions, flushes the appropriate local
caches.
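
The database side of that looks simple enough; something like this, I'd
guess (table and condition names made up):

CREATE RULE foo_ins AS ON INSERT TO foo DO NOTIFY foo_changed;
CREATE RULE foo_upd AS ON UPDATE TO foo DO NOTIFY foo_changed;
CREATE RULE foo_del AS ON DELETE TO foo DO NOTIFY foo_changed;

and the long-lived process just issues LISTEN foo_changed once, then
flushes the page cache whenever that notification comes through.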

I'd actually been contemplating cramming logic to do this down into
the library I use for implementing the system logic, but had resisted
doing it because it would make the library too tied to the web---this
would be much easier.

I won't even have to re-grab the results from the DB and reformat and
all that crap, I can just spew the output from the last time the page
was assembled---sounds better to me than what MySQL provides.  Of
course, I get a lot of this for free as a result of the tools I'm
using, but surely this sort of thing shouldn't be all that hard to
implement in other systems.

Mike.


Re: Again, sorry, caching, (Tom What do you think: function

From
mlw
Date:
Gavin Sherry wrote:
> I'm not sure that cached results is a direction postgres need move in. But
> if it does, I think this a better way to do it (given that I may have
> overlooked something) than modifying the function manager (argh!).

I actually had an ulterior motive.

Your comment about caching not being a direction in which PostgreSQL needs to
move says it all. The general rank and file seems to agree. I think caching
could speed up a number of things, certainly some of the stuff I have been
working on. I think it is more likely that some sort of caching will come from
a contrib project than that the core team will be swayed.

IMHO modifying the function manager to allow the return of a full row, and a
"set of" full rows, answers a LOT of issues I have seen over the years with
PostgreSQL extensibility.

With a full row function API we can implement:

(1) Remote Queries
select remotequery(hostname, port, 'select * from foo');

(2) External queries
select mysqlquery(hostname, port, 'select * from foo');

(3) Cached queries
select cachedquery('select * from foo');

(4) Full text search
select ftssquery(hostname, port, 'word1 and word2 and word3 not word4');

Again, with full row functions, we could prototype/implement many advanced
features in PostgreSQL as contrib projects.


Re: Again, sorry, caching.

From
"Ross J. Reedstrom"
Date:
On Mon, Mar 18, 2002 at 09:35:51PM -0500, Neil Conway wrote:
> > 
> > "It is an application issue"
> > This is completely wrong. Caching can not be done against a database without
> > knowledge of the database, i.e. when the data changes.
> 
> But can't this be achieved by using a LISTEN/NOTIFY model, with
> user-created rules to NOTIFY the appropriate listener when a table
> changes? With a good notification scheme like this, you don't need to
> continually poll the DB for changes. You don't need to teach your cache
> a lot of things about the database, since most of that knowledge is
> encapsulated inside the rules, and supporting tables.
> 
> My impression (I could be wrong) is that LISTEN/NOTIFY doesn't get the
> press that it deserves. If this model isn't widely used because of some 
> deficiencies in the LISTEN/NOTIFY implementation, IMHO our time would be
> better spent fixing those problems than implementing the proposed
> caching scheme.
> 
> If we're looking to provide a "quick and easy" caching scheme for users
> attracted to MySQL's query cache, why not provide this functionality
> through another application? I'm thinking about a generic "caching
> layer" that would sit in between Postgres and the database client. It
> could speak the FE/BE protocol as necessary; it would use LISTEN/NOTIFY
> to allow it to efficiently be aware of database changes; it would create
> the necessary rules for the user, providing a simple interface to
> enabling query caching for a table or a set of tables?
> 
> What does everyone think?

Neil, this sounds like exactly the approach to follow up on: the one part
of caching that _is_ the backend's domain is knowing about invalidation
events.  And LISTEN/NOTIFY has _exactly_ the right behavior for that -
you don't get out-of-transaction NOTIFYs, for example. As it stands,
the application developer has to have intimate knowledge of the schema
to set up the correct NOTIFY triggers for any given query. This works
against developing a generic middleware solution, since one would have
to parse the SQL to guess at the affected tables.
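
For example, to safely cache the results of

SELECT o.id, c.name FROM orders o, customers c WHERE o.cust = c.id

the developer has to know to hang notification rules off _both_ tables
(hypothetical names, of course):

CREATE RULE orders_upd AS ON UPDATE TO orders DO NOTIFY orders_page;
CREATE RULE customers_upd AS ON UPDATE TO customers DO NOTIFY orders_page;

(and likewise for INSERT and DELETE) - which is exactly the schema
knowledge generic middleware doesn't have.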

How about an extension that autocreates INSERT/UPDATE/DELETE triggers
that send NOTIFYs, based on all tables accessed by a given SELECT? As
an example, I could see extending the Zope PsycoPG database adaptor
(which already tries some simple timeout-based caching) to tack on
something like:

SELECT foo,bar FROM baz CACHENOTIFY <notifyname>

whenever it creates a cache for a given query, then setting up the correct
LISTEN to invalidate that cache. Alternatively, the LISTEN could be
automatic. The name might be autogenerated, as well, to avoid collision
problems. Or perhaps _allow_ collisions to extend the notification
set? (I could see setting _all_ the queries that generate one web page
to NOTIFY together, since the entire page needs to be regenerated on cache
invalidation.)

Then, the existing interface to SQL queries would allow the app developer
to set logical caching policies for each query, independently. The backend
does only the part that it alone can do: determine all the tables touched
by a given query. The middleware and app developer are then free to cache
at the appropriate level (SQL result set, fully formatted web page, etc.)
This clearly is only useful in a connection pooling environment,
so the long lived backends are around to receive the NOTIFYs. Hmm, no,
I think it would be possible with this to have a separate process do
the LISTEN and cache invalidation, while a pool of other backends are
used for general access, no?

Seems like a win all around. Anyone else have comments? How insane
would the auto trigger creation get? It seems to me that this would be
similar in spirit to the referential integrity work, but more dynamic,
since simple SELECTs would be creating backend triggers. Potential for
DOS attacks, for example, but not much worse, I suppose, than firing off
big nasty Cartesian cross product queries.

Ross


Re: Again, sorry, caching, (Tom What do you think: function

From
Jan Wieck
Date:
mlw wrote:
> [...]
>
> IMHO modifying the function manager to allow the return of a full row, and a
> "set of" full rows, answers a LOT of issues I have seen over the years with
> PostgreSQL extensibility.
    Sure.  Actually I think you'll have an easy project with this
    one, because all the work has been done by Tom already.

    The function manager isn't the problem any more. It is that
    you cannot have such a "set of" function in the rangetable.
    So you have no mechanism to USE the result.


Jan

--

#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.                                  #
#================================================== JanWieck@Yahoo.com #






Re: Again, sorry, caching, (Tom What do you think: function

From
mlw
Date:
Jan Wieck wrote:
> 
> mlw wrote:
> > [...]
> >
> > IMHO modifying the function manager to allow the return of a full row, and a
> > "set of" full rows, answers a LOT of issues I have seen over the years with
> > PostgreSQL extensibility.
> 
>     Sure.  Actually I think you'll have an easy project with this
>     one, because all the work has been done by Tom already.
> 
>     The function manager isn't the problem any more. It  is  that
>     you  cannot  have such a "set of" function in the rangetable.
>     So you have no mechanism to USE the result.

I'm not sure I follow you. OK, maybe I identified the wrong portion of code. 

The idea is that the first return value could return an array of varlenas, one
for each column, then a set of varlenas, one for each column.

Is there a way to return this to PostgreSQL?


Re: Again, sorry, caching.

From
F Harvell
Date:
On Tue, 19 Mar 2002 12:12:52 CST, "Ross J. Reedstrom" wrote:
> On Mon, Mar 18, 2002 at 09:35:51PM -0500, Neil Conway wrote:
> > > 
> > > "It is an application issue"
> > > This is completely wrong. Caching can not be done against a database without
> > > knowledge of the database, i.e. when the data changes.
> > ...
> > 
> > If we're looking to provide a "quick and easy" caching scheme for users
> > attracted to MySQL's query cache, why not provide this functionality
> > through another application?
> > ...
> > 
> > What does everyone think?
> 
> Neil, this sounds like exactly the approach to follow up on: 
> ...
> 
> Seems like a win all around. Anyone else have comments?
> ...
I'm not certain of the full direction of the thinking here; however, it
seems to me that there are a few considerations that I would like to
see/keep in mind:

I feel that the caching should be SQL transparent.  If it is
implemented reasonably well, the performance gain should be pretty
much universal.  Yes, a large number of queries would never be called
again; however, the results still need to be fetched into memory, and
"caching" them for later reuse should involve little more than a
skipped free (think filesystem cache).  It makes more sense to specify
"non-cacheable" in a query for tuning than "cacheable".  This also
means that just switching databases to PostgreSQL improves my
performance.
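
That is, I would rather write something like

SELECT * FROM parts WHERE id = 42 NOCACHE

(hypothetical syntax) for the exceptional query than have to mark
every cacheable query by hand.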

Also, it is very important that the caching should be transparent to
the application.  This means that the application should be able to
connect to the database using the "standard" application interface
(i.e., ODBC, PHP, Perl/DBI, etc.).  This allows me to port my existing
Oracle/DB2/MySQL/etc. application to pgsql through a normal porting effort.  If
I have to implement a non-standard interface, I can likely gain even
more performance through custom code and maintain reasonable database
independence.

While I am a strong believer in PostgreSQL, many of my customers have
other demands/requirements.  I still want to be able to use my
existing code and libraries when using their database.  Sticking with
the "standards" allows me to develop best of class applications and
interface to best of class databases.  It also allows others to easily
realize the value of PostgreSQL.

Thanks,
F Harvell





Re: Again, sorry, caching.

From
Neil Conway
Date:
On Tue, 2002-03-19 at 19:20, F Harvell wrote:
> I feel that the caching should be SQL transparent.  If it is
> implemented reasonably well, the performance gain should be pretty
> much universal.

Well, the simple query cache scheme that is currently being proposed
would use a byte-by-byte comparison of the incoming query. I think the
consensus is that for a lot of workloads, this would be a bad idea.

Cheers,

Neil

-- 
Neil Conway <neilconway@rogers.com>
PGP Key ID: DB3C29FC



Re: Again, sorry, caching.

From
mlw
Date:
Neil Conway wrote:
> 
> On Tue, 2002-03-19 at 19:20, F Harvell wrote:
> > I feel that the caching should be SQL transparent.  If it is
> > implemented reasonably well, the performance gain should be pretty
> > much universal.
> 
> Well, the simple query cache scheme that is currently being proposed
> would use a byte-by-byte comparison of the incoming query. I think the
> consensus is that for a lot of workloads, this would be a bad idea.

And this is what I have been trying to argue. Many SQL deployments execute a
set of hard-coded queries as the majority of the workload. The dynamic
queries, obviously, will not be cached, but the vast majority of work will come
out of the cache.


Re: Again, sorry, caching.

From
"Christopher Kings-Lynne"
Date:
> > Well, the simple query cache scheme that is currently being proposed
> > would use a byte-by-byte comparison of the incoming query. I think the
> > consensus is that for a lot of workloads, this would be a bad idea.
>
> And this is what I have been trying to argue. Many SQL
> deployments execute a
> set of hard coded queries as the majority of the work load. The dynamic
> queries, obviously, will not be cached, but the vast majority of
> work will come
> out of the cache.

I feel like jumping into the fray!

I think first we need to decide on some facts:

1. Implementing a byte-compatible query cache WILL improve the speed of
repetitive queries over static data.

2. This can be incredibly useful for some web applications.

3. It is really hard to implement such a cache whilst keeping postgres
maintainable and ACID compliant.

4. An application layer cache can be smarter and faster than a database
layer cache, and this is currently the standard way of doing things.  MySQL
is bringing db layer caches to the mainstream.  In a few years' time,
everyone might be doing it...

5. The main developers, or in fact the people with the ability to implement
such a thing, either won't do it or can't be stuffed doing it...

6. Implementing prepared statements in postgres is a reasonable, valid and
standard addition that will improve performance all over the place.  This
might also lead to "prepared views" - another performance improvement.

7. Improving the buffer manager's LRU policy can reduce the problem of a seq.
scan wiping out the cache.

So, given the above it seems to me that:

1. The main developers are more interested in implementing prepared
statements - which is cool, as this is a good performance improvement.

2. The main developers can look at replacing LRU to further improve cache
use.

3. We agree that such a query cache can be useful in some circumstances and
could help postgres's performance in certain environments, but the will
doesn't exist to implement it at the moment and it would also be difficult
and messy.  Put it on the TODO list maybe.

4. If someone happens to submit a magic patch that implements query caching
in a perfectly ACID-compliant way, then it should be considered for
inclusion.  Why the heck not?

Chris



Re: Again, sorry, caching.

From
Neil Conway
Date:
On Tue, 2002-03-19 at 21:40, Christopher Kings-Lynne wrote:
> 1. Implementing a byte-compatible query cache WILL improve the speed of
> repetitive queries over static data.

For some specific workloads, I think it has the potential to
significantly improve performance.

> 3. It is really hard to implement such a cache whilst keeping postgres
> maintainable and ACID compliant.

If we only consider implementations within Postgres itself, this is
probably true. However, I haven't seen anyone comment that there are
ACID-related concerns with the NOTIFY/LISTEN scheme that has been
suggested (and therefore, with the middle-tier caching daemon I
proposed).

> 5. The main developers, or in fact the people with the ability to implement
> such a thing, either won't do it or can't be stuffed doing it...

I don't think it's a particularly good idea to implement the query cache
within the database itself. As for the middle-tier caching daemon I
suggested, I'm working on a design but I haven't begun implementation
yet.

> 3. We agree that such a query cache can be useful in some circumstances and
> could help postgres's performance in certain environments, but the will
> doesn't exist to implement it at the moment and it would also be difficult
> and messy.  Put it on the TODO list maybe.

I agree that a query cache implemented inside Postgres proper would be
messy and of dubious value, but I haven't heard of any show-stoppers WRT
my proposal (of course, if anyone knows of one, please speak up).

Cheers,

Neil
-- 
Neil Conway <neilconway@rogers.com>
PGP Key ID: DB3C29FC



Re: Again, sorry, caching.

From
"Ross J. Reedstrom"
Date:
On Tue, Mar 19, 2002 at 08:28:19PM -0500, Neil Conway wrote:
> On Tue, 2002-03-19 at 19:20, F Harvell wrote:
> > I feel that the caching should be SQL transparent.  If it is
> > implemented reasonably well, the performance gain should be pretty
> > much universal.
> 
> Well, the simple query cache scheme that is currently being proposed
> would use a byte-by-byte comparison of the incoming query. I think the
> consensus is that for a lot of workloads, this would be a bad idea.

Apparently, no one actually read _my_ proposal, they just replied to it.
All the arguments about whether this kind of cache is any good have been
thrashed out up-thread. Apparently Mr. Harvell didn't feel the need
to go back and read them. Going over them again is not productive -
the next step is to see if there is anything to actually _code_ here.

Caching is a hard problem, once you start digging into it. Going from
no cache to some cache is (almost) always a win, but multiple caches in
a datapath can interact in non-intuitive ways. And we _already_ have
several, well tuned, clever caches in place. Anything that messes with
them is going to be rejected, for sure. 

What I proposed is a sort of compromise: it is clear to me that the core
developers are not very interested in the kind of cache Neil is talking
about above, and would rather see query caching done in the app. What I
proposed is extending the backends support for client-side caching, to
make it easier (or possible) for middleware to automate the task.

The bare bones are: flag a query in some way so the backend auto-generates
the appropriate NOTIFY triggers, so the middleware can do proper cache
maintenance by LISTENing.

I think I'll go away and write up my compromise proposal a little more
clearly, and post it under a new subject, later. Perhaps we can then
have a productive discussion about _it_, and not rehash old arguments.

Ross

P.S. 

HACKER sociological opinion below - feel free to skip - 

There are only three reasons to discuss features on HACKERS: to
see if a proposed feature would be rejected, so you don't waste time
implementing it; to refine a proposed implementation, so it doesn't have
to be reworked; and to discuss an actual in-hand implementation. Notice
that there's no way to skip step one: if the CVS committers don't like
the feature, arguing for it on HACKERS won't make it magically better:
providing an implementation that doesn't do bad things _might_. And you
can always maintain an independent patch, or fork.

So, we have a number of people who think a query cache would be a
good idea.  And core developers who are not convinced.  I think one
of the reasons is that, while it might be useful in some situations
(even fairly common situations) it's neither elegant nor flexible. The
PostgreSQL project has a long tradition of turning down narrow, 'good
enough - it works for me' solutions, while looking for a better, more
inclusive solution. Sometimes this has been a problem with missing
features, but in the long run, it's been a win.


Re: Again, sorry, caching, (Tom What do you think: function

From
Jan Wieck
Date:
mlw wrote:
> Jan Wieck wrote:
> >
> > mlw wrote:
> > > [...]
> > >
> > > IMHO modifying the function manager to allow the return of a full row, and a
> > > "set of" full rows, answers a LOT of issues I have seen over the years with
> > > PostgreSQL extensibility.
> >
> >     Sure.  Actually I think you'll have an easy project with this
> >     one, because all the work has been done by Tom already.
> >
> >     The function manager isn't the problem any more. It  is  that
> >     you  cannot  have such a "set of" function in the rangetable.
> >     So you have no mechanism to USE the result.
>
> I'm not sure I follow you. OK, maybe I identified the wrong portion of code.
>
> The idea is that the first return value could return an array of varlenas, one
> for each column, then a set of varlenas, one for each column.
>
> Is there a way to return this to PostgreSQL?
    There is a way to return anything. The problem in PostgreSQL
    is to actually USE it.

    Our idea originally was to extend the capabilities of a
    rangetable entry.  Currently, rangetable entries can only
    hold a relation, which is a table or a view. After rewriting,
    they are down to real tables only.

    But basically, a rangetable entry should just be a row-
    source, so that a function returning a row-set could occur
    there too.

    In order to avoid multiple calls to the function because of
    nestloops and the like, I think when a set function occurs in
    a RTE, its result should be dumped into a sort-tape and that
    is used as the row source in the rest of the plan.
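
    Hypothetically, one could then write something like

        SELECT * FROM setfunc('args') AS t (a integer, b text)

    with the function's result standing in the FROM clause just
    like a table would (function name and column list made up,
    of course).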


Jan

--

#======================================================================#
# It's easier to get forgiveness for being wrong than for being right. #
# Let's break this rule - forgive me.                                  #
#================================================== JanWieck@Yahoo.com #






Procedures returning row sources

From
"Christopher Kings-Lynne"
Date:
Jan wrote:

>     There  is a way to return anything. The problem in PostgreSQL
>     is to actually USE it.
>
>     Our idea originally was  to  extend  the  capabilities  of  a
>     rangetable  entry.   Currently,  rangetable  entries can only
>     hold a relation, which is a table or a view. After rewriting,
>     they are down to real tables only.
>
>     But  basically,  a  rangetable  entry  should  just be a row-
>     source, so that a function returning a  row-set  could  occur
>     there too.
>
>     In  order  to avoid multiple calls to the function because of
>     nestloops and the like, I think when a set function occurs in
>     a RTE, its result should be dumped into a sort-tape and that
>     is used as the row source in the rest of the plan.

Hmmm...now that my SET NOT NULL patch is on the list, I'm thinking about
what to tackle next.  This is something that would be incredibly useful to
me, but sounds pretty difficult (for someone unfamiliar with the code).

So, some questions:

1. Can someone give me some pointers as to whereabouts I should look in the
source code, and what I should be looking for, given that I've never played
in the rewriter/executor before?

2. Maybe a general plan-of-attack?  ie. What things would need to be changed
and what order should I change them in...

3. Tell me it's worth me spending time on this - that it's not something a
main developer could just code up in an evening?

4. What stuff has Tom done that should make it 'easy'?

Cheers,

Chris

(Sick of returning arrays and comma delimited lists from functions!)



Re: Again, sorry, caching.

From
Bruce Momjian
Date:
Greg Copeland wrote:
> At this point in time, I think we've both pretty well beat this topic
> up.  Obviously there are two primary ways of viewing the situation.  I
> don't think anyone is saying it's a bad idea...I think everyone is
> saying that it's easier to address elsewhere and that overall, the net
> returns may be at the expense of some other work loads.  So, unless
> there are new pearls to be shared and gleaned, I think the topic's been
> fairly well addressed.  Does more need to be said?

With a PREPARE/EXECUTE patch now out for approval, can I assume we will
go with that first and see how far it gets us, and then revisit the idea
of cached results?  In this case, we are caching the query plan.  The
query still executes again in the executor, so the data is always fresh.
In a sense, the buffer cache and disk are the caches, which don't need
separate invalidation if some data changes in the table.
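
For example (roughly the syntax in the patch, if I remember it right):

PREPARE getfoo (text) AS SELECT * FROM mytable WHERE foo = $1;
EXECUTE getfoo('bar');

Only the parse/plan work is saved; the EXECUTE still runs the plan
against live data.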

The plan can become invalid if it calls a non-cachable function, if the
schema changes, if the constants used to generate the plan in the
optimizer would yield a different plan than the constants used in a
later query, or if the analyze statistics change.

The MVCC ramifications of cached results and invalidation could be quite
complex.  The commit of a transaction could change tuple visibility
rules even if the data modify statement was executed much earlier in the
transaction.

Also, on the NOTIFY/trigger idea, triggers are called on statement end,
not transaction end, so if an UPDATE query is in a multi-statement
transaction, another backend looking for the NOTIFY will receive it
before the transaction commits, meaning it will not see the update. 
That seems like a problem.  We do have deferrable constraints which will
only do checking on transaction end, but I am not sure if that can be
used for NOTIFY on transaction commit.  Anyone?

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026


Re: Again, sorry, caching.

From
Tom Lane
Date:
Bruce Momjian <pgman@candle.pha.pa.us> writes:
> Also, on the NOTIFY/trigger idea, triggers are called on statement end,
> not transaction end, so if an UPDATE query is in a multi-statement
> transaction, another backend looking for the NOTIFY will receive it
> before the transaction commits, meaning it will not see the update. 

No it won't.
        regards, tom lane


Re: Again, sorry, caching.

From
Bruce Momjian
Date:
Tom Lane wrote:
> Bruce Momjian <pgman@candle.pha.pa.us> writes:
> > Also, on the NOTIFY/trigger idea, triggers are called on statement end,
> > not transaction end, so if an UPDATE query is in a multi-statement
> > transaction, another backend looking for the NOTIFY will receive it
> > before the transaction commits, meaning it will not see the update. 
> 
> No it won't.

Is this because NOTIFY is held for transaction end or because the
triggers are held until transaction commit?

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026


Re: Again, sorry, caching.

From
Tom Lane
Date:
Bruce Momjian <pgman@candle.pha.pa.us> writes:
>> No it won't.

> Is this because NOTIFY is held for transaction end or because the
> triggers are held until transaction commit?

The former.
        regards, tom lane


Re: Again, sorry, caching.

From
Bruce Momjian
Date:
Tom Lane wrote:
> Bruce Momjian <pgman@candle.pha.pa.us> writes:
> >> No it won't.
> 
> > Is this because NOTIFY is held for transaction end or because the
> > triggers are held until transaction commit?
> 
> The former.

Thanks.  I see it in the NOTIFY manual page now.  Very nice.
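
So, if I am reading it right, the behavior is:

-- session 1
LISTEN mytable;

-- session 2
BEGIN;
UPDATE mytable SET col = 'x';
NOTIFY mytable;   -- queued, not delivered yet
COMMIT;           -- session 1 gets the notification only now

meaning a listener can never act on a notification for data it cannot
yet see.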

--
  Bruce Momjian                        |  http://candle.pha.pa.us
  pgman@candle.pha.pa.us               |  (610) 853-3000
  +  If your life is a hard drive,     |  830 Blythe Avenue
  +  Christ can be your backup.        |  Drexel Hill, Pennsylvania 19026